Probability Two Books
Alexander Sokol
Anders Rønn-Nielsen
ISBN 978-87-7078-999-8
Contents

Preface

3 Weak convergence
3.1 Weak convergence and convergence of measures
3.2 Weak convergence and distribution functions
3.3 Weak convergence and convergence in probability
3.4 Weak convergence and characteristic functions
3.5 Central limit theorems
3.6 Asymptotic normality
3.7 Higher dimensions
3.8 Exercises

5 Martingales
5.1 Introduction to martingale theory
5.2 Martingales and stopping times
5.3 The martingale convergence theorem
5.4 Martingales and uniform integrability
5.5 The martingale central limit theorem
5.6 Exercises

Bibliography
Preface
We have endeavoured throughout to present the material in a logical fashion, with detailed
proofs allowing the reader to perceive not only the big picture of the theory, but also to
understand the finer elements of the methods of proof used. Exercises are given at the
end of each chapter, with hints for the exercises given in the appendix. The exercises form
an important part of the monograph. We strongly recommend that any reader wishing to
acquire a sound understanding of the theory spends considerable time solving the exercises.
While we share the responsibility for the ultimate content of the monograph and in partic-
ular any mistakes therein, much of the material is based on books and lecture notes from
other individuals, in particular “Videregående sandsynlighedsregning” by Martin Jacobsen,
the lecture notes on weak convergence by Søren Tolver Jensen, “Sandsynlighedsregning på
Målteoretisk Grundlag” by Ernst Hansen as well as supplementary notes by Ernst Hansen,
in particular a note on the martingale central limit theorem. We are also indebted to Ketil
Biering Tvermosegaard, who diligently translated the lecture notes by Martin Jacobsen and
thus eased the migration of their contents to their present form in this monograph.
We would like to express our gratitude to our own teachers, particularly Ernst Hansen, Martin
Jacobsen and Søren Tolver Jensen, who taught us measure theory and probability theory.
Also, many warm thanks go to Henrik Nygaard Jensen, who meticulously read large parts of
the manuscript and gave many useful comments.
Alexander Sokol
Anders Rønn-Nielsen
København, August 2012
Since the previous edition of the book, a number of misprints and errors have been cor-
rected, and various other minor amendments have been made. We are grateful to the many
students who have contributed to the monograph by identifying mistakes and suggesting
improvements.
Alexander Sokol
Anders Rønn-Nielsen
København, June 2013
Chapter 1

Sequences of random variables
In this chapter, we will consider sequences of random variables and the basic results on such sequences, in particular the strong law of large numbers, which formalizes the intuitive notion that averages of independent and identically distributed random variables tend to the common mean.
We begin in Section 1.1 by reviewing the measure-theoretic preliminaries for our later results.
In Section 1.2, we discuss modes of convergence for sequences of random variables. The results
given in this section are fundamental to much of the remainder of this monograph, as well
as modern probability in general. In Section 1.3, we discuss the concept of independence
for families of σ-algebras, and as an application, we prove the Kolmogorov zero-one law,
which shows that for sequences of independent variables, events which, colloquially speaking,
depend only on the tail of the sequence either have probability zero or one. In Section 1.4,
we apply the results of the previous sections to prove criteria for the convergence of sums
of independent variables. Finally, in Section 1.5, we prove the strong law of large numbers,
arguably the most important result of this chapter.
As noted in the introduction, we assume a level of familiarity with basic real analysis and measure theory. Some of the main results assumed to be well-known in the following are reviewed in Appendix A. In this section, we give an independent review of some basic concepts from measure theory.
Next, assume given a measurable space (Ω, F), and let H be a set of subsets of Ω. We may then form the set A of all σ-algebras on Ω containing H; this is a subset of the power set of the power set of Ω. We may then define σ(H) = ∩_{F∈A} F, the intersection of all σ-algebras in A, that is, the intersection of all σ-algebras containing H. This is a σ-algebra as well, and it is the smallest σ-algebra on Ω containing H, in the sense that for any σ-algebra G containing H, we have G ∈ A and therefore σ(H) = ∩_{F∈A} F ⊆ G. We refer to σ(H) as the σ-algebra generated by H, and we say that H is a generating family for σ(H).
Using this construction, we may define a particular σ-algebra on the Euclidean spaces: The
Borel σ-algebra Bd on Rd for d ≥ 1 is the smallest σ-algebra containing all open sets in Rd .
We denote the Borel σ-algebra on R by B.
Next, let (Fn )n≥1 be a sequence of sets in F. If Fn ⊆ Fn+1 for all n ≥ 1, we say that (Fn )n≥1
is increasing. If Fn ⊇ Fn+1 for all n ≥ 1, we say that (Fn )n≥1 is decreasing. Assume that
D is a set of subsets of Ω such that the following holds: Ω ∈ D; if F, G ∈ D with F ⊆ G, then G \ F ∈ D; and if (Fn)n≥1 is an increasing sequence of sets in D, then ∪_{n=1}^∞ Fn ∈ D.
If this is the case, we say that D is a Dynkin class. Furthermore, if H is a set of subsets
of Ω such that whenever F, G ∈ H then F ∩ G ∈ H, then we say that H is stable under
finite intersections. These two concepts combine in the following useful manner, known as
Dynkin’s lemma: Let D be a Dynkin class on Ω, and H be a set of subsets of Ω which is
stable under finite intersections. If H ⊆ D, then σ(H) ⊆ D.
Dynkin’s lemma is useful when we desire to show that some property holds for all sets F ∈ F.
A consequence of Dynkin’s lemma is that if P and Q are two probability measures on F which
are equal on a generating family for F which is stable under finite intersections, then P and
Q are equal on all of F.
Assume given a probability space (Ω, F, P ). The probability measure satisfies that for any
pair of events F, G ∈ F with F ⊆ G, P(G \ F) = P(G) − P(F). Also, if (Fn) is an increasing sequence in F, then P(∪_{n=1}^∞ Fn) = lim_{n→∞} P(Fn), and if (Fn) is a decreasing sequence in F, then P(∩_{n=1}^∞ Fn) = lim_{n→∞} P(Fn). These two properties are known as the upwards and downwards continuity of probability measures, respectively.
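These two continuity properties can be checked mechanically on a finite sample space. The following Python sketch is an editorial illustration rather than part of the monograph; the uniform measure on {0, . . . , 999} and the particular sequences of sets are arbitrary choices made for concreteness.

```python
from fractions import Fraction

# Uniform probability measure on the finite sample space {0, ..., 999}.
omega = range(1000)
def P(event):
    return Fraction(len([w for w in omega if w in event]), len(omega))

# Increasing sequence F_n = {0, ..., n}: P(union of F_n) = lim P(F_n).
F = [set(range(n + 1)) for n in range(1000)]
union = set().union(*F)
assert P(union) == P(F[-1])  # limit of the increasing probabilities

# Decreasing sequence G_n = {n, ..., 999}: P(intersection of G_n) = lim P(G_n).
G = [set(range(n, 1000)) for n in range(1000)]
intersection = set(range(1000)).intersection(*G)
assert P(intersection) == P(G[-1])
```

Here the limits are attained at the last index since the sequences are finite and monotone, which is exactly what upwards and downwards continuity predict.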
Also, if (Xi )i∈I is a family of variables, we denote by σ((Xi )i∈I ) the σ-algebra generated
by (Xi )i∈I , meaning the smallest σ-algebra on Ω making Xi measurable for all i ∈ I, or
equivalently, the smallest σ-algebra containing H, where H is the class of subsets (Xi ∈ B)
for i ∈ I and B ∈ B. Also, for families of variables, we write (Xi)i∈I and (Xi) interchangeably,
understanding that the index set is implicit in the latter case.
1.2 Convergence of sequences of random variables

We are now ready to introduce sequences of random variables and consider their modes of
convergence. For the remainder of the chapter, we work within the context of a probability
space (Ω, F, P ).
Definition 1.2.1. A sequence of random variables (Xn )n≥1 is a sequence of mappings from
Ω to R such that each Xn is a random variable.
If (Xn )n≥1 is a sequence of random variables, we also refer to (Xn )n≥1 as a discrete-time
stochastic process, or simply a stochastic process. These names are interchangeable. For
brevity, we also write (Xn ) instead of (Xn )n≥1 . In Definition 1.2.1, all variables are assumed
to take values in R, in particular ruling out mappings taking the values ∞ or −∞ and ruling
out variables with values in Rd . This distinction is made solely for convenience, and if need
be, we will also refer to sequences of random variables with values in Rd or other measurable spaces as sequences of random variables.
A natural first question is when sequences of random variables exist with particular distributions. For example, does there exist a sequence of variables (Xn) such that X1, . . . , Xn are independent for all n ≥ 1 and such that for each n ≥ 1, Xn has some particular given
distribution? Such questions are important, and will be relevant for our later construction of
examples and counterexamples, but are not our main concern here. For completeness, results
which will be sufficient for our needs are given in Appendix A.3.
The following fundamental definition outlines the various modes of convergence of random
variables to be considered in the following.
Definition 1.2.2. Let (Xn ) be a sequence of random variables, and let X be some other
random variable.

(i) We say that Xn converges in probability to X if for each ε > 0, lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

(ii) We say that Xn converges almost surely to X if P(lim_{n→∞} Xn = X) = 1.

(iii) We say that Xn converges in Lp to X, where p ≥ 1, if lim_{n→∞} E|Xn − X|^p = 0.

(iv) We say that Xn converges in distribution to X if for all bounded, continuous f : R → R, lim_{n→∞} Ef(Xn) = Ef(X).

In the affirmative, we write Xn →P X, Xn →a.s. X, Xn →Lp X and Xn →D X, respectively.
Definition 1.2.2 defines four modes of convergence: Convergence in probability, almost sure
convergence, convergence in Lp and convergence in distribution. Convergence in distribution
of random variables is also known as convergence in law. Note that convergence in Lp as
given in Definition 1.2.2 is equivalent to convergence in k · kp in the seminormed vector
space Lp (Ω, F, P ), see Section A.2. In the remainder of this section, we will investigate
the connections between these modes of convergence. A first question regards almost sure
convergence. The statement that P (limn→∞ Xn = X) = 1 is to be understood as that the
set {ω ∈ Ω | Xn (ω) converges to X(ω)} has probability one. For this to make sense, it is
necessary that this set is measurable. The following lemma ensures that this is always the
case. For the proof of the lemma, we recall that for any family (Fi )i∈I of subsets of Ω, it
holds that
∩i∈I Fi = {ω ∈ Ω | ∀ i ∈ I : ω ∈ Fi } (1.1)
∪i∈I Fi = {ω ∈ Ω | ∃i ∈ I : ω ∈ Fi }, (1.2)
demonstrating the connection between set intersection and the universal quantifier and the
connection between set union and the existential quantifier.
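The correspondence between intersections and the universal quantifier, and between unions and the existential quantifier, can be verified on a finite example. The Python sketch below is an illustration of ours; the sample space and the two sets are arbitrary.

```python
Omega = set(range(10))
I = [2, 3]
F = {2: {w for w in Omega if w % 2 == 0},   # multiples of 2 in Omega
     3: {w for w in Omega if w % 3 == 0}}   # multiples of 3 in Omega

# (1.1): the intersection over i corresponds to the universal quantifier.
inter = set.intersection(*(F[i] for i in I))
assert inter == {w for w in Omega if all(w in F[i] for i in I)}

# (1.2): the union over i corresponds to the existential quantifier.
union = set.union(*(F[i] for i in I))
assert union == {w for w in Omega if any(w in F[i] for i in I)}
```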
Lemma 1.2.3. Let (Xn ) be a sequence of random variables, and let X be some other variable.
The subset F of Ω given by F = {ω ∈ Ω | Xn (ω) converges to X(ω)} is F measurable. In
particular, it holds that
F = ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xk − X| ≤ 1/m).    (1.3)
Proof. We first prove the equality (1.3), and to this end, we first show that for any sequence (xn) of real numbers and any real x, it holds that xn converges to x if and only if

∀ m ∈ N ∃ n ∈ N ∀ k ≥ n : |xk − x| ≤ 1/m.    (1.4)

Written out, convergence of xn to x means that

∀ ε > 0 ∃ n ∈ N ∀ k ≥ n : |xk − x| ≤ ε.    (1.5)

It is immediate that (1.5) implies (1.4). We prove the converse implication. Therefore, assume that (1.4) holds. Let ε > 0 be given. Pick a natural m ≥ 1 so large that 1/m ≤ ε. Using (1.4), take a natural n ≥ 1 such that for all k ≥ n, |xk − x| ≤ 1/m. It then also holds that for k ≥ n, |xk − x| ≤ ε. Therefore, (1.5) holds, and so (1.5) and (1.4) are equivalent.
Using this equivalence together with the correspondences (1.1) and (1.2), we obtain

F = ∩_{m=1}^∞ {ω ∈ Ω | ∃ n ∈ N ∀ k ≥ n : |Xk(ω) − X(ω)| ≤ 1/m}
  = ∩_{m=1}^∞ ∪_{n=1}^∞ {ω ∈ Ω | ∀ k ≥ n : |Xk(ω) − X(ω)| ≤ 1/m}
  = ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ {ω ∈ Ω | |Xk(ω) − X(ω)| ≤ 1/m}
  = ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xk − X| ≤ 1/m),
as desired. We have now proved (1.3). Next, as Xk and X both are F measurable mappings, |Xk − X| is F measurable as well, so the set (|Xk − X| ≤ 1/m) is in F. As a consequence, we obtain that ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xk − X| ≤ 1/m) is an element of F. We conclude that F ∈ F,
as desired.
Lemma 1.2.3 ensures that the definition of almost sure convergence given in Definition 1.2.2
is well-formed. A second immediate question regards convergence in probability: Does it
matter whether we consider the limit of P (|Xn − X| ≥ ε) or P (|Xn − X| > ε)? The following
lemma shows that this is not the case.
Lemma 1.2.4. Let (Xn ) be a sequence of random variables, and let X be some other variable.
It holds that Xn converges in probability to X if and only if it holds that for each ε > 0,
limn→∞ P (|Xn − X| > ε) = 0.
Proof. First assume that for each ε > 0, lim_{n→∞} P(|Xn − X| > ε) = 0. We need to show that Xn converges in probability to X, meaning that for each ε > 0, lim_{n→∞} P(|Xn − X| ≥ ε) = 0. To prove this, first fix ε > 0. We then obtain

lim sup_{n→∞} P(|Xn − X| ≥ ε) ≤ lim sup_{n→∞} P(|Xn − X| > ε/2) = 0,

as (|Xn − X| ≥ ε) ⊆ (|Xn − X| > ε/2), proving the first implication. Conversely, assume that Xn converges in probability to X and fix ε > 0. As (|Xn − X| > ε) ⊆ (|Xn − X| ≥ ε), we obtain lim_{n→∞} P(|Xn − X| > ε) = 0, as desired.
Also, we show that limits for three of the modes of convergence considered are almost surely
unique.
Lemma 1.2.5. Let (Xn ) be a sequence of random variables and let X and Y be two other
variables. Assume that Xn converges both to X and to Y in probability, almost surely or in
Lp for some p ≥ 1. Then X and Y are almost surely equal.
Proof. First assume that Xn →P X and Xn →P Y. Fix ε > 0. Note that if |X − Xn| ≤ ε/2 and |Xn − Y| ≤ ε/2, we have |X − Y| ≤ ε. Therefore, we also find that |X − Y| > ε implies that |X − Xn| > ε/2 or |Xn − Y| > ε/2, so that

P(|X − Y| > ε) ≤ P(|X − Xn| > ε/2) + P(|Xn − Y| > ε/2),

which tends to zero as n tends to infinity. Hence P(|X − Y| > ε) = 0 for all ε > 0. As (X ≠ Y) = ∪_{m=1}^∞ (|X − Y| > 1/m), upwards continuity yields P(X ≠ Y) = 0, so X and Y are almost surely equal.
In the case where Xn →a.s. X and Xn →a.s. Y, the result follows since limits in R are unique. If Xn →Lp X and Xn →Lp Y, we obtain ‖X − Y‖_p ≤ lim sup_{n→∞} (‖X − Xn‖_p + ‖Xn − Y‖_p) = 0,
so E|X − Y |p = 0, yielding that X and Y are almost surely equal. Here, k · kp denotes the
seminorm on Lp (Ω, F, P ).
Having settled these preliminary questions, we next consider the question of whether some
of the modes of convergence imply another mode of convergence. Before proving our basic
theorem on this, we show a few lemmas of independent interest. In the following lemma,
f (X) denotes the random variable defined by f (X)(ω) = f (X(ω)).
Lemma 1.2.6. Let (Xn ) be a sequence of random variables, and let X be some other variable.
Let f : R → R be a continuous function. If Xn converges almost surely to X, then f (Xn )
converges almost surely to f (X). If Xn converges in probability to X, then f (Xn ) converges
in probability to f (X).
Proof. We first consider the case of almost sure convergence. Assume that Xn converges
almost surely to X. As f is continuous, we find for each ω that if Xn (ω) converges to X(ω),
f(Xn(ω)) converges to f(X(ω)) as well. Therefore,

{ω ∈ Ω | Xn(ω) → X(ω)} ⊆ {ω ∈ Ω | f(Xn(ω)) → f(X(ω))},

and as the set on the left-hand side has probability one, so has the set on the right-hand side,
proving the result. Next, we turn to the more difficult case of convergence in probability.
Assume that Xn converges in probability to X, we need to prove that f (Xn ) converges in
probability to f (X). Let ε > 0, we thus need to show limn→∞ P (|f (Xn ) − f (X)| > ε) = 0.
To this end, let m ≥ 1. As [−(m + 1), m + 1] is compact, f is uniformly continuous on this
set. Choose δ > 0 corresponding to ε for this uniform continuity of f. We may assume
without loss of generality that δ ≤ 1. We then have that for x and y in [−(m + 1), m + 1],
|x − y| ≤ δ implies |f (x) − f (y)| ≤ ε. Now assume that |f (x) − f (y)| > ε. If |x − y| ≤ δ and
|x| ≤ m, we obtain x, y ∈ [−(m + 1), m + 1] and thus a contradiction with |f (x) − f (y)| > ε.
Therefore, when |f (x) − f (y)| > ε, it must either hold that |x − y| > δ or |x| > m. This
yields
lim sup_{n→∞} P(|f(Xn) − f(X)| > ε) ≤ lim sup_{n→∞} P(|Xn − X| > δ) + P(|X| > m) = P(|X| > m).

As m ≥ 1 was arbitrary and lim_{m→∞} P(|X| > m) = 0 by downwards continuity, we conclude that lim_{n→∞} P(|f(Xn) − f(X)| > ε) = 0, proving the result.
Lemma 1.2.7. Let X be a random variable, let p > 0 and let ε > 0. It then holds that P(|X| ≥ ε) ≤ ε^{−p} E|X|^p.

Proof. We simply note that P(|X| ≥ ε) = E 1_(|X|≥ε) ≤ ε^{−p} E|X|^p 1_(|X|≥ε) ≤ ε^{−p} E|X|^p, which
yields the result.
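Lemma 1.2.7, a form of Markov's inequality, in fact holds for the empirical measure of any sample as well, so a Monte Carlo check can never fail. The following Python sketch is an editorial illustration, not part of the monograph; the choice of standard exponential variables and the particular values of ε and p are arbitrary.

```python
import random

random.seed(0)
n = 100_000
# X standard exponential; check P(|X| >= eps) <= eps^{-p} E|X|^p empirically.
xs = [random.expovariate(1.0) for _ in range(n)]
for eps in (0.5, 1.0, 2.0):
    for p in (1, 2):
        lhs = sum(1 for x in xs if abs(x) >= eps) / n       # empirical P(|X| >= eps)
        rhs = sum(abs(x) ** p for x in xs) / n / eps ** p   # empirical eps^{-p} E|X|^p
        assert lhs <= rhs
```

The assertion holds for every sample because the pointwise bound 1_(|x|≥ε) ≤ |x|^p/ε^p is exactly the inequality used in the proof above.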
Theorem 1.2.8. Let (Xn ) be a sequence of random variables, and let X be some other
variable. If Xn converges in Lp to X for some p ≥ 1, or if Xn converges almost surely to
X, then Xn also converges in probability to X. If Xn converges in probability to X, then Xn
also converges in distribution to X.
Proof. We need to prove three implications. First assume that Xn converges in Lp to X for some p ≥ 1; we want to show that Xn converges in probability to X. By Lemma 1.2.7, it holds for any ε > 0 that

P(|Xn − X| ≥ ε) ≤ ε^{−p} E|Xn − X|^p,

which tends to zero, so Xn converges in probability to X.

Next, assume that Xn converges almost surely to X, and fix ε > 0. Noting that (|Xn − X| ≥ ε) ⊆ ∪_{k=n}^∞ (|Xk − X| ≥ ε) and that the sequence (∪_{k=n}^∞ (|Xk − X| ≥ ε))_{n≥1} is decreasing, we find

lim sup_{n→∞} P(|Xn − X| ≥ ε) ≤ lim_{n→∞} P(∪_{k=n}^∞ (|Xk − X| ≥ ε)) = P(∩_{n=1}^∞ ∪_{k=n}^∞ (|Xk − X| ≥ ε)),

by downwards continuity. On the set ∩_{n=1}^∞ ∪_{k=n}^∞ (|Xk − X| ≥ ε), Xn does not converge to X, so this set is a null set, and Xn converges in probability to X.

Finally, assume that Xn converges in probability to X, and let f : R → R be bounded and continuous; we want to show lim_{n→∞} Ef(Xn) = Ef(X). Let ε > 0 and let ‖f‖_∞ denote the supremum of |f|. We then have

|Ef(Xn) − Ef(X)| ≤ E|f(Xn) − f(X)| ≤ ε + 2‖f‖_∞ P(|f(Xn) − f(X)| > ε).    (1.6)

By Lemma 1.2.6, f(Xn) converges in probability to f(X). Therefore, (1.6) shows that lim sup_{n→∞} |Ef(Xn) − Ef(X)| ≤ ε. As ε > 0 was arbitrary, this allows us to conclude lim sup_{n→∞} |Ef(Xn) − Ef(X)| = 0, and as a consequence, lim_{n→∞} Ef(Xn) = Ef(X). This proves the desired convergence in distribution of Xn to X.
Theorem 1.2.8 shows that among the four modes of convergence defined in Definition 1.2.2,
convergence in Lp and almost sure convergence are the strongest, convergence in probability
is weaker than both, and convergence in distribution is weaker still. There is no general
simple relationship between convergence in Lp and almost sure convergence. Note also an
essential difference between convergence in distribution and the other three modes of convergence: While convergence in Lp, almost sure convergence and convergence in probability all
depend on the multivariate distribution of (Xn , X), convergence in distribution merely de-
pends on the marginal laws of Xn and X. For this reason, the theory for convergence in
distribution is somewhat different from the theory for the other three modes of convergence.
In the remainder of this chapter and the next, we only consider the other three modes of
convergence.
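To make the gap between convergence in probability and almost sure convergence concrete, one may consider the classical "moving blocks" sequence on [0, 1) with the Lebesgue measure; this example and the Python sketch below are editorial additions, not part of the monograph.

```python
from fractions import Fraction

# The moving blocks sequence on Omega = [0, 1): for n = 2^k + j with
# 0 <= j < 2^k, X_n is the indicator of [j / 2^k, (j + 1) / 2^k).
# The block lengths shrink, so P(X_n = 1) = 2^{-k} tends to zero and
# X_n converges to 0 in probability, yet every omega falls in a block at
# every scale k, so X_n(omega) = 1 infinitely often: no a.s. convergence.

def block(n):
    k = n.bit_length() - 1          # n = 2^k + j
    j = n - 2 ** k
    return Fraction(j, 2 ** k), Fraction(j + 1, 2 ** k)

# P(X_n = 1) is the block length, tending to zero along the sequence.
lengths = [block(n)[1] - block(n)[0] for n in (1, 2, 4, 1024)]
assert lengths == [1, Fraction(1, 2), Fraction(1, 4), Fraction(1, 1024)]

# The fixed point omega = 1/3 lies in one block at every scale k, so
# X_n(1/3) = 1 for infinitely many n.
omega = Fraction(1, 3)
hits = [n for n in range(1, 5000) if block(n)[0] <= omega < block(n)[1]]
assert len(hits) >= 10
```

Note that this particular sequence does converge to zero almost surely along the subsequence n = 2^k, which anticipates Lemma 1.2.13 below.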
Example 1.2.9. Let ξ ∈ R, let σ > 0 and let (Xn) be a sequence of random variables such that for all n ≥ 1, Xn is normally distributed with mean ξ and variance σ^2. Assume furthermore that X1, . . . , Xn are independent for all n ≥ 1. Put ξ̂n = (1/n) Σ_{k=1}^n Xk. We claim that ξ̂n converges in Lp to ξ for all p ≥ 1.
To prove this, note that by the properties of normal distributions, (1/n) Σ_{k=1}^n Xk is normally distributed with mean ξ and variance σ^2/n. Therefore, √n σ^{−1}(ξ − (1/n) Σ_{k=1}^n Xk) is standard normally distributed. With m_p denoting the p'th absolute moment of the standard normal distribution, we thus obtain

E|ξ − ξ̂n|^p = E|ξ − (1/n) Σ_{k=1}^n Xk|^p = (σ^p / n^{p/2}) E|√n σ^{−1}(ξ − (1/n) Σ_{k=1}^n Xk)|^p = σ^p m_p / n^{p/2},

which converges to zero, proving that (1/n) Σ_{k=1}^n Xk →Lp ξ for all p ≥ 1. ◦
The following lemma shows that almost sure convergence and convergence in probability
enjoy strong stability properties.
Lemma 1.2.10. Let (Xn ) and (Yn ) be sequences of random variables, and let X and Y
be two other random variables. If Xn converges in probability to X and Yn converges in
probability to Y , then Xn + Yn converges in probability to X + Y , and Xn Yn converges in
probability to XY . Also, if Xn converges almost surely to X and Yn converges almost surely
to Y , then Xn + Yn converges almost surely to X + Y , and Xn Yn converges almost surely to
XY .
Proof. We first show the claims for almost sure convergence. Assume that Xn converges
almost surely to X and that Yn converges almost surely to Y. Note that as addition and multiplication are continuous, we have that whenever Xn(ω) converges to X(ω) and Yn(ω) converges to Y(ω), it also holds that Xn(ω) + Yn(ω) converges to X(ω) + Y(ω) and Xn(ω)Yn(ω) converges to X(ω)Y(ω). Therefore,

{ω | Xn(ω) → X(ω)} ∩ {ω | Yn(ω) → Y(ω)} ⊆ {ω | Xn(ω) + Yn(ω) → X(ω) + Y(ω)} ∩ {ω | Xn(ω)Yn(ω) → X(ω)Y(ω)},

and the set on the left-hand side has probability one, since the intersection of two almost sure sets also is an almost sure set. This proves the
claims on almost sure convergence. Next, assume that Xn converges in probability to X and
that Yn converges in probability to Y . We first show that Xn + Yn converges in probability
to X + Y. Let ε > 0 be given. We then obtain

P(|Xn + Yn − (X + Y)| ≥ ε) ≤ P(|Xn − X| ≥ ε/2) + P(|Yn − Y| ≥ ε/2),

which tends to zero, proving the claim. Finally, we show that XnYn converges in probability to XY. This will follow if we show that XnYn − XY converges in probability to zero. To this end, we note the decomposition

XnYn − XY = (Xn − X)(Yn − Y) + (Xn − X)Y + (Yn − Y)X.

We consider the three terms on the right-hand side separately. First, let ε > 0. As |(Xn − X)(Yn − Y)| ≥ ε implies that |Xn − X| ≥ √ε or |Yn − Y| ≥ √ε, we obtain

P(|(Xn − X)(Yn − Y)| ≥ ε) ≤ P(|Xn − X| ≥ √ε) + P(|Yn − Y| ≥ √ε).
Taking the limit superior, we conclude lim_{n→∞} P(|(Xn − X)(Yn − Y)| ≥ ε) = 0, and thus
(Xn − X)(Yn − Y ) converges in probability to zero. Next, we show that (Xn − X)Y converges
in probability to zero. Again, let ε > 0. Consider also some m ≥ 1. We then obtain
P(|(Xn − X)Y| ≥ ε)
= P((|(Xn − X)Y| ≥ ε) ∩ (|Y| ≤ m)) + P((|(Xn − X)Y| ≥ ε) ∩ (|Y| > m))
≤ P(m|Xn − X| ≥ ε) + P(|Y| > m) = P(|Xn − X| ≥ ε/m) + P(|Y| > m).
Therefore, we obtain lim supn→∞ P (|(Xn − X)Y | ≥ ε) ≤ P (|Y | > m) for all m ≥ 1, from
which we conclude lim supn→∞ P (|(Xn − X)Y | ≥ ε) ≤ limm→∞ P (|Y | > m) = 0, by down-
wards continuity. This shows that (Xn − X)Y converges in probability to zero. By a similar
argument, we also conclude that (Yn − Y )X converges in probability to zero. Combining our
results, we conclude that Xn Yn converges in probability to XY , as desired.
Lemma 1.2.10 could also have been proven using a multidimensional version of Lemma 1.2.6
and the continuity of addition and multiplication. Our next goal is to prove another con-
nection between two of the modes of convergence, namely that convergence in probability
implies almost sure convergence along a subsequence, and use this to show completeness
properties of each of our three modes of convergence, in the sense that we wish to argue that
Cauchy sequences are convergent for each of convergence in Lp, almost sure convergence and
convergence in probability.
We begin by showing the Borel-Cantelli lemma, a general result which will be useful in several
contexts. Let (Fn) be a sequence of events. We then define

(Fn i.o.) = ∩_{n=1}^∞ ∪_{k=n}^∞ Fk    and    (Fn evt.) = ∪_{n=1}^∞ ∩_{k=n}^∞ Fk,

where "i.o." abbreviates "infinitely often" and "evt." abbreviates "eventually".
Note that ω ∈ Fn for infinitely many n if and only if for each n ≥ 1, there exists k ≥ n
such that ω ∈ Fk. Likewise, it holds that ω ∈ Fn eventually if and only if there exists n ≥ 1 such that ω ∈ Fk for all k ≥ n. By (1.1) and (1.2), the two sets defined above capture precisely these two notions.

Lemma 1.2.11 (The Borel-Cantelli lemma). Let (Fn) be a sequence of events and assume that Σ_{n=1}^∞ P(Fn) is finite. Then P(Fn i.o.) = 0.
Proof. As the sequence of sets (∪_{k=n}^∞ Fk)_{n≥1} is decreasing, we obtain by the downwards continuity of probability measures that

P(Fn i.o.) = P(∩_{n=1}^∞ ∪_{k=n}^∞ Fk) = lim_{n→∞} P(∪_{k=n}^∞ Fk) ≤ lim_{n→∞} Σ_{k=n}^∞ P(Fk) = 0,

with the final equality holding since the tail sum of a convergent series always tends to zero.
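The Borel-Cantelli lemma can be illustrated by simulation: with independent events Fn of probability 1/n^2, the expected number of events that occur is Σ 1/n^2 < π^2/6 < 1.65, so only finitely many, and in practice very few, events should ever occur. The Python sketch below is an editorial illustration, not part of the monograph.

```python
import random

random.seed(2)

# Independent events F_n with P(F_n) = 1 / n^2; the sum of P(F_n) is finite,
# so by the Borel-Cantelli lemma only finitely many F_n occur almost surely.
def occurrences(horizon=100_000):
    return [n for n in range(1, horizon + 1) if random.random() < 1.0 / n ** 2]

# Across independent runs, the number of occurring events stays small,
# consistent with an expected count below pi^2 / 6.
counts = [len(occurrences()) for _ in range(20)]
assert max(counts) < 20
```

F_1 occurs in every run since P(F_1) = 1, so each count is at least one, but the counts never grow with the horizon.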
Lemma 1.2.12. Let (Xn) be a sequence of random variables, and let X be some other variable. Assume that for all ε > 0, Σ_{n=1}^∞ P(|Xn − X| ≥ ε) is finite. Then Xn converges almost surely to X.
Proof. By Lemma 1.2.3, it suffices to show that

P(∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xk − X| ≤ 1/m)) = 1.
Fix ε > 0. By Lemma 1.2.11, we find that the set (|Xn − X| ≥ ε i.o.) has probability zero.
As (|Xn − X| < ε evt.)^c = (|Xn − X| ≥ ε i.o.), we obtain P(|Xn − X| < ε evt.) = 1. As ε > 0 was arbitrary, we in particular obtain P(|Xn − X| ≤ 1/m evt.) = 1 for all m ≥ 1. As the intersection of a countable family of almost sure events again is an almost sure event, this yields

P(∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xk − X| ≤ 1/m)) = P(∩_{m=1}^∞ (|Xn − X| ≤ 1/m evt.)) = 1,
as desired.
Lemma 1.2.13. Let (Xn ) be a sequence of random variables, and let X be some other
variable. Assume that Xn converges in probability to X. Then there is a subsequence (Xnk) converging almost surely to X.
Proof. Let (εk )k≥1 be a sequence of positive numbers decreasing to zero. For each k, it holds
that limn→∞ P (|Xn − X| ≥ εk ) = 0. In particular, for any k, n∗ ≥ 1, we may always pick
n > n∗ such that P(|Xn − X| ≥ εk) ≤ 2^{−k}. Therefore, we may recursively define a strictly increasing sequence of indices (nk)k≥1 such that for each k, P(|Xnk − X| ≥ εk) ≤ 2^{−k}. We
claim that the sequence (Xnk )k≥1 satisfies the criterion of Lemma 1.2.12. To see this, let
ε > 0. As (εk )k≥1 decreases to zero, there is m such that for k ≥ m, εk ≤ ε. We then obtain
Σ_{k=m}^∞ P(|Xnk − X| ≥ ε) ≤ Σ_{k=m}^∞ P(|Xnk − X| ≥ εk) ≤ Σ_{k=m}^∞ 2^{−k},
which is finite. Hence, Σ_{k=1}^∞ P(|Xnk − X| ≥ ε) is also finite, and Lemma 1.2.12 then shows
that Xnk converges almost surely to X.
We are now almost ready to introduce the concept of being Cauchy with respect to each of
our three modes of convergence and show that being Cauchy implies convergence.
Lemma 1.2.14. Let (Xn) be a sequence of random variables. It then holds that

(Xn is Cauchy) = ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xn − Xk| ≤ 1/m),    (1.7)

and in particular, the set (Xn is Cauchy) is measurable.

Proof. We first show that for any sequence (xn) of real numbers, (xn) is Cauchy, in the sense that

∀ ε > 0 ∃ n ∈ N ∀ k, i ≥ n : |xk − xi| ≤ ε,    (1.8)

if and only if

∀ m ∈ N ∃ n ∈ N ∀ k ≥ n : |xk − xn| ≤ 1/m.    (1.9)
To this end, first assume that (1.8) holds. Let m ∈ N be given and choose ε > 0 so small that ε ≤ 1/m. Using (1.8), take n ∈ N so that |xk − xi| ≤ ε whenever k, i ≥ n. Then it holds in particular that |xk − xn| ≤ 1/m for all k ≥ n. Thus, (1.9) holds. To prove the converse implication, assume that (1.9) holds. Let ε > 0 be given and take m ∈ N so large that 1/m ≤ ε/2. Using (1.9), take n ∈ N so that for all k ≥ n, |xk − xn| ≤ 1/m. We then obtain that for all k, i ≥ n, it holds that |xk − xi| ≤ |xk − xn| + |xi − xn| ≤ 2/m ≤ ε. We conclude that (1.8) holds. We have now shown that (1.8) and (1.9) are equivalent. Applying this equivalence to the sequence (Xn(ω))n≥1 for each ω ∈ Ω and using the correspondences (1.1) and (1.2) as in the proof of Lemma 1.2.3, we obtain (1.7). As each set (|Xn − Xk| ≤ 1/m) lies in F, so does the right-hand side of (1.7), and the measurability claim follows.
We are now ready to define what it means to be Cauchy with respect to each of our modes
of convergence. In the definition, we use the convention that a double sequence (xnm )n,m≥1
converges to x as n and m tend to infinity if it holds that for all ε > 0, there is k ≥ 1 such
that |xnm − x| ≤ ε whenever n, m ≥ k. In particular, a sequence (xn )n≥1 is Cauchy if and
only if |xn − xm | tends to zero as n and m tend to infinity.
Definition 1.2.15. Let (Xn ) be a sequence of random variables. We say that Xn is Cauchy
in probability if it holds for any ε > 0 that P (|Xn − Xm | ≥ ε) tends to zero as m and n tend
to infinity. We say that Xn is almost surely Cauchy if P ((Xn ) is Cauchy) = 1. Finally, we
say that Xn is Cauchy in Lp for some p ≥ 1 if E|Xn − Xm |p tends to zero as m and n tend
to infinity.
Note that Lemma 1.2.14 ensures that the definition of being almost surely Cauchy is well-
formed, since (Xn is Cauchy) is measurable.
Theorem 1.2.16. Let (Xn) be a sequence of random variables. If Xn is Cauchy in probability, almost surely Cauchy or Cauchy in Lp for some p ≥ 1, then there exists a random variable X such that Xn converges to X in probability, almost surely or in Lp, respectively.

Proof. The result on sequences which are Cauchy in Lp is immediate from Fischer's completeness theorem, so we merely need to show the results for being Cauchy in probability and being almost surely Cauchy.
Consider the case where Xn is almost surely Cauchy. As R equipped with the Euclidean
metric is complete, (Xn is convergent) = (Xn is Cauchy), so in particular, by Lemma 1.2.14,
the former is a measurable almost sure set. Define X by letting X = lim_{n→∞} Xn when the limit exists and zero otherwise. Then X is measurable, and we have

(Xn is convergent) ⊆ (lim_{n→∞} Xn = X),

so Xn converges almost surely to X, proving the result for being almost surely Cauchy.
Finally, assume that Xn is Cauchy in probability. For each k, P (|Xn − Xm | ≥ 2−k ) tends
to zero as m and n tend to infinity. In particular, we find that for each k, there is n∗
such that for n, m ≥ n∗, it holds that P(|Xn − Xm| ≥ 2^{−k}) ≤ 2^{−k}. Therefore, we may pick a sequence of strictly increasing indices (nk) such that P(|Xn − Xm| ≥ 2^{−k}) ≤ 2^{−k}
for n, m ≥ nk. We then obtain in particular that P(|Xnk+1 − Xnk| ≥ 2^{−k}) ≤ 2^{−k} for all k ≥ 1. From this, we find that Σ_{k=1}^∞ P(|Xnk+1 − Xnk| ≥ 2^{−k}) is finite, so by Lemma 1.2.11, P(|Xnk+1 − Xnk| ≥ 2^{−k} i.o.) = 0, leading to P(|Xnk+1 − Xnk| < 2^{−k} evt.) = 1. In particular, it holds almost surely that Σ_{k=1}^∞ |Xnk+1 − Xnk| is finite. For any k > i ≥ 1, we have
|Xnk − Xni| ≤ Σ_{j=i}^{k−1} |Xnj+1 − Xnj| ≤ Σ_{j=i}^∞ |Xnj+1 − Xnj|.
Now, as the tail sums of convergent sums tend to zero, the above shows that on the almost sure set where Σ_{k=1}^∞ |Xnk+1 − Xnk| is finite, (Xnk)k≥1 is Cauchy. In particular, (Xnk)k≥1
is almost surely Cauchy, so by what was already shown, there exists a variable X such that
Xnk converges almost surely to X. In order to complete the proof, we will argue that (Xn )
converges in probability to X. To this end, fix ε > 0. Let δ > 0. As (Xn ) is Cauchy in
probability, there is n∗ such that for m, n ≥ n∗, P(|Xn − Xm| ≥ ε/2) ≤ δ. And as Xnk
converges almost surely to X, Xnk also converges in probability to X by Theorem 1.2.8.
Therefore, for k large enough, P(|Xnk − X| ≥ ε/2) ≤ δ. Let k be so large that this holds and simultaneously so large that nk ≥ n∗. We then obtain for n ≥ n∗ that

P(|Xn − X| ≥ ε) ≤ P(|Xn − Xnk| ≥ ε/2) + P(|Xnk − X| ≥ ε/2) ≤ δ + δ = 2δ.
Thus, for n large enough, P (|Xn − X| ≥ ε) ≤ 2δ. As δ was arbitrary, we conclude that
limn→∞ P (|Xn −X| ≥ ε) = 0, showing that Xn converges in probability to X. This concludes
the proof.
1.3 Independence and Kolmogorov's zero-one law

In this section, we generalize the classical notion of independence of random variables and
events to a notion of independence of σ-algebras. This general notion of independence en-
compasses all types of independence which will be relevant to us.
Definition 1.3.1. Let I be some set and let (Fi)i∈I be a family of σ-algebras. We say that the family of σ-algebras is independent if it holds for any finite sequence of distinct indices i1, . . . , in ∈ I and any F1 ∈ Fi1, . . . , Fn ∈ Fin that

P(∩_{k=1}^n Fk) = Π_{k=1}^n P(Fk).    (1.10)
The abstract definition in Definition 1.3.1 will allow us considerable convenience as regards
matters of independence. The following lemma shows that when we wish to prove indepen-
dence, it suffices to prove the equality (1.10) for generating families which are stable under
finite intersections.
Lemma 1.3.2. Let I be some set and let (Fi)i∈I be a family of σ-algebras. Assume that for each i, Fi = σ(Hi), where Hi is a set family which is stable under finite intersections. If it holds for any finite sequence of distinct indices i1, . . . , in ∈ I and any F1 ∈ Hi1, . . . , Fn ∈ Hin that P(∩_{k=1}^n Fk) = Π_{k=1}^n P(Fk), then (Fi)i∈I is independent.
Proof. We apply Dynkin's lemma and an induction proof. We wish to show that for each n, it holds for all sequences of n distinct indices i1, . . . , in ∈ I and all finite sequences of sets F1 ∈ Fi1, . . . , Fn ∈ Fin that P(∩_{k=1}^n Fk) = Π_{k=1}^n P(Fk). The induction start is trivial, so it suffices to show the induction step. Assume that the result holds for n; we wish to prove it
for n + 1. Fix a finite sequence of n + 1 distinct indices i1, . . . , in+1 ∈ I. We wish to show that
P(∩_{k=1}^{n+1} Fk) = Π_{k=1}^{n+1} P(Fk)    (1.11)
for F1 ∈ Fi1, . . . , Fn+1 ∈ Fin+1. To this end, let k ≤ n + 1, and let Fj ∈ Fij for j ≠ k. Define

D = { Fk ∈ Fik | P(∩_{j=1}^{n+1} Fj) = Π_{j=1}^{n+1} P(Fj) }.    (1.12)
We claim that D is a Dynkin class. To see this, we need to prove that Ω ∈ D, that B \ A ∈ D
whenever A ⊆ B and A, B ∈ D, and that whenever (An ) is an increasing sequence in D,
∪_{n=1}^∞ An ∈ D as well. By our induction assumption, Ω ∈ D. Let A, B ∈ D with A ⊆ B. We then obtain

P((B \ A) ∩ ∩_{j≠k} Fj) = P(B ∩ ∩_{j≠k} Fj) − P(A ∩ ∩_{j≠k} Fj)
= P(B) Π_{j≠k} P(Fj) − P(A) Π_{j≠k} P(Fj) = P(B \ A) Π_{j≠k} P(Fj),

so B \ A ∈ D. Next, let (An) be an increasing sequence in D. By upwards continuity, we obtain
P((∪_{n=1}^∞ An) ∩ ∩_{j≠k} Fj) = P(∪_{n=1}^∞ (An ∩ ∩_{j≠k} Fj)) = lim_{n→∞} P(An ∩ ∩_{j≠k} Fj)
= lim_{n→∞} P(An) Π_{j≠k} P(Fj) = P(∪_{n=1}^∞ An) Π_{j≠k} P(Fj),
so ∪_{n=1}^∞ An ∈ D. This shows that D is a Dynkin class.
We are now ready to argue that (1.11) holds. Note that by our assumption, we know that
(1.11) holds for F1 ∈ Hi1 , . . . , Fn+1 ∈ Hin+1 . Consider F2 ∈ Hi2 , . . . , Fn+1 ∈ Hin+1 . The
family D as defined in (1.12) then contains Hi1 , and so Dynkin’s lemma yields Fi1 = σ(Hi1 ) ⊆
D. This shows that (1.11) holds when F1 ∈ Fi1 and F2 ∈ Hi2 , . . . , Fn+1 ∈ Hin+1 . Next, let
F1 ∈ Fi1 and consider a finite sequence of sets F3 ∈ Hi3 , . . . , Fn+1 ∈ Hin+1 . Then D as defined
in (1.12) contains Hi2 , and therefore by Dynkin’s lemma contains σ(Hi2 ) = Fi2 , proving that
(1.11) holds when F1 ∈ Fi1 , F2 ∈ Fi2 and F3 ∈ Hi3 , . . . , Fn+1 ∈ Hin+1 . By a finite induction
argument, we conclude that (1.11) in fact holds when F1 ∈ Fi1 , . . . , Fn+1 ∈ Fin+1 , as desired.
This proves the induction step and thus concludes the proof.
The following definition shows how we may define independence between families of variables
and families of events from Definition 1.3.1.
Definition 1.3.3. Let I be some set and let (Xi )i∈I be a family of random variables. We
say that the family is independent when the family of σ-algebras (σ(Xi ))i∈I is independent.
Also, if (Fi )i∈I is a family of events, we say that the family is independent when the family
of σ-algebras (σ(1Fi ))i∈I is independent.
Next, we show that Definition 1.3.3 agrees with our usual definitions of independence.
Lemma 1.3.4. Let I be some set and let (Xi)i∈I be a family of random variables. The family is independent if and only if it holds for any finite sequence of distinct indices i1, . . . , in ∈ I and any A1, . . . , An ∈ B that P(∩_{k=1}^n (Xik ∈ Ak)) = Π_{k=1}^n P(Xik ∈ Ak).
Proof. From Definition 1.3.3, we have that (Xi )i∈I is independent if and only if (σ(Xi ))i∈I is
independent, which by Definition 1.3.1 is the case if and only if for any finite sequence
of distinct indicies i1 , . . . , in ∈ I and any F1 ∈ σ(Xi1 ), . . . , Fn ∈ σ(Xin ) it holds that
Qn
P (∩nk=1 Fk ) = k=1 P (Fk ). However, we have σ(Xi ) = {(Xi ∈ A) | A ∈ B} for all i ∈ I, so
Qn
the condition is equivalent to requiring that P (∩nk=1 (Xik ∈ Ak )) = k=1 P (Xik ∈ Ak ) for
any finite sequence of distinct indicies i1 , . . . , in ∈ I and any A1 , . . . , An ∈ B. This proves
the claim.
Lemma 1.3.5. Let I be some set and let (F_i)_{i∈I} be a family of events. The family is
independent if and only if it holds for any finite sequence of distinct indices i_1, ..., i_n ∈ I
that P(∩_{k=1}^n F_{i_k}) = ∏_{k=1}^n P(F_{i_k}).
Proof. From Definition 1.3.3, (Fi )i∈I is independent if and only if (σ(1Fi ))i∈I is independent.
Note that for all i ∈ I, σ(1Fi ) = {Ω, ∅, Fi , Fic }, so σ(1Fi ) is generated by {Fi }. Therefore,
Lemma 1.3.2 yields the conclusion.
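The product rule characterising independence in Lemma 1.3.4 and Lemma 1.3.5 can be observed empirically. The following sketch is our own illustration, not part of the text; the probabilities 0.5, 0.3 and 0.8, the seed and the sample size are arbitrary choices. It simulates three independent events and compares the empirical probability of their intersection with the product of the empirical marginal probabilities.

```python
import numpy as np

# Empirical check of the product rule for independent events; the
# probabilities 0.5, 0.3, 0.8 and the seed are arbitrary choices.
rng = np.random.default_rng(seed=0)
n = 200_000
U = rng.uniform(size=(3, n))                 # three independent uniform samples
F = U < np.array([[0.5], [0.3], [0.8]])      # F_i = (U_i < p_i), so P(F_i) = p_i

empirical_joint = np.mean(F[0] & F[1] & F[2])
product_of_marginals = F.mean(axis=1).prod()

# Both quantities should lie close to 0.5 * 0.3 * 0.8 = 0.12.
print(empirical_joint, product_of_marginals)
```

Both printed values should agree up to sampling error, reflecting that independence of the generating variables carries over to events built from them.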
Lemma 1.3.7. Let I be some set and let (Xi )i∈I be a family of independent variables. For
each i, let ψi : R → R be some measurable mapping. Then (ψi (Xi ))i∈I is also independent.
Lemma 1.3.8. Let I be some set and let (F_i)_{i∈I} be an independent family of σ-algebras.
Let J, J′ ⊆ I and assume that J and J′ are disjoint. Then, the σ-algebras σ((F_i)_{i∈J}) and
σ((F_i)_{i∈J′}) are independent.
Proof. Let G = σ((F_i)_{i∈J}) and G′ = σ((F_i)_{i∈J′}), and let H and H′ denote the families of
finite intersections of sets from the F_i for i ∈ J and for i ∈ J′, respectively. Then H and H′
are generating families for G and G′, respectively, stable under finite intersections. Now let
F ∈ H and G ∈ H′. Then, there exist n, n′ ≥ 1, i_1, ..., i_n ∈ J and i′_1, ..., i′_{n′} ∈ J′ and
F_1 ∈ F_{i_1}, ..., F_n ∈ F_{i_n} and G_1 ∈ F_{i′_1}, ..., G_{n′} ∈ F_{i′_{n′}} such that
F = ∩_{k=1}^n F_k and G = ∩_{k=1}^{n′} G_k. Since J and J′ are disjoint, the sequence i_1, ..., i_n, i′_1, ..., i′_{n′}
consists of distinct indices. As (F_i)_{i∈I} is independent, we then obtain

P(F ∩ G) = P((∩_{k=1}^n F_k) ∩ (∩_{k=1}^{n′} G_k)) = ∏_{k=1}^n P(F_k) ∏_{k=1}^{n′} P(G_k) = P(F)P(G).

As H and H′ are generating families for G and G′ stable under finite intersections, this yields
that G and G′ are independent.
Before ending the section, we show some useful results where independence is involved.

Definition 1.3.9. Let (X_n) be a sequence of random variables. The tail σ-algebra of (X_n)
is defined as the σ-algebra ∩_{n=1}^∞ σ(X_n, X_{n+1}, ...).
Colloquially speaking, the tail σ-algebra of (Xn ) consists of events which only depend on the
tail properties of (Xn ). For example, as we will see shortly, the set where (Xn ) is convergent
is an element in the tail σ-algebra.
Theorem 1.3.10 (Kolmogorov's zero-one law). Let (X_n) be a sequence of independent variables.
Let J be the tail σ-algebra of (X_n). For each F ∈ J, it holds that either P(F) = 0 or
P(F) = 1.

Proof. Fix F ∈ J and define D = {B ∈ F | P(B ∩ F) = P(B)P(F)}. We first show that D is
a Dynkin class. Clearly, Ω ∈ D. For A, B ∈ D with A ⊆ B, we obtain

P((B \ A) ∩ F) = P(B ∩ A^c ∩ F) = P((B ∩ F) ∩ (A ∩ F)^c)
= P(B ∩ F) − P(A ∩ F) = P(B)P(F) − P(A)P(F) = P(B \ A)P(F),

so B \ A ∈ D. And for an increasing sequence (B_n) in D, we obtain

P((∪_{n=1}^∞ B_n) ∩ F) = P(∪_{n=1}^∞ (B_n ∩ F)) = lim_{n→∞} P(B_n ∩ F)
= lim_{n→∞} P(B_n)P(F) = P(∪_{n=1}^∞ B_n)P(F),

proving that ∪_{n=1}^∞ B_n ∈ D. We have now shown that D is a Dynkin class. Now fix n ≥ 2.
As F ∈ J, it holds that F ∈ σ(X_n, X_{n+1}, ...). Since the sequence (X_n) is independent,
Lemma 1.3.8 shows that σ(X_n, X_{n+1}, ...) is independent of σ(X_1, ..., X_{n−1}). Therefore,
σ(X_1, ..., X_{n−1}) ⊆ D for all n ≥ 2. As the family ∪_{n=1}^∞ σ(X_1, ..., X_n) is a generating family
for σ(X_1, X_2, ...) which is stable under finite intersections, Dynkin's lemma allows us to
conclude σ(X_1, X_2, ...) ⊆ D. From this, we obtain J ⊆ D, so F ∈ D. Thus, for any F ∈ J,
it holds that P(F) = P(F ∩ F) = P(F)², yielding that P(F) = 0 or P(F) = 1.
Example 1.3.11. Let (X_n) be a sequence of independent variables. Recalling Lemma 1.2.14,
we note that for any k ≥ 1, the set where (X_n)_{n≥1} is convergent equals the set where
(X_n)_{n≥k} is convergent, and by the Cauchy criterion, the latter may be written as
∩_{m=1}^∞ ∪_{n=k}^∞ ∩_{i,j≥n} (|X_i − X_j| ≤ 1/m),
which is in σ(X_k, X_{k+1}, ...). As k was arbitrary, we find that ((X_n)_{n≥1} is convergent) is in
the tail σ-algebra of (X_n). Thus, Theorem 1.3.10 allows us to conclude that the probability
of (X_n) being convergent is either zero or one. ◦
Combining Theorem 1.3.10 and Lemma 1.2.11, we obtain the following useful result.

Lemma 1.3.12 (Second Borel-Cantelli). Let (F_n) be a sequence of independent events. Then
P(F_n i.o.) is either zero or one, and the probability is zero if and only if Σ_{n=1}^∞ P(F_n) is finite.

Proof. Let J be the tail σ-algebra of the sequence (1_{F_n}) of variables. Theorem 1.3.10 then
shows that J only contains sets of probability zero or one. Note that for any m ≥ 1, we
have (F_n i.o.) = ∩_{n=1}^∞ ∪_{k=n}^∞ F_k = ∩_{n=m}^∞ ∪_{k=n}^∞ F_k, so (F_n i.o.) is in J. Hence, Theorem 1.3.10
shows that P(F_n i.o.) is either zero or one.
As regards the criterion for the probability to be zero, note that from Lemma 1.2.11, we
know that if Σ_{n=1}^∞ P(F_n) is finite, then P(F_n i.o.) = 0. We need to show the converse,
namely that if P(F_n i.o.) = 0, then Σ_{n=1}^∞ P(F_n) is finite. This is equivalent to showing that
if Σ_{n=1}^∞ P(F_n) is infinite, then P(F_n i.o.) ≠ 0. And to prove this, it suffices to show that if
Σ_{n=1}^∞ P(F_n) is infinite, then P(F_n i.o.) = 1.
Assume that Σ_{n=1}^∞ P(F_n) is infinite. As it holds that (F_n i.o.)^c = (F_n^c evt.), it suffices to
show P(F_n^c evt.) = 0. To do so, we note that since the sequence (F_n) is independent, Lemma
1.3.7 shows that the sequence (F_n^c) is independent as well. Therefore,

P(∩_{k=n}^∞ F_k^c) = lim_{i→∞} P(∩_{k=n}^i F_k^c) = lim_{i→∞} ∏_{k=n}^i P(F_k^c) = lim_{i→∞} ∏_{k=n}^i (1 − P(F_k)),

since the sequence (∩_{k=n}^i F_k^c)_{i≥1} is decreasing. Next, note that for x ≥ 0, we have

−x = ∫_0^x (−1) dy ≤ ∫_0^x (−exp(−y)) dy = ∫_0^x (d/dy) exp(−y) dy = exp(−x) − 1,

so that 1 − x ≤ exp(−x) for x ≥ 0. We therefore obtain

P(∩_{k=n}^∞ F_k^c) ≤ lim_{i→∞} ∏_{k=n}^i exp(−P(F_k)) = lim_{i→∞} exp(−Σ_{k=n}^i P(F_k)) = 0,

since Σ_{n=1}^∞ P(F_n) is infinite. As (F_n^c evt.) = ∪_{n=1}^∞ ∩_{k=n}^∞ F_k^c, we conclude that
P(F_n^c evt.) = 0, completing the proof.
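The dichotomy in Lemma 1.3.12 can be observed numerically. The sketch below is our own illustration, not part of the text; the event probabilities 1/n and 1/n² and the horizon are arbitrary choices. Events whose probabilities have a divergent sum keep occurring, while events with a convergent sum occur only a few times.

```python
import numpy as np

# Numerical illustration of the second Borel-Cantelli lemma; the event
# probabilities 1/n and 1/n^2 and the horizon N are arbitrary choices.
rng = np.random.default_rng(seed=1)
N = 100_000
n = np.arange(1, N + 1)
U = rng.uniform(size=N)

# P(F_n) = 1/n: the sum diverges, so F_n occurs infinitely often a.s.
hits_divergent = int(np.sum(U < 1.0 / n))
# P(F_n) = 1/n^2: the sum is finite, so only finitely many F_n occur a.s.
hits_convergent = int(np.sum(U < 1.0 / n ** 2))

print(hits_divergent, hits_convergent)
```

Note that the events with probability 1/n² are contained in those with probability 1/n here, so the second count never exceeds the first; over any finite horizon the divergent case accumulates occurrences at the rate of the harmonic sum.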
1.4 Convergence of sums of independent variables
In this section, we consider a sequence of independent variables (X_n) and investigate when
the sum Σ_{k=1}^n X_k converges as n tends to infinity. During the course of this section, we will
encounter sequences (x_n) such that Σ_{k=1}^n x_k converges, while Σ_{k=1}^n |x_k| may not converge,
that is, series which are convergent but not absolutely convergent. In such cases, Σ_{k=1}^∞ x_k
is not always well-defined. However, for notational convenience, we will apply the following
convention: For a sequence (x_n), we say that Σ_{k=1}^∞ x_k converges when lim_{n→∞} Σ_{k=1}^n x_k
exists, and say that Σ_{k=1}^∞ x_k diverges when lim_{n→∞} Σ_{k=1}^n x_k does not exist, and in the
latter case, Σ_{k=1}^∞ x_k is undefined. With these conventions, we can say that in this section, we
seek to understand when Σ_{n=1}^∞ X_n converges for a sequence (X_n) of independent variables.
Our first result is an example of a maximal inequality, that is, an inequality which yields
bounds on the distribution of a maximum of random variables. We will use this result to
prove a sufficient criterion for a sum of variables to converge almost surely and in L². Note
that in the following, just as we write EX for the expectation of a random variable X, we
write V X for the variance of X.
Theorem 1.4.1 (Kolmogorov's maximal inequality). Let (X_k)_{1≤k≤n} be a finite sequence of
independent random variables with mean zero and finite variance, and let ε > 0. It then holds that

P(max_{1≤k≤n} |Σ_{i=1}^k X_i| ≥ ε) ≤ (1/ε²) Σ_{k=1}^n V X_k.

Proof. Define S_k = Σ_{i=1}^k X_i; we may then state the desired inequality as
P(max_{1≤k≤n} |S_k| ≥ ε) ≤ ε^{-2} V S_n.
Let T = min{1 ≤ k ≤ n | |S_k| ≥ ε}, with the convention that the minimum of the empty
set is ∞. Colloquially speaking, T is the first time where the sequence (S_k)_{1≤k≤n} takes an
absolute value equal to or greater than ε. Note that T takes its values in {1, ..., n} ∪ {∞}.
And for each k ≤ n, it holds that (T ≤ k) = ∪_{i=1}^k (|S_i| ≥ ε), so in particular T is measurable.
Now, (max_{1≤k≤n} |S_k| ≥ ε) = ∪_{k=1}^n (|S_k| ≥ ε) = (T ≤ n). Also, whenever T is finite, it holds
that |S_T| ≥ ε, so that 1 ≤ ε^{-2} S_T². Therefore, we obtain, for i > k,

E X_k X_i 1_(T≥k) 1_(T≥i) = E(X_i) E X_k 1_(T≥k) 1_(T≥i) = 0,    (1.16)

since X_i has mean zero. Collecting our conclusions from (1.14), (1.15) and (1.16), we obtain
P(max_{1≤k≤n} |S_k| ≥ ε) ≤ ε^{-2} Σ_{k=1}^n E X_k² = ε^{-2} Σ_{k=1}^n V X_k = ε^{-2} V S_n, as desired.
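As a quick Monte Carlo sanity check of Kolmogorov's maximal inequality (our own sketch, not part of the text; the standard normal distribution, the repetition count and ε = 4 are arbitrary choices), we can estimate the left-hand side by simulation and compare it with the bound ε^{-2} Σ_{k=1}^n V X_k:

```python
import numpy as np

# Monte Carlo check of Kolmogorov's maximal inequality; the standard
# normal distribution, n = 10, 50000 repetitions and eps = 4 are
# arbitrary choices.
rng = np.random.default_rng(seed=2)
n, reps, eps = 10, 50_000, 4.0
X = rng.normal(size=(reps, n))       # independent, mean zero, V X_k = 1
S = np.cumsum(X, axis=1)             # partial sums S_1, ..., S_n per repetition

lhs = np.mean(np.max(np.abs(S), axis=1) >= eps)  # estimate of P(max_k |S_k| >= eps)
rhs = n / eps ** 2                               # eps^{-2} times the sum of variances

print(lhs, rhs)  # the estimated probability should not exceed the bound
```

The inequality is typically far from tight; its value lies in holding uniformly over the whole sequence of partial sums rather than in sharpness.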
Theorem 1.4.2. Let (X_n) be a sequence of independent random variables with mean zero and
finite variance. If Σ_{n=1}^∞ V X_n is finite, then Σ_{k=1}^n X_k converges almost surely and in L²
as n tends to infinity.

Proof. For any sequence (x_n) in R, it holds that (x_n) is Cauchy if and only if for each m ≥ 1,
there is n ≥ 1 such that whenever k ≥ n+1, it holds that |x_k − x_n| < 1/m. Put S_n = Σ_{k=1}^n X_k.
We show that S_n is almost surely convergent. We have

((S_n)_{n≥1} is convergent) = ∩_{m=1}^∞ ∪_{n=1}^∞ (sup_{k≥n+1} |S_k − S_n| ≤ 1/m).
As the intersection of a countable family of almost sure sets again is an almost sure set,
we find that in order to show almost sure convergence of S_n, it suffices to show that for
each m ≥ 1, ∪_{n=1}^∞ (sup_{k≥n+1} |S_k − S_n| ≤ 1/m) is an almost sure set. However, we have
P(∪_{n=1}^∞ (sup_{k≥n+1} |S_k − S_n| ≤ 1/m)) ≥ P(sup_{k≥i+1} |S_k − S_i| ≤ 1/m) for all i ≥ 1, yielding
P(∪_{n=1}^∞ (sup_{k≥n+1} |S_k − S_n| ≤ 1/m)) ≥ lim inf_{n→∞} P(sup_{k≥n+1} |S_k − S_n| ≤ 1/m). Combining
our conclusions, we find that in order to show the desired almost sure convergence of S_n, it
suffices to show lim_{n→∞} P(sup_{k≥n+1} |S_k − S_n| ≤ 1/m) = 1 for all m ≥ 1, which is equivalent
to showing

lim_{n→∞} P(sup_{k≥n+1} |S_k − S_n| > 1/m) = 0    (1.17)
for all m ≥ 1. We wish to apply Theorem 1.4.1 to show (1.17). To do so, we first note that

P(sup_{k≥n+1} |S_k − S_n| > 1/m) = lim_{k→∞} P(max_{n+1≤i≤k} |S_i − S_n| > 1/m),

since the sequence (max_{n+1≤i≤k} |S_i − S_n| > 1/m)_{k≥n+1} is increasing in k. Applying Theorem
1.4.1 to the independent variables X_{n+1}, ..., X_k with mean zero, we find, for k ≥ n + 1,

P(max_{n+1≤i≤k} |S_i − S_n| > 1/m) = P(max_{n+1≤i≤k} |Σ_{j=n+1}^i X_j| > 1/m) ≤ (1/m)^{-2} Σ_{i=n+1}^k V X_i.

Combining these observations, we obtain

P(sup_{k≥n+1} |S_k − S_n| > 1/m) ≤ lim_{k→∞} (1/m)^{-2} Σ_{i=n+1}^k V X_i = (1/m)^{-2} Σ_{i=n+1}^∞ V X_i.
As the series Σ_{n=1}^∞ V X_n is assumed convergent, the tail sums converge to zero and we finally
obtain lim_{n→∞} P(sup_{k≥n+1} |S_k − S_n| > 1/m) = 0, which is precisely (1.17). Thus, by our
previous deliberations, we may now conclude that S_n is almost surely convergent. It remains
to prove convergence in L². Let S_∞ be the almost sure limit of S_n; we will show that S_n
also converges in L² to S_∞. By an application of Fatou's lemma, we get

E(S_n − S_∞)² = E lim_{k→∞} (S_n − S_k)² ≤ lim inf_{k→∞} E(S_n − S_k)² = lim inf_{k→∞} E(Σ_{i=n+1}^k X_i)².    (1.19)

Recalling that the sequence (X_n) consists of independent variables with mean zero, we obtain

E(Σ_{i=n+1}^k X_i)² = E Σ_{i=n+1}^k Σ_{j=n+1}^k X_i X_j = Σ_{i=n+1}^k E X_i² = Σ_{i=n+1}^k V X_i.    (1.20)

Combining (1.19) and (1.20), we get E(S_n − S_∞)² ≤ Σ_{i=n+1}^∞ V X_i. As the series is conver-
gent, the tail sums converge to zero, so we conclude lim_{n→∞} E(S_n − S_∞)² = 0. This proves
convergence in L² and so completes the proof.
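A classical example covered by the convergence criterion just proven is the random harmonic series, where X_n = s_n/n for independent signs s_n = ±1, so that the variances 1/n² are summable. The sketch below is our own illustration (seed and horizon are arbitrary choices); the fluctuations of the partial sums die out along the sequence:

```python
import numpy as np

# The random harmonic series: X_n = s_n / n with independent signs
# s_n = +1 or -1, so V X_n = 1/n^2 is summable; the seed and the
# horizon N are arbitrary choices.
rng = np.random.default_rng(seed=3)
N = 1_000_000
signs = rng.choice([-1.0, 1.0], size=N)
partial_sums = np.cumsum(signs / np.arange(1, N + 1))

# Compare the spread of the partial sums over an early stretch of the
# sequence with the spread over the final stretch.
early_spread = partial_sums[:1000].max() - partial_sums[:1000].min()
late_spread = partial_sums[-1000:].max() - partial_sums[-1000:].min()
print(early_spread, late_spread)
```

The late spread is deterministically tiny, since the last thousand increments have absolute value at most about 10^{-6} each, while the early partial sums move by whole steps of size up to one.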
Theorem 1.4.3 (Kolmogorov's three-series theorem, sufficiency part). Let (X_n) be a sequence
of independent random variables and let ε > 0. Assume that Σ_{n=1}^∞ P(|X_n| > ε) is finite,
that Σ_{n=1}^∞ V(X_n 1_(|X_n|≤ε)) is finite, and that Σ_{k=1}^n E X_k 1_(|X_k|≤ε) converges as n tends
to infinity. Then Σ_{k=1}^n X_k converges almost surely.

Proof. First note that as Σ_{n=1}^∞ P(|X_n| > ε) is finite, we have P(|X_n| > ε i.o.) = 0 by Lemma
1.2.11, which allows us to conclude P(|X_n| ≤ ε evt.) = P((|X_n| > ε i.o.)^c) = 1. Thus, almost
surely, the sequences (X_n) and (X_n 1_(|X_n|≤ε)) are equal from a point onwards. Therefore,
Σ_{k=1}^n X_k converges almost surely if and only if Σ_{k=1}^n X_k 1_(|X_k|≤ε) converges almost surely,
so in order to prove the theorem, it suffices to show that Σ_{k=1}^n X_k 1_(|X_k|≤ε) converges almost
surely. To this end, define Y_n = X_n 1_(|X_n|≤ε) − E(X_n 1_(|X_n|≤ε)). As the sequence (X_n) is
independent, so is the sequence (Y_n). Also, Y_n has mean zero and finite variance, and by
our assumptions, Σ_{n=1}^∞ V Y_n is finite. Therefore, by Theorem 1.4.2, it holds that Σ_{k=1}^n Y_k
converges almost surely as n tends to infinity. Thus, Σ_{k=1}^n (X_k 1_(|X_k|≤ε) − E X_k 1_(|X_k|≤ε)) and
Σ_{k=1}^n E X_k 1_(|X_k|≤ε) converge almost surely, allowing us to conclude that Σ_{k=1}^n X_k 1_(|X_k|≤ε)
converges almost surely. This completes the proof.
1.5 The strong law of large numbers

In this section, we prove the strong law of large numbers, a key result in modern probability
theory. Let (X_n) be a sequence of independent, identically distributed integrable variables
with mean µ. Intuitively speaking, we would expect that (1/n) Σ_{k=1}^n X_k in some sense converges
to µ. The strong law of large numbers shows that this is indeed the case, and that the
convergence is almost sure. In order to demonstrate the result, we first show two lemmas
which will help us to prove the general statement by proving a simpler statement. Both
lemmas consider the case of nonnegative variables. Lemma 1.5.1 establishes that in order to
prove that (1/n) Σ_{k=1}^n X_k converges almost surely to µ, it suffices to prove that
(1/n) Σ_{k=1}^n X_k 1_(X_k≤k) converges almost surely to µ, reducing to the case of
bounded variables. Lemma 1.5.2 establishes that in order to prove that (1/n) Σ_{k=1}^n X_k 1_(X_k≤k)
converges almost surely to µ, it suffices to prove
lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} (X_i 1_(X_i≤i) − E X_i 1_(X_i≤i)) = 0 for particular subsequences
(n_k)_{k≥1}, reducing to a subsequence, and allowing us to focus our attention on bounded
variables with mean zero.
Lemma 1.5.1. Let (X_n) be a sequence of independent, identically distributed variables with
common mean µ. Assume that X_n ≥ 0 for all n ≥ 1. Then (1/n) Σ_{k=1}^n X_k converges almost
surely if and only if (1/n) Σ_{k=1}^n X_k 1_(X_k≤k) converges almost surely, and in the affirmative
case, the limits are the same.
Proof. Let ν denote the common distribution of the X_n. Applying Tonelli's theorem, we find

Σ_{n=1}^∞ P(X_n ≠ X_n 1_(X_n≤n)) = Σ_{n=1}^∞ P(X_n > n) = Σ_{n=1}^∞ ∫ 1_(x>n) dν(x)
= Σ_{n=1}^∞ Σ_{k=n}^∞ ∫ 1_(k<x≤k+1) dν(x) = Σ_{k=1}^∞ Σ_{n=1}^k ∫ 1_(k<x≤k+1) dν(x)
= Σ_{k=1}^∞ ∫ k 1_(k<x≤k+1) dν(x) ≤ Σ_{k=1}^∞ ∫ x 1_(k<x≤k+1) dν(x)
≤ ∫_0^∞ x dν(x) = µ.    (1.21)

Thus, Σ_{n=1}^∞ P(X_n ≠ X_n 1_(X_n≤n)) is finite, and so Lemma 1.2.11 allows us to conclude that
P(X_n ≠ X_n 1_(X_n≤n) i.o.) = 0, which then implies that P(X_n = X_n 1_(X_n≤n) evt.) = 1. Hence,
almost surely, X_n and X_n 1_(X_n≤n) are equal from a point N onwards, where N is stochastic.
For n ≥ N, we therefore have

(1/n) Σ_{k=1}^n (X_k − X_k 1_(X_k≤k)) = (1/n) Σ_{k=1}^N (X_k − X_k 1_(X_k≤k)).

As the right-hand side tends almost surely to zero, the conclusion of the lemma follows.
Lemma 1.5.2. Let (X_n) be a sequence of independent, identically distributed variables with
common mean µ. Assume that for all n ≥ 1, X_n ≥ 0. For α > 1, define n_k = [α^k], with [α^k]
denoting the largest integer which is less than or equal to α^k. If it holds for all α > 1 that

lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} (X_i 1_(X_i≤i) − E X_i 1_(X_i≤i)) = 0

almost surely, then (1/n) Σ_{k=1}^n X_k 1_(X_k≤k) converges to µ almost surely.
Proof. First note that as α > 1, we have n_k = [α^k] ≤ [α^{k+1}] = n_{k+1}. Therefore, (n_k) is
increasing. Also, as [α^k] > α^k − 1, n_k tends to infinity as k tends to infinity. Define a
sequence (Y_n) by putting Y_n = X_n 1_(X_n≤n). Our assumption is then that for all α > 1, it
holds that lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i) = 0 almost surely, and our objective is to
demonstrate that lim_{n→∞} (1/n) Σ_{k=1}^n Y_k = µ almost surely. Let ν be the common
distribution of the X_n. Note that by the dominated convergence theorem,

lim_{n→∞} E Y_n = lim_{n→∞} E X_n 1_(X_n≤n) = lim_{n→∞} ∫_0^∞ 1_(x≤n) x dν(x) = ∫_0^∞ x dν(x) = µ.

As convergence of a sequence implies convergence of its sequence of averages, this yields
lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} E Y_i = µ, and so

lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} Y_i = lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i) + lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} E Y_i = µ,

almost surely. We will use this to prove that (1/n) Σ_{k=1}^n Y_k converges to µ. To do so, first note
that since α^n − 1 < [α^n] ≤ α^n, it holds that n_{k+1}/n_k converges to α. For m ≥ n_1, let k(m)
denote the unique k ≥ 1 such that n_{k(m)} ≤ m < n_{k(m)+1}. As the Y_i are nonnegative, we then have

(1/n_{k(m)+1}) Σ_{i=1}^{n_{k(m)}} Y_i ≤ (1/m) Σ_{i=1}^m Y_i ≤ (1/n_{k(m)}) Σ_{i=1}^{n_{k(m)+1}} Y_i.

Letting m tend to infinity and noting that k(m) tends to infinity as well, we obtain

µ/α ≤ lim inf_{m→∞} (1/m) Σ_{i=1}^m Y_i ≤ lim sup_{m→∞} (1/m) Σ_{i=1}^m Y_i ≤ αµ

almost surely. As this holds for all α > 1, letting α tend to one, we find that the limit exists
and equals µ almost surely, proving that (1/m) Σ_{i=1}^m Y_i is almost surely convergent with
limit µ. This concludes the proof of the lemma.
Theorem 1.5.3 (The strong law of large numbers). Let (X_n) be a sequence of independent,
identically distributed variables with common mean µ. It then holds that (1/n) Σ_{k=1}^n X_k
converges almost surely to µ.
Proof. We first consider the case where X_n ≥ 0 for all n ≥ 1. Combining Lemma 1.5.1 and
Lemma 1.5.2, we find that in order to prove the result, it suffices to show that

lim_{k→∞} (1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i) = 0    (1.22)

almost surely, where Y_n = X_n 1_(X_n≤n), n_k = [α^k] and α > 1. In order to do so, by Lemma
1.2.12, it suffices to show that for any ε > 0, Σ_{k=1}^∞ P(|(1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i)| ≥ ε) is finite.
Using Lemma 1.2.7, we obtain

Σ_{k=1}^∞ P(|(1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i)| ≥ ε) ≤ (1/ε²) Σ_{k=1}^∞ E((1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i))²
= (1/ε²) Σ_{k=1}^∞ (1/n_k²) Σ_{i=1}^{n_k} V Y_i.    (1.23)
Now, as all terms in the above are nonnegative, we may apply Tonelli's theorem to obtain

Σ_{k=1}^∞ (1/n_k²) Σ_{i=1}^{n_k} V Y_i = Σ_{k=1}^∞ Σ_{i=1}^∞ 1_(i≤n_k) (1/n_k²) V Y_i
= Σ_{i=1}^∞ Σ_{k=1}^∞ 1_(i≤n_k) (1/n_k²) V Y_i = Σ_{i=1}^∞ V Y_i Σ_{k:n_k≥i} 1/n_k².    (1.24)
We wish to identify a bound for the inner sum as a function of i. To this end, note that for
x ≥ 2, we have [x] ≥ x − 1 ≥ x/2, and for 1 ≤ x < 2, [x] = 1 ≥ x/2 as well. Thus, for all x ≥ 1,
we have [x] ≥ x/2. Let m_i = inf{k ≥ 1 | n_k ≥ i}; we then have

Σ_{k:n_k≥i} 1/n_k² = Σ_{k=m_i}^∞ 1/[α^k]² ≤ Σ_{k=m_i}^∞ 1/(α^k/2)² = 4 Σ_{k=m_i}^∞ α^{-2k} = 4α^{-2m_i}/(1 − α^{-2}),

where we have applied the formula for summing a geometric series. Noting that m_i satisfies
α^{m_i} ≥ [α^{m_i}] = n_{m_i} ≥ i, we obtain α^{-2m_i} = (α^{-m_i})² ≤ i^{-2}, resulting in the estimate
Σ_{k:n_k≥i} 1/n_k² ≤ 4(1 − α^{-2})^{-1} i^{-2}. Combining this with (1.23) and (1.24), we find that in
order to show almost sure convergence of (1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i) to zero, it suffices to show that
Σ_{i=1}^∞ (1/i²) V Y_i is finite. To this end, let ν denote the common distribution of the X_n. We then
apply Tonelli's theorem to obtain

Σ_{i=1}^∞ (1/i²) V Y_i ≤ Σ_{i=1}^∞ (1/i²) E X_i² 1_(X_i≤i) = Σ_{i=1}^∞ (1/i²) Σ_{j=1}^i E X_i² 1_(j−1<X_i≤j)
= Σ_{i=1}^∞ (1/i²) Σ_{j=1}^i ∫_{j−1}^j x² dν(x) = Σ_{j=1}^∞ ∫_{j−1}^j x² dν(x) Σ_{i=j}^∞ 1/i².    (1.25)

As Σ_{i=j}^∞ 1/i² ≤ 1/j² + ∫_j^∞ x^{-2} dx ≤ 2/j, and as x² ≤ jx for j − 1 < x ≤ j, the right-hand
side of (1.25) is bounded by 2 Σ_{j=1}^∞ ∫_{j−1}^j x dν(x) = 2 ∫_0^∞ x dν(x) = 2µ,
which is finite, since the X_i have finite mean. We have now shown that Σ_{i=1}^∞ (1/i²) V Y_i is conver-
gent, and therefore we may now conclude that (1/n_k) Σ_{i=1}^{n_k} (Y_i − E Y_i) converges to zero almost
surely, proving (1.22). Lemma 1.5.1 and Lemma 1.5.2 now yield that (1/n) Σ_{k=1}^n X_k converges
almost surely to µ.
It remains to extend the result to the case where X_n is not assumed to be nonnegative.
Therefore, we now let (X_n) be any sequence of independent, identically distributed variables
with mean µ and common distribution ν. With x⁺ = max{0, x} and x⁻ = max{0, −x}, Lemma
1.3.7 shows that the sequences (X_n⁺) and (X_n⁻) are each independent and identically
distributed with finite means ∫ x⁺ dν(x) and ∫ x⁻ dν(x), and both sequences consist only of
nonnegative variables. Applying the result already proven to each of these sequences, we obtain

lim_{n→∞} (1/n) Σ_{k=1}^n X_k = lim_{n→∞} (1/n) Σ_{k=1}^n X_k⁺ − lim_{n→∞} (1/n) Σ_{k=1}^n X_k⁻
= ∫ x⁺ dν(x) − ∫ x⁻ dν(x) = µ,

as desired.
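The strong law can be visualised by simulation. The following sketch is our own illustration, not part of the text; the exponential distribution with mean one, the seed and the sample size are arbitrary choices. It computes the running averages of an i.i.d. sequence:

```python
import numpy as np

# Running averages of i.i.d. exponential variables with mean one; the
# distribution, seed and sample size are arbitrary choices.
rng = np.random.default_rng(seed=4)
N = 100_000
X = rng.exponential(scale=1.0, size=N)
running_mean = np.cumsum(X) / np.arange(1, N + 1)

# The averages after 10^3 and 10^5 terms; both should be near 1.
print(running_mean[999], running_mean[-1])
```

Plotting `running_mean` against n shows the trajectory settling around the true mean, with fluctuations of order n^{-1/2}.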
In fact, the convergence in Theorem 1.5.3 holds not only almost surely, but also in L1 . In
Chapter 2, we will obtain this as a consequence of a more general convergence theorem. Before
concluding the chapter, we give an example of a simple statistical application of Theorem
1.5.3.
Example 1.5.4. Consider a measurable space (Ω, F) endowed with a sequence of random
variables (X_n). Assume given a parameter set Θ and a set of probability measures (P_θ)_{θ∈Θ}
such that for the probability space (Ω, F, P_θ), (X_n) consists of independent and identically
distributed variables with finite second moment. Assume further that the mean is ξ_θ and
that the variance is σ_θ². Natural estimators of the mean and variance parameter
functions based on n samples are then

ξ̂_n = (1/n) Σ_{k=1}^n X_k   and   σ̂_n² = (1/n) Σ_{k=1}^n (X_k − (1/n) Σ_{i=1}^n X_i)².
Theorem 1.5.3 allows us to conclude that under P_θ, ξ̂_n converges almost surely to ξ_θ, and
furthermore, by two further applications of Theorem 1.5.3, as well as Lemma 1.2.6 and Lemma
1.2.10, we obtain

σ̂_n² = (1/n) Σ_{k=1}^n X_k² − (2/n) Σ_{k=1}^n X_k (1/n) Σ_{i=1}^n X_i + ((1/n) Σ_{i=1}^n X_i)²
= (1/n) Σ_{k=1}^n X_k² − ((1/n) Σ_{i=1}^n X_i)²,

which converges almost surely to σ_θ² under P_θ. The strong law of large numbers thus allows
us to conclude that the natural estimators ξ̂_n and σ̂_n² converge almost surely to the true
mean and variance. ◦
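The estimators of Example 1.5.4 can be computed for simulated data as follows. This is our own sketch, not part of the text; the normal model with mean 2 and variance 9, the seed and the sample size are arbitrary choices:

```python
import numpy as np

# The natural mean and variance estimators from Example 1.5.4, applied
# to simulated data; the normal model with mean 2 and variance 9, the
# seed and the sample size are arbitrary choices.
rng = np.random.default_rng(seed=5)
xi, sigma2, n = 2.0, 9.0, 200_000
X = rng.normal(loc=xi, scale=np.sqrt(sigma2), size=n)

xi_hat = X.mean()                          # (1/n) sum of the X_k
sigma2_hat = np.mean((X - X.mean()) ** 2)  # (1/n) sum of (X_k - mean)^2

print(xi_hat, sigma2_hat)
```

For a large sample, both estimates lie close to the true parameter values, in line with their almost sure convergence.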
1.6 Exercises
Exercise 1.1. Let X be a random variable, and let (an ) be a sequence of real numbers
converging to zero. Define Xn = an X. Show that Xn converges almost surely to zero. ◦
Exercise 1.2. Give an example of a sequence of random variables (Xn ) such that (Xn )
converges in probability but does not converge almost surely to any variable. Give an example
of a sequence of random variables (Xn ) such that (Xn ) converges in probability but does not
converge in L1 to any variable. ◦
Exercise 1.3. Let (Xn ) be a sequence of random variables such that Xn is Poisson dis-
tributed with parameter 1/n. Show that Xn converges in L1 to zero. ◦
Exercise 1.4. Let (Xn ) be a sequence of random variables such that Xn is Gamma dis-
tributed with shape parameter n2 and scale parameter 1/n. Show that Xn does not converge
in L1 to any integrable variable. ◦
Exercise 1.5. Consider a probability space (Ω, F, P) such that Ω is countable and such that
F is the power set of Ω. Let (X_n) be a sequence of random variables on (Ω, F, P), and
let X be another variable. Show that if X_n converges in probability to X, then X_n converges
almost surely to X. ◦
Exercise 1.6. Let (X_n) be a sequence of random variables and let X be another variable.
Let (F_k) be a sequence of sets in F. Assume that for all k, X_n 1_{F_k} converges in probability
to X 1_{F_k} as n tends to infinity, and assume that lim_{k→∞} P(F_k^c) = 0. Show that X_n
converges in probability to X. ◦
Exercise 1.7. Let (X_n) be a sequence of random variables and let X be another variable.
Let (ε_k)_{k≥1} be a sequence of positive real numbers converging to zero. Show that X_n
converges in probability to X if and only if lim_{n→∞} P(|X_n − X| ≥ ε_k) = 0 for all k ≥ 1. ◦
Exercise 1.8. Let (X_n) be a sequence of random variables, and let X be some other
variable. Show that X_n converges almost surely to X if and only if sup_{k≥n} |X_k − X|
converges in probability to zero as n tends to infinity. ◦
Exercise 1.9. For random variables X and Y, define

d(X, Y) = E(|X − Y| / (1 + |X − Y|)).
Show that d is a pseudometric on the space of real stochastic variables, in the sense that
d(X, Y) ≤ d(X, Z) + d(Z, Y), d(X, Y) = d(Y, X) and d(X, X) = 0 for all X, Y and Z. Show
that d(X, Y) = 0 if and only if X and Y are almost surely equal. Let (X_n) be a sequence
of random variables and let X be some other variable. Show that X_n converges in probability
to X if and only if lim_{n→∞} d(X_n, X) = 0. ◦
Exercise 1.10. Let (X_n) be a sequence of random variables. Show that there exists a
sequence of positive constants (c_n) such that c_n X_n converges almost surely to zero. ◦
Exercise 1.11. Let (X_n) be a sequence of i.i.d. variables with mean zero. Assume that X_n
has fourth moment. Show that for all ε > 0,

Σ_{n=1}^∞ P(|(1/n) Σ_{k=1}^n X_k| ≥ ε) ≤ (4 E X_1⁴ / ε⁴) Σ_{n=1}^∞ 1/n².

Use this to prove the following result: For a sequence (X_n) of i.i.d. variables with fourth
moment and mean µ, it holds that (1/n) Σ_{k=1}^n X_k converges almost surely to µ. ◦
Exercise 1.12. Let (X_n) be a sequence of random variables and let X be some other variable.
Assume that there is p > 1 such that sup_{n≥1} E|X_n|^p is finite. Show that if X_n converges
in probability to X, then E|X|^p is finite and X_n converges in L^q to X for 1 ≤ q < p. ◦
Exercise 1.13. Let (X_n) be a sequence of random variables, and let X be some other
variable. Assume that X_n converges almost surely to X. Show that for all ε > 0, there
exists F ∈ F with P(F^c) ≤ ε such that

lim_{n→∞} sup_{ω∈F} |X_n(ω) − X(ω)| = 0,

that is, such that X_n converges uniformly to X on F. ◦
Exercise 1.14. Let (X_n) be a sequence of random variables. Let X be some other variable.
Let p > 0. Show that if Σ_{n=1}^∞ E|X_n − X|^p is finite, then X_n converges almost surely
to X. ◦
Exercise 1.15. Let (X_n) be a sequence of random variables, and let X be some other
variable. Assume that almost surely, the sequence (X_n) is increasing. Show that if X_n
converges in probability to X, then X_n converges almost surely to X. ◦
Exercise 1.16. Let (X_n) be a sequence of random variables and let (ε_n) be a sequence of
nonnegative constants. Show that if Σ_{n=1}^∞ P(|X_{n+1} − X_n| ≥ ε_n) and Σ_{n=1}^∞ ε_n are finite,
then (X_n) converges almost surely to some random variable. ◦
Exercise 1.17. Let (U_n) be a sequence of i.i.d. variables with common distribution being
the uniform distribution on the unit interval. Define X_n = max{U_1, ..., U_n}. Show that X_n
converges almost surely to one. ◦
Exercise 1.18. Let (X_n) be a sequence of random variables. Show that if there exists c > 0
such that Σ_{n=1}^∞ P(X_n > c) is finite, then sup_{n≥1} X_n is almost surely finite. ◦
Exercise 1.19. Let (X_n) be a sequence of independent random variables. Show that if
sup_{n≥1} X_n is almost surely finite, there exists c > 0 such that Σ_{n=1}^∞ P(X_n > c) is finite. ◦
Exercise 1.20. Let (Xn ) be a sequence of i.i.d. random variables with common distribution
being the standard exponential distribution. Calculate P (Xn / log n > c i.o.) for all c > 0
and use the result to show that lim supn→∞ Xn / log n = 1 almost surely. ◦
Exercise 1.21. Let (Xn ) be a sequence of random variables, and let J be the corresponding
tail-σ-algebra. Let B ∈ B. Show that (Xn ∈ B i.o.) and (Xn ∈ B evt.) are in J . ◦
Exercise 1.22. Let (X_n) be a sequence of random variables, and let J be the correspond-
ing tail σ-algebra. Let B ∈ B and let (a_n) be a sequence of real numbers. Show that if
lim_{n→∞} a_n = 0, then (lim_{n→∞} Σ_{k=1}^n a_{n−k+1} X_k ∈ B) is in J. ◦
Exercise 1.24. Let (X_n) be a sequence of nonnegative random variables. Show that if
Σ_{n=1}^∞ E X_n is finite, then Σ_{k=1}^n X_k is almost surely convergent. ◦
Exercise 1.25. Let (X_n) be a sequence of i.i.d. random variables such that P(X_n = 1) and
P(X_n = −1) both are equal to 1/2. Let (a_n) be a sequence of real numbers. Show that the
sequence Σ_{k=1}^n a_k X_k either is almost surely divergent or almost surely convergent. Show
that the sequence is almost surely convergent if Σ_{n=1}^∞ a_n² is finite. ◦
Exercise 1.26. Give an example of a sequence (X_n) of independent variables with first
moment such that Σ_{k=1}^n X_k converges almost surely while Σ_{k=1}^n E X_k diverges. ◦
Exercise 1.27. Let (X_n) be a sequence of independent random variables with E X_n = 0.
Assume that Σ_{n=1}^∞ E(X_n² 1_(|X_n|≤1) + |X_n| 1_(|X_n|>1)) is finite. Show that Σ_{k=1}^n X_k is
almost surely convergent. ◦
Exercise 1.28. Let (X_n) be a sequence of independent and identically distributed random
variables. Show that E|X_1| is finite if and only if P(|X_n| > n i.o.) = 0. ◦
Exercise 1.29. Let (X_n) be a sequence of independent and identically distributed random
variables. Assume that there is c such that (1/n) Σ_{k=1}^n X_k converges almost surely to c.
Show that E|X_1| is finite and that E X_1 = c. ◦
Chapter 2

Ergodicity and stationarity
In Section 1.5, we proved the strong law of large numbers, which shows that for a sequence
(Xn ) of integrable, independent and identically distributed variables, the empirical means
converge almost surely to the true mean. A reasonable question is whether such a result may
be extended to more general cases. Consider a sequence (Xn ) where each Xn has the same
distribution ν with mean µ. If the dependence between the variables is sufficiently weak, we
may hope that the empirical means still converge to the true mean.
One fruitful case of sufficiently weak dependence turns out to be embedded in the notion
of a stationary stochastic process. The notion of stationarity is connected with the notion
of measure-preserving mappings. Our plan for this chapter is as follows. In Section 2.1, we
investigate measure-preserving mappings, in particular proving the ergodic theorem, which
is a type of law of large numbers. Section 2.2 investigates sufficient criteria for the ergodic
theorem to hold. Finally, in Section 2.3, we apply our results to stationary processes and
prove versions of the law of large numbers for such processes.
As in the previous chapter, we work in the context of a probability space (Ω, F, P). Our
main interest of this section will be a particular type of measurable mapping T : Ω → Ω.
Recall that for such a mapping T, the image measure T(P) is the measure on F defined by
T(P)(F) = P(T^{-1}(F)) for F ∈ F.
Another way to state Definition 2.1.1 is thus that T is measure preserving precisely when
P(T^{-1}(F)) = P(F) for all F ∈ F.
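As a concrete sketch (our own example, not taken from the text), the rotation T(x) = (x + θ) mod 1 on the unit interval preserves the uniform distribution, and the identity P(T^{-1}(F)) = P(F) can be checked empirically for an interval event F:

```python
import numpy as np

# T(x) = (x + theta) mod 1 on [0, 1) preserves the uniform distribution;
# we check P(T^{-1}(F)) = P(F) for F = [0, 0.25). The shift theta, the
# seed and the sample size are arbitrary choices.
rng = np.random.default_rng(seed=6)
theta = 0.3
U = rng.uniform(size=200_000)


def T(x):
    return (x + theta) % 1.0


# U lies in T^{-1}(F) exactly when T(U) lies in F, so both empirical
# probabilities below estimate a quantity that should equal 0.25.
p_F = np.mean(U < 0.25)
p_preimage = np.mean(T(U) < 0.25)
print(p_F, p_preimage)
```

Up to sampling error, the two frequencies coincide, reflecting that the uniform law is invariant under the rotation.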
As the operation of taking the preimage T^{-1}(F) is stable under complements and countable
unions, the set family I_T in Definition 2.1.2 is in fact a σ-algebra.
We have now introduced three concepts: measure preservation of a mapping T , the invariant
σ-algebra for a mapping T and ergodicity for a mapping T . These will be the main objects
of study for this section. Before proceeding, we introduce a final auxiliary concept. Recall
that ◦ denotes function composition, in the sense that if T : Ω → Ω and X : Ω → R, X ◦ T
denotes the mapping from Ω to R defined by (X ◦ T )(ω) = X(T (ω)).
We are now ready to begin preparations for the main result of this section, the ergodic
theorem. Note that for T : Ω → Ω, it is sensible to consider T ∘ T, denoted T², which is
defined by (T ∘ T)(ω) = T(T(ω)), and more generally, T^n for some n ≥ 1. In the following,
T denotes some measurable mapping from Ω to Ω. The ergodic theorem states that if T is
measure preserving and ergodic, it holds for any variable X with p'th moment, p ≥ 1, that
the average (1/n) Σ_{k=1}^n X ∘ T^{k−1} converges almost surely and in L^p to the mean EX. In order
to show the result, we first need a few lemmas.
Lemma 2.1.5. Let X be a random variable. It holds that X is invariant if and only if X is
I_T-measurable.
Proof. First assume that X is invariant, and consider A ∈ B. We need to prove that (X ∈ A)
is in I_T, which is equivalent to showing T^{-1}(X ∈ A) = (X ∈ A). To obtain this, we simply
note that as X ∘ T = X, we have T^{-1}(X ∈ A) = (X ∘ T ∈ A) = (X ∈ A).
Lemma 2.1.6. Let T be measure preserving, let X be a random variable with finite mean and
put S_n = Σ_{k=1}^n X ∘ T^{k−1}. Then E X 1_(sup_{n≥1} (1/n) S_n > 0) ≥ 0.

Proof. Fix n and define M_n = max{0, S_1, ..., S_n}. Note that sup_{n≥1} (1/n) S_n > 0 if and only if
there exists n such that (1/n) S_n > 0, which is the case if and only if there exists n such that
M_n > 0. As the sequence of sets ((M_n > 0))_{n≥1} is increasing, the dominated convergence
theorem then shows that

E X 1_(sup_{n≥1} (1/n) S_n > 0) = E X 1_{∪_{n=1}^∞ (M_n > 0)} = E lim_{n→∞} X 1_(M_n>0) = lim_{n→∞} E X 1_(M_n>0),
and so it suffices to prove that E X 1_(M_n>0) ≥ 0 for each n. To do so, fix n. Note that as T
is measure preserving, so is T^k for all k. As M_n is nonnegative, we then have

0 ≤ E M_n ≤ E Σ_{i=1}^n |S_i| ≤ E Σ_{i=1}^n Σ_{k=1}^i |X| ∘ T^{k−1} = Σ_{i=1}^n Σ_{k=1}^i E|X| = (n(n+1)/2) E|X|,

which shows that M_n is integrable. As E(M_n ∘ T) 1_(M_n>0) ≤ E(M_n ∘ T) = E M_n by the
measure preservation property of T, (M_n ∘ T) 1_(M_n>0) is also integrable, and we have

E X 1_(M_n>0) = E(X + M_n ∘ T) 1_(M_n>0) − E(M_n ∘ T) 1_(M_n>0) ≥ E(X + M_n ∘ T) 1_(M_n>0) − E M_n.

As E M_n 1_(M_n>0) = E M_n, it therefore suffices to show that (X + M_n ∘ T) 1_(M_n>0) ≥ M_n 1_(M_n>0).
To do so, note that for 1 ≤ k ≤ n − 1, it holds that X + M_n ∘ T ≥ X + S_k ∘ T = X + Σ_{i=1}^k X ∘ T^i = S_{k+1},
and also, as M_n ∘ T ≥ 0, X + M_n ∘ T ≥ X = S_1. Therefore, X + M_n ∘ T ≥ max{S_1, ..., S_n}. From this, it follows
that (X + M_n ∘ T) 1_(M_n>0) ≥ max{S_1, ..., S_n} 1_(M_n>0) = M_n 1_(M_n>0), as desired.
Theorem 2.1.7 (The ergodic theorem). Let T be measure preserving and ergodic, let p ≥ 1
and let X be a random variable with p'th moment. Then (1/n) Σ_{k=1}^n X ∘ T^{k−1} converges
to EX almost surely and in L^p.

Proof. We first consider the case where X has mean zero. Define S_n = Σ_{k=1}^n X ∘ T^{k−1}. We
need to show that lim_{n→∞} (1/n) S_n = 0 almost surely and in L^p. Put Y = lim sup_{n→∞} (1/n) S_n; we
will show that almost surely, Y ≤ 0. If we can obtain this, a symmetry argument will then
allow us to obtain the desired conclusion. In order to prove that Y ≤ 0 almost surely, we
first take ε > 0 and show that P(Y > ε) = 0. To this end, we begin by noting that

Y ∘ T = lim sup_{n→∞} (1/n)(S_n ∘ T) = lim sup_{n→∞} (1/n) Σ_{k=1}^n X ∘ T^k = lim sup_{n→∞} (1/n)(S_{n+1} − X) = Y,
allowing us to conclude that Y is invariant, so that (Y > ε) ∈ I_T by Lemma 2.1.5. Now define
X′ = (X − ε) 1_(Y>ε) and put S_n′ = Σ_{k=1}^n X′ ∘ T^{k−1}. As (Y > ε) is invariant, it holds that
S_n′ = (S_n − nε) 1_(Y>ε), and so

(Y > ε) = ∪_{n=1}^∞ (Y > ε) ∩ (S_n − nε > 0) = ∪_{n=1}^∞ (S_n′ > 0)
= ∪_{n=1}^∞ ((1/n) S_n′ > 0) = (sup_{n≥1} (1/n) S_n′ > 0).    (2.1)

This relates the event (Y > ε) to the sequence (S_n′). Applying Lemma 2.1.6 and recalling
(2.1), we obtain E 1_(Y>ε) X′ ≥ 0, which implies

ε P(Y > ε) ≤ E X 1_(Y>ε).    (2.2)
Finally, recall that by ergodicity of T, P(Y > ε) is either zero or one. If P(Y > ε) is one,
(2.2) yields ε ≤ E X = 0, a contradiction. Therefore, we must have that P(Y > ε) is zero. We now
use this to complete the proof of almost sure convergence. As P(Y > ε) is zero for all ε > 0,
we obtain P(Y > 0) = lim_{m→∞} P(Y > 1/m) = 0, and thus lim sup_{n→∞} (1/n) S_n ≤ 0 almost surely.
Noting that Σ_{k=1}^n (−X) ∘ T^{k−1} = −S_n,
so applying the same result with −X instead of X, we also obtain −lim inf_{n→∞} (1/n) S_n ≤ 0
almost surely. All in all, this shows that 0 ≤ lim inf_{n→∞} (1/n) S_n ≤ lim sup_{n→∞} (1/n) S_n ≤ 0
almost surely, so lim_{n→∞} (1/n) S_n = 0 almost surely, as desired. Finally, considering the case
where EX is nonzero, we may use our previous result with the variable X − EX to obtain
lim_{n→∞} (1/n) Σ_{k=1}^n X ∘ T^{k−1} = lim_{n→∞} (1/n) Σ_{k=1}^n (X − EX) ∘ T^{k−1} + EX = EX, completing the
proof of almost sure convergence in the general case.
We consider the each of the three terms on the right-hand side. For the first term, it holds
that kEX − EX 0 kp = |EX − EX 0 | = |EX1(|X|>m) | ≤ E|X|1(|X|>m) . As for the sec-
ond term, the results already proven show that n1 Sn0 converges almost surely to EX 0 . As
Pn
|Sn0 | = | n1 k=1 X 0 ◦ T k−1 | ≤ m, the dominated convergence theorem allows us to con-
clude limn→∞ E|X 0 − n1 Sn0 |p = E limn→∞ |X 0 − n1 Sn0 |p = 0, which implies that we have
limn→∞ kEX 0 − n1 Sn kp = 0. Finally, we may apply the triangle inequality and the measure
preservation property of T to obtain
‖(1/n)Sn′ − (1/n)Sn ‖_p = ‖(1/n) ∑_{k=1}^n X′ ◦ T^{k−1} − (1/n) ∑_{k=1}^n X ◦ T^{k−1} ‖_p ≤ (1/n) ∑_{k=1}^n ‖X ◦ T^{k−1} − X′ ◦ T^{k−1} ‖_p
= (1/n) ∑_{k=1}^n (∫ |(X − X′ ) ◦ T^{k−1} |^p dP)^{1/p} = (1/n) ∑_{k=1}^n (∫ |X − X′ |^p dP)^{1/p}
= ‖X − X′ ‖_p = (E|X|^p 1_{(|X|>m)} )^{1/p} .
By the dominated convergence theorem, both of these terms tend to zero as m tends to
infinity. As the bound in (2.3) holds for all m, we conclude lim sup_{n→∞} ‖EX − (1/n)Sn ‖_p = 0,
which yields convergence in Lp .
Theorem 2.1.7 shows that for any variable X with p’th moment and any measure preserving
and ergodic transformation T , a version of the strong law of large numbers holds for the
process (X ◦ T^{k−1} )_{k≥1}, in the sense that (1/n) ∑_{k=1}^n X ◦ T^{k−1} converges almost surely and in
Lp to EX. Note that in this case, the measure preservation property of T shows that X and
X ◦ T^{k−1} have the same distribution for all k ≥ 1. Therefore, Theorem 2.1.7 is a type of law
of large numbers for processes of identically distributed, but not necessarily independent, variables.
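Theorem 2.1.7 lends itself to a quick numerical illustration (our own sketch, not part of the text): for the rotation T (x) = x + λ − [x + λ] on [0, 1) with the Lebesgue measure, which is measure preserving and, for irrational λ, ergodic, the Birkhoff averages of a bounded observable should approach its mean.

```python
import math

# Rotation on [0, 1): measure preserving for Lebesgue measure, and
# ergodic when lam is irrational (here lam = sqrt(2) - 1).
lam = math.sqrt(2) - 1

def T(x):
    return (x + lam) % 1.0

def X(x):
    # A bounded observable with EX = 0 under the Lebesgue measure.
    return math.cos(2 * math.pi * x)

def birkhoff_average(x0, n):
    """Compute (1/n) * sum_{k=1}^{n} X(T^{k-1}(x0))."""
    total, x = 0.0, x0
    for _ in range(n):
        total += X(x)
        x = T(x)
    return total / n

# Theorem 2.1.7 predicts convergence to EX = 0 for almost every x0.
print(birkhoff_average(0.3, 100_000))  # close to 0
```

Running the same experiment with a rational λ shows why ergodicity matters: the orbit is then periodic, and the average depends on the starting point x0, in line with Exercise 2.4.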
To apply Theorem 2.1.7, we need to be able to show measure preservation and ergodicity. In
this section, we prove some sufficient criteria which will help make this possible in practical
cases. Throughout this section, T denotes a measurable mapping from Ω to Ω.
First, we consider a simple lemma showing that in order to prove that T is measure preserv-
ing, it suffices to check the claim only for a generating family which is stable under finite
intersections.
Lemma 2.2.1. Let H be a generating family for F which is stable under finite intersections.
If P (T −1 (F )) = P (F ) for all F ∈ H, then T is P -measure preserving.
Proof. As both P and T (P ) are probability measures, this follows from the uniqueness the-
orem for probability measures.
Next, we consider the somewhat more involved problem of showing that a measure preserving
mapping is ergodic. A simple first result is the following.
Theorem 2.2.2. Let T be measure preserving. Then T is ergodic if and only if every
invariant random variable is constant almost surely.
Proof. First assume that T is ergodic. Let X be an invariant random variable. By Lemma
2.1.5, X is IT measurable, so in particular (X ≤ x) ∈ IT for all x ∈ R. As T is ergodic, all
events in IT have probability zero or one, so we find that P (X ≤ x) is zero or one for all
x ∈ R.
We claim that this implies that X is constant almost surely. To this end, we define c by
putting c = sup{x ∈ R | P (X ≤ x) = 0}. As we cannot have P (X ≤ x) = 1 for all x ∈ R,
{x ∈ R | P (X ≤ x) = 0} is nonempty, so c is not minus infinity. And as we cannot have
P (X ≤ x) = 0 for all x ∈ R, {x ∈ R | P (X ≤ x) = 0} is not all of R. As x 7→ P (X ≤ x)
is increasing, this implies that {x ∈ R | P (X ≤ x) = 0} is bounded from above, so c is not
infinity. Thus, c is finite.
Now, by definition, c is the least upper bound of the set {x ∈ R | P (X ≤ x) = 0}. Therefore,
any number strictly smaller than c is not an upper bound. From this we conclude that for
n ≥ 1, there is cn with c − 1/n < cn such that P (X ≤ cn ) = 0. Therefore, we must also have
P (X ≤ c − 1/n) ≤ P (X ≤ cn ) = 0, and so P (X < c) = lim_{n→∞} P (X ≤ c − 1/n) = 0. On the
other hand, as c is an upper bound for the set {x ∈ R | P (X ≤ x) = 0}, it holds for any
ε > 0 that P (X ≤ c + ε) ≠ 0, yielding that for all ε > 0, P (X ≤ c + ε) = 1. Therefore,
P (X ≤ c) = lim_{n→∞} P (X ≤ c + 1/n) = 1. All in all, we conclude P (X = c) = 1, so X is
constant almost surely. This proves the first implication of the theorem.
Next, assume that every invariant random variable is constant almost surely; we wish to
prove that T is ergodic. Let F ∈ IT ; we have to show that P (F ) is either zero or one. Note
that 1F is IT measurable and so invariant by Lemma 2.1.5. Therefore, by our assumption,
1F is almost surely constant, and this implies that P (F ) is either zero or one. This proves
the other implication and so concludes the proof.
Theorem 2.2.2 is occasionally useful if the T -invariant random variables are easy to charac-
terize. The following theorem shows a different avenue for proving ergodicity based on a sort
of asymptotic independence criterion.
Theorem 2.2.3. Let T be P -measure preserving. T is ergodic if and only if it holds for all
F, G ∈ F that lim_{n→∞} (1/n) ∑_{k=1}^n P (F ∩ T^{−(k−1)} (G)) = P (F )P (G).
Proof. First assume that T is ergodic. Fix F, G ∈ F. Applying Theorem 2.1.7 with the integrable
variable 1_G, and noting that 1_G ◦ T^{k−1} = 1_{T^{−(k−1)} (G)}, we find that (1/n) ∑_{k=1}^n 1_{T^{−(k−1)} (G)}
converges almost surely to P (G). Therefore, (1/n) ∑_{k=1}^n 1_F 1_{T^{−(k−1)} (G)} converges almost surely
to 1_F P (G). As these averages are bounded by one, the dominated convergence theorem
yields

lim_{n→∞} (1/n) ∑_{k=1}^n P (F ∩ T^{−(k−1)} (G)) = lim_{n→∞} E (1/n) ∑_{k=1}^n 1_F 1_{T^{−(k−1)} (G)}
= E lim_{n→∞} (1/n) ∑_{k=1}^n 1_F 1_{T^{−(k−1)} (G)} = E1_F P (G) = P (F )P (G),
proving the first implication. Next, we consider the other implication. Assume that for all
F, G ∈ F, lim_{n→∞} (1/n) ∑_{k=1}^n P (F ∩ T^{−(k−1)} (G)) = P (F )P (G). We wish to show that T is
ergodic. Let F ∈ IT ; we then obtain
P (F ) = lim_{n→∞} (1/n) ∑_{k=1}^n P (F ∩ T^{−(k−1)} (F )) = P (F )^2,

so that P (F ) = P (F )^2, yielding that P (F ) is either zero or one. Thus, T is ergodic, completing the proof.
Lemma 2.2.6. Let T be measure preserving, and let H be a generating family for F which
is stable under finite intersections. Assume that one of the following holds:

(1). lim_{n→∞} (1/n) ∑_{k=1}^n P (F ∩ T^{−(k−1)} (G)) = P (F )P (G) for all F, G ∈ H.

(2). lim_{n→∞} (1/n) ∑_{k=1}^n |P (F ∩ T^{−(k−1)} (G)) − P (F )P (G)| = 0 for all F, G ∈ H.

(3). lim_{n→∞} P (F ∩ T^{−n} (G)) = P (F )P (G) for all F, G ∈ H.

Then the corresponding statement holds for all F, G ∈ F.
Proof. The proofs for the three cases are similar, so we only argue that the third claim holds.
Fix F ∈ H and define
D = {G ∈ F | lim_{n→∞} P (F ∩ T^{−n} (G)) = P (F )P (G)}.
We wish to argue that D is a Dynkin class. To this end, note that since T −1 (Ω) = Ω, it holds
that Ω ∈ D. Take A, B ∈ D with A ⊆ B. We then also have T^{−n} (A) ⊆ T^{−n} (B), yielding
that B \ A ∈ D as well, since P (F ∩ T^{−n} (B \ A)) = P (F ∩ T^{−n} (B)) − P (F ∩ T^{−n} (A)),
which converges to P (F )P (B) − P (F )P (A) = P (F )P (B \ A). Finally, let (Ai ) be an
increasing sequence in D with union A, and let ε > 0. Choosing i such that P (A \ Ai ) ≤ ε,
we obtain

0 ≤ P (F ∩ T^{−n} (A)) − P (F ∩ T^{−n} (Ai ))
= P (F ∩ (T^{−n} (A) \ T^{−n} (Ai ))) = P (F ∩ T^{−n} (A \ Ai ))
≤ P (T^{−n} (A \ Ai )) = P (A \ Ai ) ≤ ε,

so that

lim sup_{n→∞} P (F ∩ T^{−n} (A)) ≤ lim_{n→∞} P (F ∩ T^{−n} (Ai )) + ε = P (F )P (Ai ) + ε. (2.4)

As P (Ai ) ≤ P (A), (2.4) yields lim sup_{n→∞} P (F ∩ T^{−n} (A)) ≤ P (F )P (A) + ε. (2.5)
Combining Theorem 2.2.5 and Lemma 2.2.6, we find that in order to show ergodicity of T ,
it suffices to show that T is mixing or weakly mixing for events in a generating system for F
which is stable under finite intersections. This is in several cases a viable method for proving
ergodicity.
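The averaging criterion of Theorem 2.2.3 can be illustrated by Monte Carlo for the shift on sequences of iid coin flips, a standard measure preserving and ergodic example (this setup and all names below are ours, not the book's). With F = G = (x1 = 1), we have T^{−(k−1)} (G) = (xk = 1), so the Cesàro averages of P (F ∩ T^{−(k−1)} (G)) should approach P (F )P (G) = p².

```python
import random

random.seed(1)
p = 0.5        # success probability of each coin flip (our choice)
N = 200_000    # number of simulated sequences
L = 30         # number of Cesaro terms

# For the shift on iid Bernoulli(p) sequences, take F = G = (x_1 = 1),
# so that the k'th preimage event is (x_k = 1).  Each probability
# P(F and x_k = 1) is estimated by Monte Carlo, then Cesaro-averaged.
counts = [0] * L
hits_F = 0
for _ in range(N):
    x = [1 if random.random() < p else 0 for _ in range(L)]
    if x[0] == 1:
        hits_F += 1
        for k in range(L):
            counts[k] += x[k]

cesaro = sum(c / N for c in counts) / L
print(cesaro, (hits_F / N) ** 2)  # both near p**2 = 0.25
```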
We will now apply the results from Section 2.1 and Section 2.2 to obtain laws of large numbers
for the class of processes known as stationary processes. In order to do so, we first need to
investigate in what sense we can consider the simultaneous distribution of an entire process
(Xn ). Once we have done so, we will be able to obtain our main results by applying the
ergodic theorem to this simultaneous distribution.
The results require some formalism. By Rn for n ≥ 1, we denote the n-fold product of R, the
set of n-tuples with elements from R. Analogously, we define R∞ as the set of all sequences
of real numbers, in the sense that R∞ = {(xn )n≥1 | xn ∈ R for all n ≥ 1}. Recall that
the Borel σ-algebra on Rn , defined as the smallest σ-algebra containing all open sets, also
is given as the smallest σ-algebra making all coordinate projections measurable. In analogy
with this, we make the following definition of the Borel σ-algebra on R∞ . By X̂n : R∞ → R,
we denote the n’th coordinate projection of R∞ , X̂n (x) = xn , where x = (xn )n≥1 .

Definition 2.3.1. By B∞ , we denote the smallest σ-algebra on R∞ such that for all n ≥ 1,
the coordinate projection X̂n is measurable.
In detail, Definition 2.3.1 states the following. Let A be the family of all σ-algebras G on R∞
such that for all n ≥ 1, X̂n is G-B measurable. B∞ is then the smallest σ-algebra in the set
A of σ-algebras, explicitly constructed as B∞ = ∩G∈A G.
In the following lemmas, we prove some basic results on the measure space (R∞ , B∞ ). In
Lemma 2.3.2, a generating family which is stable under finite intersections is identified, and
in Lemma 2.3.3, the mappings which are measurable with respect to B∞ are identified. In
Lemma 2.3.4, we show how we can apply B∞ to describe and work with stochastic processes.
Lemma 2.3.2. Let K be a generating family for B which is stable under finite intersec-
tions. Define H as the family of sets {x ∈ R∞ | x1 ∈ B1 , . . . , xn ∈ Bn }, where n ≥ 1 and
B1 ∈ K, . . . , Bn ∈ K. H is then a generating family for B∞ which is stable under finite
intersections.
Proof. It is immediate that H is stable under finite intersections. Note that if F is a set such
that F = {x ∈ R∞ | x1 ∈ B1 , . . . , xn ∈ Bn } for some n ≥ 1 and B1 ∈ K, . . . , Bn ∈ K, we also
have
F = {x ∈ R∞ | x1 ∈ B1 , . . . , xn ∈ Bn }
= {x ∈ R∞ | X̂1 (x) ∈ B1 , . . . , X̂n (x) ∈ Bn }
= {x ∈ R∞ | x ∈ X̂_1^{−1} (B1 ), . . . , x ∈ X̂_n^{−1} (Bn )}
= ∩_{k=1}^n X̂_k^{−1} (Bk ).
Proof. As X̂n ◦ X = Xn and Xn is F-B measurable by assumption, the result follows from
Lemma 2.3.3.
Letting (Xn )n≥1 be a stochastic process, Lemma 2.3.4 shows that with X : Ω → R∞ defined
by X(ω) = (Xn (ω))n≥1 , X is F-B∞ measurable, and therefore, the image measure X(P )
is well-defined. This motivates the following definition of the distribution of a stochastic
process.
Definition 2.3.5. Let (Xn )n≥1 be a stochastic process. The distribution of (Xn )n≥1 is
the probability measure X(P ) on B∞ .
Utilizing the above definitions and results, we can now state our plan for the main results
to be shown later in this section. Recall that one of our goals for this section is to prove an
extension of the law of large numbers. The method we will apply is the following. Consider
a stochastic process (Xn ). The introduction of the infinite-dimensional Borel-σ-algebra and
the measurability result in Lemma 2.3.4 have allowed us in Definition 2.3.5 to introduce the
concept of the distribution of a process. In particular, we have at our disposal a probability
space (R∞ , B∞ , X(P )). If we can identify a suitable transformation T : R∞ → R∞ such that
T is measure preserving and ergodic for X(P ), we will be able to apply Theorem 2.1.7 to
obtain a type of law of large numbers with X(P ) almost sure convergence and convergence in
Lp (R∞ , B∞ , X(P )). If we afterwards succeed in transferring the results from the probability
space (R∞ , B∞ , X(P )) back to the probability space (Ω, F, P ), we will have achieved our
goal.
Lemma 2.3.6. Let (Xn ) be a stochastic process. Define X : Ω → R∞ by X(ω) = (Xn (ω))n≥1 .
The image measure X(P ) is the unique probability measure on B∞ such that for all n ≥ 1
and B1 , . . . , Bn ∈ B,

X(P )(∩_{k=1}^n X̂_k^{−1} (Bk )) = P (X1 ∈ B1 , . . . , Xn ∈ Bn ). (2.6)

Proof. Uniqueness follows from Lemma 2.3.2 and the uniqueness theorem for probability
measures. It remains to show that X(P ) satisfies (2.6). To this end, we note that

X(P )(∩_{k=1}^n X̂_k^{−1} (Bk )) = P (X^{−1} (∩_{k=1}^n X̂_k^{−1} (Bk ))) = P (∩_{k=1}^n X^{−1} (X̂_k^{−1} (Bk )))
= P (∩_{k=1}^n (X̂k ◦ X)^{−1} (Bk )) = P (∩_{k=1}^n (Xk ∈ Bk ))
= P (X1 ∈ B1 , . . . , Xn ∈ Bn ).
Lemma 2.3.6 may appear rather abstract at first glance. A clearer statement may be
obtained by noting that ∩_{k=1}^n X̂_k^{−1} (Bk ) = B1 × · · · × Bn × R∞ . The lemma then states that
X(P ) is the only probability measure on B∞ assigning to each “finite-dimensional rectangle”
of the form B1 × · · · × Bn × R∞ the measure P (X1 ∈ B1 , . . . , Xn ∈ Bn ), a property
reminiscent of the characterizing feature of the distribution of an ordinary finite-dimensional
random variable.
Using the above, we may now formalize the notion of a stationary process. First, we define
θ : R∞ → R∞ by putting θ((xn )n≥1 ) = (xn+1 )n≥1 . We refer to θ as the shift operator.
Note that by Lemma 2.3.3, θ is B∞ -B∞ measurable. The mapping θ will play the role of the
measure preserving and ergodic transformation in our later use of Theorem 2.1.7.
Definition 2.3.7. Let (Xn ) be a stochastic process. We say that (Xn ) is a stationary process,
or simply stationary, if it holds that θ is measure preserving for the distribution of (Xn ). We
say that a stationary process is ergodic if θ is ergodic for the distribution of (Xn ).
According to Definition 2.3.7, the property of being stationary is related to the measure
preservation property of the mapping θ with respect to the measure X(P ) on B∞ , and
the property of being ergodic is related to the invariant σ-algebra of θ, which is a sub-σ-
algebra of B∞ . It is these conceptions of stationarity and ergodicity we will be using when
formulating our laws of large numbers. However, for practical use, it is convenient to be able
to express stationarity and ergodicity in terms of the probability space (Ω, F, P ) instead of
(R∞ , B∞ , X(P )). The following results will allow us to do so.
Lemma 2.3.8. Let (Xn ) be a stochastic process. The following are equivalent.
(1). (Xn )n≥1 is stationary.
(2). (Xn )n≥1 and (Xn+1 )n≥1 have the same distribution.
(3). For all k ≥ 1, (Xn )n≥1 and (Xn+k )n≥1 have the same distribution.
Proof. We first prove that (1) implies (3). Assume that (Xn ) is stationary and fix k ≥ 1.
Define a process Y by setting Y = (Xn+k )n≥1 ; we then also have Y = θ^k ◦ X. As (Xn ) is
stationary, θ is X(P )-measure preserving. By an application of Theorem A.2.13, this yields
Y (P ) = (θ^k ◦ X)(P ) = θ^k (X(P )) = X(P ), showing that (Xn )n≥1 and (Xn+k )n≥1 have the
same distribution, and so proving that (1) implies (3).
As it is immediate that (3) implies (2), we find that in order to complete the proof, it suffices
to show that (2) implies (1). Therefore, assume that (Xn )n≥1 and (Xn+1 )n≥1 have the same
distribution, meaning that X(P ) and Y (P ) are equal, where Y = (Xn+1 )n≥1 . We then obtain
θ(X(P )) = (θ ◦ X)(P ) = Y (P ) = X(P ), so θ is X(P )-measure preserving. This proves that
(2) implies (1), as desired.
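Lemma 2.3.8 also lends itself to a Monte Carlo sanity check. The sketch below is our own construction, not from the text: it simulates a two-state Markov chain started in its stationary distribution, a standard example of a stationary process, and compares the empirical laws of (X1, X2) and (X2, X3), which stationarity predicts to agree.

```python
import random

random.seed(2)
a, b = 0.3, 0.2      # transition probabilities 0 -> 1 and 1 -> 0 (our choice)
pi1 = a / (a + b)    # stationary probability of state 1

def sample_path(n):
    """Simulate n steps of the chain started in its stationary distribution."""
    x = [1 if random.random() < pi1 else 0]
    for _ in range(n - 1):
        if x[-1] == 0:
            x.append(1 if random.random() < a else 0)
        else:
            x.append(0 if random.random() < b else 1)
    return x

# Compare the empirical laws of (X_1, X_2) and (X_2, X_3) over many paths.
N = 100_000
c12 = c23 = 0
for _ in range(N):
    x = sample_path(3)
    c12 += x[0] == 1 and x[1] == 1
    c23 += x[1] == 1 and x[2] == 1

print(c12 / N, c23 / N)  # approximately equal
```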
Lemma 2.3.9. Let (Xn ) be a stationary stochastic process. For all k ≥ 1 and n ≥ 1,
(X1 , . . . , Xn ) has the same distribution as (X1+k , . . . , Xn+k ).
Proof. Fix k ≥ 1 and n ≥ 1. Let Y = (Xi+k )i≥1 . By Lemma 2.3.8, it holds that (Xn )n≥1
and (Yn )n≥1 have the same distribution. Let ϕ : R∞ → Rn denote the projection onto the
first n coordinates of R∞ . Using Theorem A.2.13, we then obtain

(X1+k , . . . , Xn+k )(P ) = (ϕ ◦ Y )(P ) = ϕ(Y (P )) = ϕ(X(P )) = (ϕ ◦ X)(P ) = (X1 , . . . , Xn )(P ),

proving that (X1 , . . . , Xn ) has the same distribution as (X1+k , . . . , Xn+k ), as was to be
shown.
Definition 2.3.10. Let (Xn ) be a stationary process. The invariant σ-algebra I(X) for the
process is defined by I(X) = {X −1 (B) | B ∈ B∞ , B is invariant for θ}.
Lemma 2.3.11. Let (Xn ) be a stationary process. (Xn ) is ergodic if and only if it holds that
for all F ∈ I(X), P (F ) is either zero or one.
Proof. First assume that (Xn ) is ergodic, meaning that θ is ergodic for X(P ). This means
that with Iθ denoting the invariant σ-algebra for θ on B∞ , X(P )(B) is either zero or one for
all B ∈ Iθ . Now let F ∈ I(X); we then have F = (X ∈ B) for some B ∈ Iθ , so we obtain
P (F ) = P (X ∈ B) = X(P )(B), which is either zero or one. This proves the first implication.
Next, assume that for all F ∈ I(X), P (F ) is either zero or one. We wish to show that (Xn )
is ergodic. Let B ∈ Iθ . We then obtain X(P )(B) = P (X −1 (B)), which is either zero or one
as X −1 (B) ∈ I(X). Thus, (Xn ) is ergodic.
Lemma 2.3.8 and Lemma 2.3.11 show how to reformulate the definitions in Definition 2.3.7
more concretely in terms of the probability space (Ω, F, P ) and the process (Xn ). We are now
ready to use the ergodic theorem to obtain a law of large numbers for stationary processes.
Theorem 2.3.12 (Ergodic theorem for ergodic stationary processes). Let (Xn ) be an ergodic
stationary process, and let f : R∞ → R be some B∞ -B measurable mapping. If f ((Xn )n≥1 )
has p’th moment, then (1/n) ∑_{k=1}^n f ((Xi )i≥k ) converges almost surely and in Lp to Ef ((Xi )i≥1 ).
Proof. We first investigate what may be obtained by using the ordinary ergodic theorem of
Theorem 2.1.7. Let P̂ = X(P ), the distribution of (Xn ). By our assumptions, θ is P̂ -measure
preserving and ergodic. Also, f is a random variable on the probability space (R∞ , B∞ , P̂ ),
and
∫ |f |^p dP̂ = ∫ |f |^p dX(P ) = ∫ |f ◦ X|^p dP = E|f (X)|^p = E|f ((Xn )n≥1 )|^p ,

which is finite by assumption. Applying Theorem 2.1.7, we conclude that with µ = ∫ f dP̂ =
Ef ((Xi )i≥1 ), (1/n) ∑_{k=1}^n f ◦ θ^{k−1} converges to µ,
in the sense of P̂ almost sure convergence and convergence in Lp (R∞ , B∞ , P̂ ). These are limit
results on the probability space (R∞ , B∞ , P̂ ). We would like to transfer these results to our
original probability space (Ω, F, P ). We first consider the case of almost sure convergence.
We wish to argue that (1/n) ∑_{k=1}^n f ((Xi )i≥k ) converges P -almost surely to µ. To do so, first
note that
((1/n) ∑_{k=1}^n f ((Xi )i≥k ) converges to µ) = {ω ∈ Ω | lim_{n→∞} (1/n) ∑_{k=1}^n f ((Xi )i≥k )(ω) = µ}
= {ω ∈ Ω | lim_{n→∞} (1/n) ∑_{k=1}^n f ((Xi (ω))i≥k ) = µ}
= {ω ∈ Ω | lim_{n→∞} (1/n) ∑_{k=1}^n f (θ^{k−1} (X(ω))) = µ} = X^{−1} (A),

where A = {x ∈ R∞ | lim_{n→∞} (1/n) ∑_{k=1}^n f (θ^{k−1} (x)) = µ},
or, with our usual probabilistic notation, A = (lim_{n→∞} (1/n) ∑_{k=1}^n f ◦ θ^{k−1} = µ). Therefore, we
obtain
P ((1/n) ∑_{k=1}^n f ((Xi )i≥k ) converges to µ) = P (X^{−1} (A)) = X(P )(A)
= P̂ (A) = P̂ (lim_{n→∞} (1/n) ∑_{k=1}^n f ◦ θ^{k−1} = µ),

and the latter is equal to one by the P̂ -almost sure convergence of (1/n) ∑_{k=1}^n f ◦ θ^{k−1} to µ. This
proves P -almost sure convergence of (1/n) ∑_{k=1}^n f ((Xi )i≥k ) to µ. Next, we consider convergence
in Lp . Here, we need lim_{n→∞} E|µ − (1/n) ∑_{k=1}^n f ((Xi )i≥k )|^p = 0. To obtain this, we note that
|µ − (1/n) ∑_{k=1}^n f ((Xi )i≥k )|^p (ω) = |µ − (1/n) ∑_{k=1}^n f ((Xi (ω))i≥k )|^p
= |µ − (1/n) ∑_{k=1}^n f (θ^{k−1} (X(ω)))|^p
= (|µ − (1/n) ∑_{k=1}^n f ◦ θ^{k−1} |^p )(X(ω)),

yielding

|µ − (1/n) ∑_{k=1}^n f ((Xi )i≥k )|^p = |µ − (1/n) ∑_{k=1}^n f ◦ θ^{k−1} |^p ◦ X,

so that E|µ − (1/n) ∑_{k=1}^n f ((Xi )i≥k )|^p = ∫ |µ − (1/n) ∑_{k=1}^n f ◦ θ^{k−1} |^p dP̂ , which tends to zero
by the convergence in Lp (R∞ , B∞ , P̂ ) already obtained. This completes the proof.
Theorem 2.3.12 is the main theorem of this section. As the following corollary shows, a
simpler version of the theorem is obtained by applying the theorem to a particular type of
function from R∞ to R.
Corollary 2.3.13. Let (Xn ) be an ergodic stationary process, and let f : R → R be some
Borel measurable mapping. If f (X1 ) has p’th moment, then (1/n) ∑_{k=1}^n f (Xk ) converges almost surely
and in Lp to Ef (X1 ).
Theorem 2.3.12 and Corollary 2.3.13 yield powerful convergence results for stationary and
ergodic processes. Next, we show that our results contain the strong law of large numbers
for independent and identically distributed variables as a special case. In addition, we also
obtain Lp convergence of the empirical means. To show this result, we need to prove that
sequences of independent and identically distributed variables are stationary and ergodic.
Corollary 2.3.14. Let (Xn ) be a sequence of independent, identically distributed variables.
Then (Xn ) is stationary and ergodic. Assume furthermore that Xn has p’th moment for some
p ≥ 1, and let µ be the common mean. Then (1/n) ∑_{k=1}^n Xk converges to µ almost surely and
in Lp .
Proof. We first show that (Xn ) is stationary. Let ν denote the common distribution of the
Xn . Let X = (Xn )n≥1 and Y = (Xn+1 )n≥1 . Fix n ≥ 1 and B1 , . . . , Bn ∈ B; we then obtain

Y (P )(∩_{k=1}^n X̂_k^{−1} (Bk )) = P (X2 ∈ B1 , . . . , Xn+1 ∈ Bn ) = ∏_{i=1}^n ν(Bi )
= P (X1 ∈ B1 , . . . , Xn ∈ Bn ) = X(P )(∩_{k=1}^n X̂_k^{−1} (Bk )),
so by Lemma 2.3.2 and the uniqueness theorem for probability measures, we conclude that
Y (P ) = X(P ), and thus (Xn ) is stationary. Next, we show that (Xn ) is ergodic. Let I(X)
denote the invariant σ-algebra for (Xn ), and let J denote the tail-σ-algebra for (Xn ), see
Definition 1.3.9. Let F ∈ I(X); we then have F = (X ∈ B) for some B ∈ Iθ , where Iθ is
the invariant σ-algebra on R∞ for the shift operator. Therefore, for any n ≥ 1, we obtain
F = (X ∈ θ^{−n} (B)) = ((Xk+n )k≥1 ∈ B), so F ∈ σ(Xn+1 , Xn+2 , . . .). As n was arbitrary,
F ∈ J , so P (F ) is either zero or one by Kolmogorov’s zero-one law, and (Xn ) is ergodic.
The convergence statements now follow from Corollary 2.3.13 with f the identity mapping.
In order to apply Theorem 2.3.12 and Corollary 2.3.13 in general, we need results on how
to prove stationarity and ergodicity. As the final theme of this section, we show two such
results.
Lemma 2.3.15. Let (Xn ) be stationary. Assume that for all m, p ≥ 1, A1 , . . . , Am ∈ B and
B1 , . . . , Bp ∈ B, one of the following holds:

(1). With F = ∩_{i=1}^m (Xi ∈ Ai ) and Gk = ∩_{i=1}^p (Xi+k−1 ∈ Bi ) for k ≥ 1, it holds that
lim_{n→∞} (1/n) ∑_{k=1}^n P (F ∩ Gk ) = P (F )P (G1 ).

(2). With F = ∩_{i=1}^m (Xi ∈ Ai ) and Gk = ∩_{i=1}^p (Xi+k−1 ∈ Bi ) for k ≥ 1, it holds that
lim_{n→∞} (1/n) ∑_{k=1}^n |P (F ∩ Gk ) − P (F )P (G1 )| = 0.

(3). With F = ∩_{i=1}^m (Xi ∈ Ai ) and Gn = ∩_{i=1}^p (Xi+n ∈ Bi ) for n ≥ 1, it holds that
lim_{n→∞} P (F ∩ Gn ) = P (F )P (G1 ).

Then (Xn ) is ergodic.
Proof. We only prove the result in the case where the third convergence holds, as the other
two cases follow similarly. Therefore, assume that the third criterion holds, such that for all
m, p ≥ 1, A1 , . . . , Am ∈ B and B1 , . . . , Bp ∈ B, it holds that

lim_{n→∞} P (∩_{i=1}^m (Xi ∈ Ai ) ∩ ∩_{i=1}^p (Xi+n ∈ Bi )) = P (∩_{i=1}^m (Xi ∈ Ai ))P (∩_{i=1}^p (Xi ∈ Bi )). (2.8)
We wish to show that (Xn ) is ergodic. Recall from Definition 2.3.7 that since (Xn ) is
stationary, θ is measure preserving for P̂ , where P̂ = X(P ). Also recall from Definition
2.3.7 that in order to show that (Xn ) is ergodic, we must show that θ is ergodic for P̂ . We
will apply Lemma 2.2.6 and Theorem 2.2.5 to the probability space (R∞ , B∞ , P̂ ) and the
transformation θ. Note that as θ is measure preserving for P̂ , Lemma 2.2.6 and Theorem
2.2.5 are applicable.
However, for any F, G ∈ H, there are m, p ≥ 1 such that F = ∩_{i=1}^m X̂_i^{−1} (Ai ) and
G = ∩_{i=1}^p X̂_i^{−1} (Bi ), and so

P̂ (F ∩ θ^{−n} (G)) = P (∩_{i=1}^m (Xi ∈ Ai ) ∩ ∩_{i=1}^p (Xi+n ∈ Bi )),

and similarly, we obtain P̂ (F ) = P (∩_{i=1}^m (Xi ∈ Ai )) and P̂ (G) = P (∩_{i=1}^p (Xi ∈ Bi )). Thus,
for F, G ∈ H with F = ∩_{i=1}^m X̂_i^{−1} (Ai ) and G = ∩_{i=1}^p X̂_i^{−1} (Bi ), (2.9) is equivalent to (2.8). As
we have assumed that (2.8) holds for all m, p ≥ 1, A1 , . . . , Am ∈ B and B1 , . . . , Bp ∈ B, we
conclude that (2.9) holds for all F, G ∈ H. Lemma 2.2.6 then allows us to conclude that (2.9)
holds for all F, G ∈ B∞ , and Theorem 2.2.5 then allows us to conclude that θ is ergodic for
P̂ , so that (Xn ) is ergodic, as desired.
Proof. We first derive a formal expression for the sequence (Yn ) in terms of (Xn ). Define a
mapping Φ : R∞ → R∞ by putting, for k ≥ 1, Φ((xi )i≥1 )k = ϕ((xi )i≥k ). Equivalently, we
also have Φ((xi )i≥1 )k = (ϕ ◦ θ^{k−1} )((xi )i≥1 ). As θ is B∞ -B∞ measurable by Lemma 2.3.3
and ϕ is B∞ -B measurable, Φ has B∞ measurable coordinates, and so is B∞ -B∞ measurable,
again by Lemma 2.3.3. And we have (Yn ) = Φ((Xn )n≥1 ).
Now assume that (Xn ) is stationary. Let P̂ be the distribution of (Xn ), and let Q̂ be the
distribution of (Yn ). By Definition 2.3.7, our assumption that (Xn ) is stationary means that
θ is measure preserving for P̂ , and in order to show that (Yn ) is stationary, we must show
that θ is measure preserving for Q̂. To do so, we note that for all k ≥ 1, it holds that
θ(Φ((xi )i≥1 ))k = Φ((xi )i≥1 )k+1 = ϕ(θ^k ((xi )i≥1 )) = ϕ(θ^{k−1} (θ((xi )i≥1 ))) = Φ(θ((xi )i≥1 ))k ,

so that θ ◦ Φ = Φ ◦ θ. As Q̂ = Φ(P̂ ), this yields θ(Q̂) = θ(Φ(P̂ )) = Φ(θ(P̂ )) = Φ(P̂ ) = Q̂,
proving that θ also is measure preserving for Q̂, so (Yn ) is stationary. Next, assume that
(Xn ) is ergodic. By Definition 2.3.7, this means that all elements of the invariant σ-algebra
Iθ of θ have P̂ measure zero or one. We wish to show that (Yn ) is ergodic, which means
that we need to show that all elements of Iθ have Q̂ measure zero or one. Let A ∈ Iθ , such
that θ−1 (A) = A. We then have Q̂(A) = P̂ (Φ−1 (A)), so it suffices to show that Φ−1 (A) is
invariant for θ, and this follows as
θ−1 (Φ−1 (A)) = (Φ ◦ θ)−1 (A) = (θ ◦ Φ)−1 (A) = Φ−1 (θ−1 (A)) = Φ−1 (A).
Thus, Φ−1 (A) is invariant for θ. As θ is ergodic for P̂ , P̂ (Φ−1 (A)) is either zero or one, and
so Q̂(A) is either zero or one. Therefore, θ is ergodic for Q̂. This shows that (Yn ) is ergodic,
as desired.
We end the section with an example showing how to apply the ergodic theorem to obtain
limit results for empirical averages for a practical case of a process consisting of variables
which are not independent.
that EX1 X2 = p^2 and that X1 X2 has moments of all orders, Theorem 2.3.12 shows that
(1/n) ∑_{k=1}^n Xk Xk+1 converges to p^2 almost surely and in Lp for all p ≥ 1. ◦
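The full setup of the example is truncated in this copy. As a stand-in with the same limiting behaviour (our own sketch, not the book's process), the code below uses an iid Bernoulli(p) sequence, which is stationary and ergodic by Corollary 2.3.14 and satisfies EX1 X2 = p² by independence, and checks the convergence of the empirical averages of Xk Xk+1.

```python
import random

random.seed(3)
p = 0.7
n = 200_000

# Stand-in stationary and ergodic process: iid Bernoulli(p) variables.
x = [1 if random.random() < p else 0 for _ in range(n + 1)]

# Empirical average of X_k * X_{k+1}; the ergodic theorem predicts
# convergence to E[X_1 X_2] = p**2 = 0.49 here.
avg = sum(x[k] * x[k + 1] for k in range(n)) / n
print(avg)  # near 0.49
```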
2.4 Exercises
Exercise 2.1. Consider the probability space ([0, 1), B[0,1) , P ) where P is the Lebesgue
measure. Define T (x) = 2x − [2x] and S(x) = x + λ − [x + λ], λ ∈ R. Here, [x] is the unique
integer satisfying [x] ≤ x < [x] + 1. Show that T and S are P -measure preserving. ◦
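Exercise 2.1 can be sanity-checked numerically before proving it (our own sketch; bin count, shift and sample size are arbitrary choices): if M is measure preserving for the Lebesgue measure on [0, 1), then M (U ) should again look uniform when U is uniform. This only checks the pushforward under one application of the map and is no substitute for the proof.

```python
import random

random.seed(4)
lam = 0.123456  # an arbitrary shift for S

def T(x):
    return (2 * x) % 1.0

def S(x):
    return (x + lam) % 1.0

def bin_masses(M, N=200_000, bins=10):
    """Empirical mass of each of `bins` equal subintervals under M(U)."""
    counts = [0] * bins
    for _ in range(N):
        counts[int(M(random.random()) * bins)] += 1
    return [c / N for c in counts]

# If M preserves Lebesgue measure, each bin should receive mass near 1/10.
print([round(m, 3) for m in bin_masses(T)])
print([round(m, 3) for m in bin_masses(S)])
```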
Exercise 2.2. Define T : [0, 1) → [0, 1) by letting T (x) = 1/x − [1/x] for x > 0 and zero
otherwise. Show that T is Borel measurable. Define P as the nonnegative measure on
([0, 1), B[0,1) ) with density t 7→ (1/log 2)(1/(1 + t)) with respect to the Lebesgue measure. Show that P
is a probability measure, and show that T is measure preserving for P . ◦
Exercise 2.3. Define T : [0, 1] → [0, 1] by putting T (x) = (1/2)x for x > 0 and one other-
wise. Show that there is no probability measure P on ([0, 1], B[0,1] ) such that T is measure
preserving for P . ◦
Exercise 2.4. Consider the probability space ([0, 1), B[0,1) , P ) where P is the Lebesgue
measure. Define T : [0, 1) → [0, 1) by T (x) = x + λ − [x + λ]. T is then P -measure preserving.
Show that if λ is rational, T is not ergodic. ◦
Exercise 2.5. Let (Ω, F, P ) be a probability space and let T be measure preserving. Let
X be an integrable random variable and assume that X ◦ T ≤ X almost surely. Show that
X = X ◦ T almost surely. ◦
Exercise 2.7. Give an example of a probability space (Ω, F, P ) and a measurable mapping
T : Ω → Ω such that T 2 is measure preserving but T is not measure preserving. ◦
Exercise 2.8. Let (Ω, F, P ) be a probability space and let T be measurable and measure
preserving. We may then think of T as a random variable with values in (Ω, F). Let F ∈ F.
Exercise 2.9. Let (Ω, F, P ) be a probability space and let T : Ω → Ω be measurable. As-
sume that T is measure preserving. Show that the mapping T is ergodic if and only if it
holds for all random variables X and Y such that X is integrable and Y is bounded that
lim_{n→∞} (1/n) ∑_{k=1}^n EY (X ◦ T^{k−1} ) = (EY )(EX). ◦
Exercise 2.10. Consider the probability space ([0, 1), B[0,1) , P ) where P is the Lebesgue
measure. Define T : [0, 1) → [0, 1) by T (x) = 2x − [2x]. T is then P -measure preserving.
Show that T is mixing. ◦
Exercise 2.11. Let (Ω1 , F1 , P1 ) and (Ω2 , F2 , P2 ) be two probability spaces. Consider two
measurable mappings T1 : Ω1 → Ω1 and T2 : Ω2 → Ω2 . Assume that T1 is P1 -measure
preserving and that T2 is P2 -measure preserving. Define a probability space (Ω, F, P ) by
putting (Ω, F, P ) = (Ω1 × Ω2 , F1 ⊗ F2 , P1 ⊗ P2 ). Define a mapping T : Ω → Ω by putting
T (ω1 , ω2 ) = (T1 (ω1 ), T2 (ω2 )).
(2). Let IT1 , IT2 and IT be the invariant σ-algebras for T1 , T2 and T . Show that the
inclusion IT1 ⊗ IT2 ⊆ IT holds.
(4). Argue that T is mixing if and only if both T1 and T2 are mixing.
◦
Exercise 2.12. Let (Xn ) be a stationary process. Fix B ∈ B. Show that (Xn ∈ B i.o.) is
in I(X). ◦
Exercise 2.13. Let (Xn ) and (Yn ) be two stationary processes. Let U be a random variable
concentrated on {0, 1} with P (U = 1) = p, and assume that U is independent of (Xn ) and
independent of (Yn ). Define Zn = Xn 1(U =0) + Yn 1(U =1) . Show that (Zn ) is stationary. ◦
Exercise 2.14. We say that a process (Xn ) is weakly stationary if it holds that Xn has
second moment for all n ≥ 1, EXn = EXk for all n, k ≥ 1 and Cov(Xn , Xk ) = γ(|n − k|) for
some γ : N0 → R. Now assume that (Xn ) is some process such that Xn has second moment
for all n ≥ 1. Show that if (Xn ) is stationary, (Xn ) is weakly stationary. ◦
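The covariance condition of Exercise 2.14 can be probed empirically. The following sketch (our own, with an MA(1)-type process as a hypothetical example) estimates the autocovariance function γ and checks that it depends only on the lag: for Xk = Zk + Zk+1 with Z iid standard normal, γ(0) = 2, γ(1) = 1 and γ(h) = 0 for h ≥ 2.

```python
import random

random.seed(5)
n = 200_000

# MA(1)-type process X_k = Z_k + Z_{k+1} with Z iid standard normal:
# stationary, with gamma(0) = 2, gamma(1) = 1, gamma(h) = 0 for h >= 2.
z = [random.gauss(0, 1) for _ in range(n + 1)]
x = [z[k] + z[k + 1] for k in range(n)]

mean = sum(x) / n

def gamma_hat(h):
    """Empirical autocovariance at lag h."""
    m = n - h
    return sum((x[k] - mean) * (x[k + h] - mean) for k in range(m)) / m

print([round(gamma_hat(h), 2) for h in range(4)])  # near [2, 1, 0, 0]
```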
Exercise 2.15. We say that a process (Xn ) is Gaussian if all of its finite-dimensional
distributions are Gaussian. Let (Xn ) be some Gaussian process. Show that (Xn ) is stationary
if and only if it is weakly stationary in the sense of Exercise 2.14. ◦
Chapter 3
Weak convergence
While our main results in Chapter 1 and Chapter 2 were centered around almost sure and Lp
convergence of (1/n) ∑_{k=1}^n Xk for various classes of processes (Xn ), the theory of weak conver-
gence covered in this chapter will instead allow us to understand the asymptotic distribution
of (1/n) ∑_{k=1}^n Xk , particularly through the combined results of Section 3.5 and Section 3.6.
After this, in Section 3.5, we prove several versions of the central limit theorem, which in its
simplest form states that under certain regularity conditions, the empirical mean (1/n) ∑_{k=1}^n Xk
of independent and identically distributed random variables can for large n be approximated
by a normal distribution with the same mean and variance as (1/n) ∑_{k=1}^n Xk . This is arguably
the main result of the chapter, and is a result which is of great significance in practical
statistics. In Section 3.6, we introduce the notion of asymptotic normality, which provides a
convenient framework for understanding and working with the results of Section 3.5. Finally,
in Section 3.7, we state without proof some multidimensional analogues of the results of the
previous sections.
Recall from Definition 1.2.2 that for a sequence of random variables (Xn ) and another random
variable X, we say that Xn converges in distribution to X and write Xn −D→ X when
lim_{n→∞} Ef (Xn ) = Ef (X) for all bounded, continuous mappings f : R → R. Our first
results of this section will show that convergence in distribution of random variables in a
certain sense is equivalent to a related mode of convergence for probability measures.
Definition 3.1.1. Let (µn ) be a sequence of probability measures on (R, B), and let µ be
another probability measure. We say that µn converges weakly to µ and write µn −wk→ µ if it
holds for all bounded, continuous mappings f : R → R that lim_{n→∞} ∫ f dµn = ∫ f dµ.
Lemma 3.1.2. Let (Xn ) be a sequence of random variables and let X be another random
variable. Let µ denote the distribution of X, and for n ≥ 1, let µn denote the distribution of
Xn . Then Xn −D→ X if and only if µn −wk→ µ.

Proof. We have Ef (Xn ) = ∫ f ◦ Xn dP = ∫ f dXn (P ) = ∫ f dµn , and by similar arguments,
Ef (X) = ∫ f dµ. From these observations, the result follows.
Lemma 3.1.2 clarifies that convergence in distribution of random variables is a mode of con-
vergence depending only on the marginal distributions of the variables involved. In particular,
we may investigate the properties of convergence in distribution of random variables by inves-
tigating the properties of weak convergence of probability measures on (R, B). Lemma 3.1.2
also allows us to make sense of convergence of random variables to a probability measure
in the following manner: We say that Xn converges in distribution to µ for a sequence of
random variables (Xn ) and a probability measure µ, and write Xn −D→ µ, if it holds that
µn −wk→ µ, where µn is the distribution of Xn .
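A concrete, non-random instance of Definition 3.1.1 (our illustration, not from the text): the uniform distributions µn on the point sets {0, 1/n, . . . , (n − 1)/n} converge weakly to the uniform distribution on [0, 1], since for bounded continuous f the integrals ∫ f dµn are Riemann sums converging to the integral of f over [0, 1].

```python
import math

# mu_n: the uniform distribution on the n points {0, 1/n, ..., (n-1)/n}.
# For bounded continuous f, the integral of f with respect to mu_n is a
# Riemann sum, so mu_n converges weakly to the uniform law on [0, 1].
def integral_mu_n(f, n):
    return sum(f(k / n) for k in range(n)) / n

f = lambda t: math.cos(t)   # a bounded, continuous test function
limit = math.sin(1.0)       # the integral of cos over [0, 1]

for n in (10, 100, 1000):
    print(n, abs(integral_mu_n(f, n) - limit))  # decreases towards 0
```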
The topic of weak convergence of probability measures in itself provides ample opportunities
for a rich mathematical theory. However, there is good reason for considering both conver-
gence in distribution of random variables and weak convergence of probability measures, in
spite of the apparent equivalence of the two concepts: Many results are formulated most
naturally in terms of random variables, particularly when transformations of the variables
are involved, and furthermore, expressing results in terms of convergence in distribution for
random variables often fits better with applications.
In the remainder of this section, we will prove some basic properties of weak convergence of
probability measures. Our first interest is to prove that weak limits of probability measures
are unique. By Cb (R), we denote the set of bounded, continuous mappings f : R → R, and
by Cbu (R), we denote the set of bounded, uniformly continuous mappings f : R → R. Note
that Cbu (R) ⊆ Cb (R).
Lemma 3.1.3. Assume given two intervals [a, b] ⊆ (c, d). There exists a function f ∈ Cbu (R)
such that 1_[a,b] (x) ≤ f (x) ≤ 1_(c,d) (x) for all x ∈ R.

Proof. As [a, b] ⊆ (c, d), we have that a ≤ x ≤ b implies c < x < d. In particular, c < a and
b < d. Then, to obtain the desired mapping, we simply define

f (x) = 0 for x ≤ c, f (x) = (x − c)/(a − c) for c < x < a, f (x) = 1 for a ≤ x ≤ b,
f (x) = (d − x)/(d − b) for b < x < d, and f (x) = 0 for d ≤ x.

This f is piecewise linear, bounded and uniformly continuous, and satisfies 1_[a,b] ≤ f ≤ 1_(c,d) .
The mappings whose existence are proved in Lemma 3.1.3 are known as Urysohn functions,
and are also occasionally referred to as bump functions, although this latter name in general
is reserved for functions which have continuous derivatives of all orders. Existence results of
this type often serve to show that continuous functions can be used to approximate other
types of functions. Note that if [a, b] ⊆ (c, d) and 1[a,b](x) ≤ f(x) ≤ 1(c,d)(x) for all x ∈ R, it then holds for x ∈ [a, b] that 1 = 1[a,b](x) ≤ f(x) ≤ 1(c,d)(x) = 1, so that f(x) = 1. Likewise, for x ∉ (c, d), we have 0 = 1[a,b](x) ≤ f(x) ≤ 1(c,d)(x) = 0, so f(x) = 0.
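As an informal aid (not part of the formal development), the piecewise linear Urysohn function constructed in the proof of Lemma 3.1.3 can be sketched numerically; the snippet below uses the hypothetical choice [a, b] = [0, 1] ⊆ (c, d) = (−1, 2) and verifies the sandwich property 1[a,b] ≤ f ≤ 1(c,d) on a grid.

```python
import numpy as np

def urysohn(a, b, c, d):
    """Piecewise-linear Urysohn function for [a, b] contained in (c, d), as in Lemma 3.1.3."""
    assert c < a <= b < d
    def f(x):
        x = np.asarray(x, dtype=float)
        # Branches mirror the five cases of the proof's definition of f.
        return np.select(
            [x <= c, x < a, x <= b, x < d],
            [0.0, (x - c) / (a - c), 1.0, (d - x) / (d - b)],
            default=0.0,
        )
    return f

f = urysohn(a=0.0, b=1.0, c=-1.0, d=2.0)
xs = np.linspace(-3.0, 4.0, 1001)
ind_ab = ((xs >= 0.0) & (xs <= 1.0)).astype(float)   # 1_[a,b]
ind_cd = ((xs > -1.0) & (xs < 2.0)).astype(float)    # 1_(c,d)
assert np.all(ind_ab <= f(xs)) and np.all(f(xs) <= ind_cd)
```

The two assertions check exactly the sandwich inequality of the lemma on the chosen grid.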
In the following lemma, we apply Lemma 3.1.3 to show a useful criterion for two probability
measures to be equal, from which we will obtain as an immediate corollary the uniqueness
of limits for weak convergence of probability measures.
Lemma 3.1.4. Let µ and ν be two probability measures on (R, B). If ∫ f dµ = ∫ f dν for all f ∈ Cbu(R), then µ = ν.
Proof. Let µ and ν be probability measures on (R, B) and assume that ∫ f dµ = ∫ f dν for all f ∈ Cbu(R). By the uniqueness theorem for probability measures, we find that in order to prove that µ = ν, it suffices to show that µ((a, b)) = ν((a, b)) for all a < b. To this end, let a < b be given. Now pick n ∈ N so large that a + 1/n < b − 1/n. By Lemma 3.1.3, there then exists a mapping fn ∈ Cbu(R) such that 1[a+1/n,b−1/n] ≤ fn ≤ 1(a,b). By our assumptions, we then have ∫ fn dµ = ∫ fn dν. From the inequality 1[a+1/n,b−1/n] ≤ fn ≤ 1(a,b), we obtain ν([a + 1/n, b − 1/n]) ≤ ∫ fn dν = ∫ fn dµ ≤ µ((a, b)). Letting n tend to infinity and using continuity of ν from below, we get ν((a, b)) ≤ µ((a, b)), and by symmetry, µ((a, b)) ≤ ν((a, b)) as well. Thus µ((a, b)) = ν((a, b)), completing the proof.
Lemma 3.1.5. Let (µn) be a sequence of probability measures on (R, B), and let µ and ν be two other such probability measures. If µn →wk µ and µn →wk ν, then µ = ν.
Proof. For all f ∈ Cb(R), we obtain ∫ f dν = limn→∞ ∫ f dµn = ∫ f dµ. In particular, this holds for f ∈ Cbu(R). Therefore, by Lemma 3.1.4, it holds that ν = µ.
Lemma 3.1.5 shows that limits for weak convergence of probability measures are uniquely
determined. Note that this is not the case for convergence in distribution of variables. To
understand the issue, note that combining Lemma 3.1.2 and Lemma 3.1.5, we find that if
Xn →D X, then we also have Xn →D Y if and only if X and Y have the same distribution. Thus, for example, if Xn →D X, where X is normally distributed with mean zero, then Xn →D −X as well, since X and −X have the same distribution, even though it holds that P(X = −X) = P(X = 0) = 0.
In order to show weak convergence of µn to µ, we need to prove limn→∞ ∫ f dµn = ∫ f dµ for all f ∈ Cb(R). A natural question is whether it suffices to prove this limit result for a smaller class of mappings than Cb(R). We now show that it in fact suffices to consider elements of Cbu(R). For f : R → R bounded, we denote by ‖f‖∞ the uniform norm of f, meaning that ‖f‖∞ = supx∈R |f(x)|. Before obtaining the result, we prove the following useful lemma.
Sequences of probability measures satisfying the property (3.1) referred to in the lemma are
said to be tight.
Lemma 3.1.6. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure. If limn→∞ ∫ f dµn = ∫ f dµ for all f ∈ Cbu(R), it holds that for each ε > 0, there exists M > 0 such that

µn([−M, M]c) ≤ ε for all n ≥ 1. (3.1)
Proof. Fix ε > 0. We will argue that there is M > 0 such that µn ([−M, M ]c ) ≤ ε for n ≥ 1.
To this end, let M ∗ > 0 be so large that µ([−M ∗ /2, M ∗ /2]c ) < ε. By Lemma 3.1.3, we find
that there exists a mapping g ∈ Cbu (R) with 1[−M ∗ /2,M ∗ /2] (x) ≤ g(x) ≤ 1(−M ∗ ,M ∗ ) (x) for
x ∈ R. With f = 1 − g, we then also obtain 1(−M ∗ ,M ∗ )c (x) ≤ f (x) ≤ 1[−M ∗ /2,M ∗ /2]c (x). As
f ∈ Cbu (R) as well, this yields
lim supn→∞ µn([−M*, M*]c) ≤ lim supn→∞ ∫ 1(−M*,M*)c dµn ≤ lim supn→∞ ∫ f dµn
= ∫ f dµ ≤ ∫ 1[−M*/2,M*/2]c dµ = µ([−M*/2, M*/2]c) < ε,
and thus µn ([−M ∗ , M ∗ ]c ) < ε for n large enough, say n ≥ m. Now fix M1 , . . . , Mm > 0 such
that µn ([−Mn , Mn ]c ) < ε for n ≤ m. Putting M = max{M ∗ , M1 , . . . , Mm }, we obtain that
µn ([−M, M ]c ) ≤ ε for all n ≥ 1. This proves (3.1).
Theorem 3.1.7. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure. Then µn →wk µ if and only if limn→∞ ∫ f dµn = ∫ f dµ for f ∈ Cbu(R).
Proof. As Cbu(R) ⊆ Cb(R), it is immediate that if µn →wk µ, then limn→∞ ∫ f dµn = ∫ f dµ for f ∈ Cbu(R). We need to show the converse. Therefore, assume that for f ∈ Cbu(R), we have limn→∞ ∫ f dµn = ∫ f dµ; we wish to show that limn→∞ ∫ f dµn = ∫ f dµ for all f ∈ Cb(R). To this end, take f ∈ Cb(R) and fix ε > 0.
Using Lemma 3.1.6, take M > 0 such that µn([−M, M]c) ≤ ε for all n ≥ 1 and such that µ([−M, M]c) ≤ ε as well. Note that for any h ∈ Cb(R) such that ‖h‖∞ ≤ ‖f‖∞ and h agrees with f on [−M, M], we have

|∫ f dµ − ∫ h dµ| ≤ ∫ |f − h| 1[−M,M]c dµ ≤ 2‖f‖∞ µ([−M, M]c) ≤ 2ε‖f‖∞, (3.2)

and similarly for µn instead of µ. To complete the proof, we now locate h ∈ Cbu(R) agreeing with f on [−M, M] and satisfying ‖h‖∞ ≤ ‖f‖∞. Define h(x) = f(x) for |x| ≤ M, h(x) = f(M)(M + 1 − x) for M < x < M + 1, h(x) = f(−M)(x + M + 1) for −M − 1 < x < −M, and h(x) = 0 for |x| ≥ M + 1. Then ‖h‖∞ ≤ ‖f‖∞, and |h(x) − h(y)| ≤ |f(M)||x − y| for x, y > M, and similarly, |h(x) − h(y)| ≤ |f(−M)||x − y| for x, y < −M. We conclude that h is a continuous function which is uniformly continuous on (−∞, −M), on [−M, M] and on (M, ∞). Hence, h is uniformly continuous on R. Furthermore, h agrees with f on [−M, M]. Collecting
our conclusions, we now obtain by (3.2) that
|∫ f dµn − ∫ f dµ| ≤ |∫ f dµn − ∫ h dµn| + |∫ h dµn − ∫ h dµ| + |∫ h dµ − ∫ f dµ|
≤ 4ε‖f‖∞ + |∫ h dµn − ∫ h dµ|,

leading to lim supn→∞ |∫ f dµn − ∫ f dµ| ≤ 4ε‖f‖∞, since h ∈ Cbu(R). As ε > 0 was arbitrary, this shows limn→∞ ∫ f dµn = ∫ f dµ, proving µn →wk µ.
Before turning to a few examples, we prove some additional basic results on weak convergence.
Lemma 3.1.8 and Lemma 3.1.9 give results which occasionally are useful for proving weak
convergence.
Lemma 3.1.8. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure on (R, B). Let h : R → R be a continuous mapping. If µn →wk µ, it then also holds that h(µn) →wk h(µ).
Proof. Let f ∈ Cb(R). As h is continuous, f ◦ h ∈ Cb(R) as well, and so ∫ f dh(µn) = ∫ f ◦ h dµn → ∫ f ◦ h dµ = ∫ f dh(µ) as n tends to infinity, proving that h(µn) →wk h(µ), as desired.
Lemma 3.1.9 (Scheffé's lemma). Let (µn) be a sequence of probability measures on (R, B), and let µ be another probability measure on (R, B). Assume that there is a measure ν such that µn = gn · ν for n ≥ 1 and µ = g · ν. If limn→∞ gn(x) = g(x) for ν-almost all x, then µn →wk µ.
Proof. To prove the result, we first argue that limn→∞ ∫ |gn − g| dν = 0. To this end, with x+ = max{0, x} and x− = max{0, −x}, we first note that since both µn and µ are probability measures, we have

0 = ∫ gn dν − ∫ g dν = ∫ gn − g dν = ∫ (gn − g)+ dν − ∫ (gn − g)− dν,

so that ∫ |gn − g| dν = ∫ (gn − g)+ dν + ∫ (gn − g)− dν = 2 ∫ (gn − g)− dν. It therefore suffices to show that this latter integral tends to zero. To do so, note that

(gn − g)−(x) = max{0, −(gn(x) − g(x))} = max{0, g(x) − gn(x)} ≤ g(x), (3.3)

and that (gn − g)− converges to zero ν-almost everywhere. As g is integrable with respect to ν, the dominated convergence theorem yields limn→∞ ∫ (gn − g)− dν = 0, and so limn→∞ ∫ |gn − g| dν = 0. In order to obtain the desired weak convergence from this, let f ∈ Cb(R); we then have
lim supn→∞ |∫ f(x) dµn(x) − ∫ f(x) dµ(x)| ≤ lim supn→∞ ∫ |f(x)||gn(x) − g(x)| dν(x)
≤ ‖f‖∞ lim supn→∞ ∫ |gn(x) − g(x)| dν(x) = 0,

proving limn→∞ ∫ f dµn = ∫ f dµ and hence µn →wk µ.
Lemma 3.1.8 shows that weak convergence is preserved under continuous transformations, a result similar in spirit to Lemma 1.2.6. Lemma 3.1.9 shows that for probability measures which have densities with respect to the same common measure, ν-almost everywhere convergence of the densities is sufficient to obtain weak convergence. This is in several cases a very useful observation.
Example 3.1.10. Let (xn ) be a sequence of real numbers and consider the corresponding
Dirac measures (εxn ), that is, εxn is the probability measure which accords probability one
to the set {xn } and zero to all Borel subsets in the complement of {xn }. We claim that if xn
converges to x for some x ∈ R, then εxn converges weakly to εx . To see this, take f ∈ Cb (R).
By continuity, we then have

limn→∞ ∫ f dεxn = limn→∞ f(xn) = f(x) = ∫ f dεx,

proving the claim. ◦
Note that the measures (µn ) in Example 3.1.11 are discrete in nature, while the limit measure
is continuous in nature. This shows that qualities such as being discrete or continuous in
nature are not preserved by weak convergence.
Example 3.1.12. Let (ξn) and (σn) be two real sequences with limits ξ and σ, respectively, where we assume that σ > 0. Let µn be the normal distribution with mean ξn and variance σn². We claim that µn converges weakly to µ, where µ denotes the normal distribution with mean ξ and variance σ². To demonstrate this result, define mappings gn for n ≥ 1 and g by putting gn(x) = (1/(σn√(2π))) exp(−(x − ξn)²/(2σn²)) and g(x) = (1/(σ√(2π))) exp(−(x − ξ)²/(2σ²)). Then, µn has density gn with respect to the Lebesgue measure, and µ has density g with respect to the Lebesgue measure. As gn converges pointwise to g, Lemma 3.1.9 shows that µn converges weakly to µ, as desired. ◦
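The convergence in Example 3.1.12 can be illustrated numerically. Under the stated assumptions, the proof of Lemma 3.1.9 even yields convergence of the densities in L¹; the sketch below (an informal aid with the hypothetical choices ξn = 1/n and σn = 1 + 1/n) approximates ∫ |gn − g| dλ on a grid and observes it shrinking.

```python
import numpy as np

def normal_density(x, xi, sigma):
    # Density of the normal distribution with mean xi and standard deviation sigma.
    return np.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-20.0, 20.0, 200_001)
dx = x[1] - x[0]
g = normal_density(x, 0.0, 1.0)          # limit density

l1_gaps = []
for n in (1, 10, 1000):
    gn = normal_density(x, 1.0 / n, 1.0 + 1.0 / n)   # mean 1/n, sd 1 + 1/n
    l1_gaps.append(np.sum(np.abs(gn - g)) * dx)       # approximates the L1 distance

assert l1_gaps[0] > l1_gaps[1] > l1_gaps[2]   # the L1 gap shrinks as n grows
assert l1_gaps[-1] < 0.01
```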
3.2 Weak convergence and distribution functions
In this section, we investigate the connection between weak convergence of probability mea-
sures and convergence of the corresponding cumulative distribution functions. We will show
that weak convergence is not in general equivalent to pointwise convergence of cumulative
distribution functions, but is equivalent to pointwise convergence on a dense subset of R.
Lemma 3.2.1. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure. Assume that µn has cumulative distribution function Fn for n ≥ 1, and assume that µ has cumulative distribution function F. If µn →wk µ, then it holds that limn→∞ Fn(x) = F(x) whenever F is continuous at x.
Proof. Assume that µn →wk µ and let x be such that F is continuous at x. Let ε > 0. By Lemma 3.1.3, there exists h ∈ Cb(R) such that 1[x−2ε,x−ε](y) ≤ h(y) ≤ 1(x−3ε,x)(y) for y ∈ R. Putting f(y) = h(y) for y ≥ x − ε and f(y) = 1 for y < x − ε, we find that 0 ≤ f ≤ 1, f(y) = 1 for y ≤ x − ε and f(y) = 0 for y ≥ x. Thus, 1(−∞,x−ε](y) ≤ f(y) ≤ 1(−∞,x](y) for y ∈ R. This implies F(x − ε) ≤ ∫ f dµ and ∫ f dµn ≤ Fn(x), from which we conclude

F(x − ε) ≤ ∫ f dµ = limn→∞ ∫ f dµn ≤ lim infn→∞ Fn(x). (3.4)
Similarly, there exists g ∈ Cb(R) such that 0 ≤ g ≤ 1, g(y) = 1 for y ≤ x and g(y) = 0 for y > x + ε, implying Fn(x) ≤ ∫ g dµn and ∫ g dµ ≤ F(x + ε) and allowing us to obtain

lim supn→∞ Fn(x) ≤ limn→∞ ∫ g dµn = ∫ g dµ ≤ F(x + ε). (3.5)
Combining (3.4) and (3.5), we conclude that for all ε > 0, it holds that

F(x − ε) ≤ lim infn→∞ Fn(x) ≤ lim supn→∞ Fn(x) ≤ F(x + ε).

Since F is continuous at x, we may now let ε tend to zero and obtain that lim infn→∞ Fn(x) and lim supn→∞ Fn(x) are equal, and the common value is F(x). Therefore, Fn(x) converges and limn→∞ Fn(x) = F(x). This completes the proof.
The following example shows that in general, weak convergence does not imply convergence
of the cumulative distribution functions in all points. After the example, we prove the gen-
eral result on the correspondence between weak convergence and convergence of cumulative
distribution functions.
Example 3.2.2. For each n ≥ 1, let µn be the Dirac measure at 1/n, and let µ be the Dirac measure at 0. According to Example 3.1.10, µn →wk µ. But with Fn being the cumulative distribution function for µn and with F being the cumulative distribution function for µ, we have Fn(0) = 0 for all n ≥ 1, while F(0) = 1, so that limn→∞ Fn(0) ≠ F(0). ◦
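The cumulative distribution functions in Example 3.2.2 are simple enough to tabulate directly; the following informal sketch confirms the failure of convergence at the discontinuity point 0 and the convergence at a continuity point (0.05 is an arbitrary choice).

```python
def F_n(x, n):
    """CDF of the Dirac measure at 1/n."""
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    """CDF of the Dirac measure at 0."""
    return 1.0 if x >= 0.0 else 0.0

# At the discontinuity point x = 0 of F, F_n(0) does not converge to F(0):
assert all(F_n(0.0, n) == 0.0 for n in range(1, 100)) and F(0.0) == 1.0

# At a continuity point of F, pointwise convergence holds; here x = 0.05,
# where F_n(0.05) equals 1 as soon as 1/n <= 0.05:
assert [F_n(0.05, n) for n in (1, 10, 100)] == [0.0, 0.0, 1.0] and F(0.05) == 1.0
```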
Theorem 3.2.3. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure. Assume that µn has cumulative distribution function Fn for n ≥ 1, and assume that µ has cumulative distribution function F. Then µn →wk µ if and only if there exists a dense subset A of R such that limn→∞ Fn(x) = F(x) for x ∈ A.
Proof. First assume that µn →wk µ; we wish to identify a dense subset of R such that we have pointwise convergence of the cumulative distribution functions on this set. Let B be the set of discontinuity points of F. As F is nondecreasing, B is countable, so Bc is dense in R. By Lemma 3.2.1, limn→∞ Fn(x) = F(x) whenever x ∈ Bc, and so Bc satisfies the requirements.
Next, assume that there exists a dense subset A of R such that limn→∞ Fn(x) = F(x) for x ∈ A. We wish to show that µn →wk µ. To this end, let f ∈ Cb(R); we need to prove limn→∞ ∫ f dµn = ∫ f dµ. Fix ε > 0. Recall that F(x) tends to zero and one as x tends to minus infinity and infinity, respectively. Therefore, we may find a, b ∈ A with a < b such that limn→∞ Fn(a) = F(a), limn→∞ Fn(b) = F(b), F(a) < ε and F(b) > 1 − ε. For n large enough, we then also obtain Fn(a) < ε and Fn(b) > 1 − ε. Applying these properties, we obtain

lim supn→∞ |∫ f 1(−∞,a] dµn − ∫ f 1(−∞,a] dµ| ≤ lim supn→∞ ‖f‖∞ (µn((−∞, a]) + µ((−∞, a])) ≤ 2ε‖f‖∞,

and similarly,

lim supn→∞ |∫ f 1(b,∞) dµn − ∫ f 1(b,∞) dµ| ≤ lim supn→∞ ‖f‖∞ (µn((b, ∞)) + µ((b, ∞))) ≤ 2ε‖f‖∞.
Now, f is uniformly continuous on [a, b]. Pick δ > 0 parrying ε for this uniform continuity. Using that A is dense in R, pick a partition a = x0 < x1 < · · · < xm = b of elements in A such that |xi − xi−1| ≤ δ for all i ≤ m. We then have |f(x) − f(xi−1)| ≤ ε whenever xi−1 < x ≤ xi, and so

|∫ f 1(a,b] dµn − ∑_{i=1}^{m} f(x_{i−1}) µn((x_{i−1}, x_i])| = |∑_{i=1}^{m} ∫ (f(x) − f(x_{i−1})) 1_{(x_{i−1},x_i]}(x) dµn(x)|
≤ ∑_{i=1}^{m} ∫ |f(x) − f(x_{i−1})| 1_{(x_{i−1},x_i]}(x) dµn(x) ≤ ε µn((a, b]) ≤ ε.
3.3 Weak convergence and convergence in probability

We begin with a simple equivalence. Note that for x ∈ R, statements such as Xn →P x and Xn →D x are equivalent to the statements that limn→∞ P(|Xn − x| > ε) = 0 for ε > 0 and limn→∞ Ef(Xn) = f(x) for f ∈ Cb(R), respectively, and so in terms of stochasticity depend
only on the distributions of Xn for each n ≥ 1. In the case of convergence in probability, this is not the typical situation, as we in general have that Xn →P X is a statement depending on the joint distributions of (Xn, X).
Lemma 3.3.1. Let (Xn) be a sequence of random variables, and let x ∈ R. Then Xn →P x if and only if Xn →D x.
Proof. By Theorem 1.2.8, we know that if Xn →P x, then Xn →D x as well. In order to prove the converse, assume that Xn →D x; we wish to show that Xn →P x. Take ε > 0. By Lemma 3.1.3, there exists g ∈ Cb(R) such that 1[x−ε/2,x+ε/2](y) ≤ g(y) ≤ 1(x−ε,x+ε)(y) for y ∈ R. With f = 1 − g, we then also obtain 1(x−ε,x+ε)c(y) ≤ f(y) ≤ 1[x−ε/2,x+ε/2]c(y), and in particular, f(x) = 0. By weak convergence, we may then conclude

lim supn→∞ P(|Xn − x| > ε) ≤ lim supn→∞ Ef(Xn) = f(x) = 0,

so Xn →P x, as desired.
Lemma 3.3.2 (Slutsky's Lemma). Let (Xn, Yn) be a sequence of pairs of random variables, and let X be some other variable. If Xn →D X and Yn →P 0, then Xn + Yn →D X.
Proof. Applying Theorem 3.1.7, we find that in order to obtain the result, it suffices to prove that limn→∞ Ef(Xn + Yn) = Ef(X) for f ∈ Cbu(R). Fix f ∈ Cbu(R). Note that

lim supn→∞ |Ef(Xn + Yn) − Ef(X)| ≤ lim supn→∞ |Ef(Xn + Yn) − Ef(Xn)| + lim supn→∞ |Ef(Xn) − Ef(X)|, (3.8)

where the latter term is zero, as Xn →D X and f ∈ Cbu(R), so it suffices to show that the former term is zero as well. To this end, take ε > 0, and pick δ > 0 parrying ε for the uniform continuity of f. We then obtain in particular that for x ∈ R and |y| ≤ δ, |f(x + y) − f(x)| ≤ ε. We then obtain

|Ef(Xn + Yn) − Ef(Xn)| ≤ E|f(Xn + Yn) − f(Xn)| ≤ ε + 2‖f‖∞ P(|Yn| > δ),
and as Yn →P 0, this implies lim supn→∞ |Ef(Xn + Yn) − Ef(Xn)| ≤ ε. As ε > 0 was arbitrary, this yields lim supn→∞ |Ef(Xn + Yn) − Ef(Xn)| = 0. Combining this with (3.8), we obtain limn→∞ Ef(Xn + Yn) = Ef(X), as desired.
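Slutsky's lemma lends itself to a quick Monte Carlo sanity check (a sketch with arbitrarily chosen Xn and Yn, in no way part of the proof): with Xn standard normal and Yn a small perturbation tending to zero in probability, Ef(Xn + Yn) should be close to Ef(X) for a bounded continuous f, here f = arctan.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
xn = rng.standard_normal(n)          # stand-in for a sequence with Xn -> N(0,1) in distribution
yn = rng.uniform(-1.0, 1.0, n) / 50.0  # small perturbation, tending to 0 in probability

# Compare E f(Xn + Yn) with E f(X) for the bounded continuous f = arctan.
lhs = np.mean(np.arctan(xn + yn))
rhs = np.mean(np.arctan(rng.standard_normal(n)))
assert abs(lhs - rhs) < 0.02   # both estimates are near E arctan(X) = 0
```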
Theorem 3.3.3. Let (Xn, Yn) be a sequence of pairs of random variables, let X be some other variable and let y ∈ R. Let h : R2 → R be a continuous mapping. If Xn →D X and Yn →P y, then h(Xn, Yn) →D h(X, y).
Proof. First note that h(Xn, Yn) = h(Xn, Yn) − h(Xn, y) + h(Xn, y). Define hy : R → R by hy(x) = h(x, y). The distribution of h(Xn, y) is then hy(Xn(P)) and the distribution of h(X, y) is hy(X(P)). As we have assumed that Xn →D X, Lemma 3.1.2 yields that Xn(P) →wk X(P). Therefore, as hy is continuous, hy(Xn(P)) →wk hy(X(P)) by Lemma 3.1.8, which by Lemma 3.1.2 implies that h(Xn, y) →D h(X, y). Therefore, by Lemma 3.3.2, it suffices to prove that h(Xn, Yn) − h(Xn, y) converges in probability to zero.
To this end, let ε > 0; we have to show limn→∞ P(|h(Xn, Yn) − h(Xn, y)| > ε) = 0. We have assumed that h is continuous. Equipping R2 with the metric d : R2 × R2 → [0, ∞) given by d((a1, a2), (b1, b2)) = |a1 − b1| + |a2 − b2|, h is in particular continuous with respect to this metric on R2. Now let M, η > 0 and note that h is uniformly continuous on the compact set [−M, M] × [y − η, y + η]. Therefore, we may pick δ > 0 parrying ε for this uniform continuity, and we may assume that δ ≤ η. We then have, whenever |Xn| ≤ M and |Yn − y| ≤ δ, that |h(Xn, Yn) − h(Xn, y)| ≤ ε, which yields (|h(Xn, Yn) − h(Xn, y)| > ε) ⊆ (|Xn| > M) ∪ (|Yn − y| > δ) and thus

lim supn→∞ P(|h(Xn, Yn) − h(Xn, y)| > ε) ≤ lim supn→∞ P(|Xn| > M) + lim supn→∞ P(|Yn − y| > δ) = lim supn→∞ P(|Xn| > M).

By Lemma 3.1.6, the latter tends to zero as M tends to infinity. We therefore conclude that lim supn→∞ P(|h(Xn, Yn) − h(Xn, y)| > ε) = 0, so h(Xn, Yn) − h(Xn, y) converges in probability to zero and the result follows.
3.4 Weak convergence and characteristic functions
Let C denote the complex numbers. In this section, we will associate to each probability
measure µ on (R, B) a mapping ϕ : R → C, called the characteristic function of µ. We will see
that the characteristic function determines the probability measure uniquely, in the sense that
two probability measures with equal characteristic functions in fact are equal. Furthermore,
we will show, and this will be the main result of the section, that weak convergence of
probability measures is equivalent to pointwise convergence of characteristic functions. As
characteristic functions in general are pleasant to work with, both from theoretical and
practical viewpoints, this result is of considerable use.
Before we introduce the characteristic function, we recall some results from complex analysis.
For z ∈ C, we let Re(z) and Im(z) denote the real and imaginary parts of z, and with i denoting the imaginary unit, we always have z = Re(z) + i Im(z). Re and Im are then mappings from C to R. Also, for z ∈ C with z = Re(z) + i Im(z), we define z̄ = Re(z) − i Im(z) and refer to z̄ as the complex conjugate of z.
Also recall that we may define the complex exponential by its Taylor series, putting

e^z = ∑_{n=0}^∞ z^n / n!

for any z ∈ C, where the series is absolutely convergent. We then also obtain the relationship e^{iz} = cos z + i sin z, where the complex cosine and the complex sine functions are defined by their Taylor series,

cos z = ∑_{n=0}^∞ (−1)^n z^{2n} / (2n)!  and  sin z = ∑_{n=0}^∞ (−1)^n z^{2n+1} / (2n + 1)!.
In particular, for t ∈ R, we obtain e^{it} = cos t + i sin t, where cos t and sin t here are the ordinary real cosine and sine functions. This shows that the complex exponential of a purely imaginary argument yields a point on the unit circle corresponding to an angle of t measured in radians.
Let (E, E, µ) be a measure space and let f : E → C be a complex-valued function defined on E. Then f(x) = Re(f(x)) + i Im(f(x)). We refer to the mappings x ↦ Re(f(x)) and x ↦ Im(f(x)) as the real and imaginary parts of f, and denote them by Re f and Im f, respectively. Endowing C with the σ-algebra BC generated by the open sets, it also holds that BC is the smallest σ-algebra making Re and Im measurable. We then obtain that for any f : E → C, f is E-BC measurable if and only if both the real and imaginary parts of f are E-B measurable.
The space of integrable complex functions is denoted LC(E, E, µ) or simply LC. Note that we have the inequalities |Re f| ≤ |f|, |Im f| ≤ |f| and |f| ≤ |Re f| + |Im f|. Therefore, f is integrable if and only if |f| is integrable.
Example 3.4.2. Let γ ≠ 0 be a real number. Since |e^{iγt}| = 1 for all t ∈ R, t ↦ e^{iγt} is integrable with respect to the Lebesgue measure on all compact intervals [a, b]. As it holds that e^{iγt} = cos γt + i sin γt, we obtain

∫_a^b e^{iγt} dt = ∫_a^b cos γt dt + i ∫_a^b sin γt dt = (sin γb − sin γa)/γ + i (cos γa − cos γb)/γ
= (−i/γ)(cos γb + i sin γb − cos γa − i sin γa) = (e^{iγb} − e^{iγa})/(iγ),
extending the results for the real exponential function to the complex case. ◦
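The closed form in Example 3.4.2 is easy to confirm numerically; the sketch below (an informal check with ad hoc choices of a, b, γ and grid size) integrates t ↦ e^{iγt} with a simple trapezoidal rule and compares against (e^{iγb} − e^{iγa})/(iγ).

```python
import cmath
import numpy as np

def complex_exp_integral(a, b, gamma, steps=100_001):
    """Trapezoidal approximation of the integral of e^{i*gamma*t} over [a, b]."""
    t = np.linspace(a, b, steps)
    vals = np.exp(1j * gamma * t)
    return np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(t))

a, b, gamma = -1.0, 2.0, 3.0
closed_form = (cmath.exp(1j * gamma * b) - cmath.exp(1j * gamma * a)) / (1j * gamma)
assert abs(complex_exp_integral(a, b, gamma) - closed_form) < 1e-6
```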
Lemma 3.4.3. Let (E, E, µ) be a measure space, let f, g ∈ LC(E, E, µ) and let z ∈ C. Then zf and f + g are in LC(E, E, µ), with ∫ zf dµ = z ∫ f dµ and ∫ f + g dµ = ∫ f dµ + ∫ g dµ.

Proof. We first show that for f integrable and z ∈ C, it holds that zf is integrable and ∫ zf dµ = z ∫ f dµ. First off, note that ∫ |zf| dµ = ∫ |z||f| dµ = |z| ∫ |f| dµ < ∞, so zf is integrable. Furthermore,

∫ zf dµ = ∫ (Re(z) + i Im(z))(Re f + i Im f) dµ
= ∫ Re(z) Re f − Im(z) Im f + i(Re(z) Im f + Im(z) Re f) dµ
= ∫ Re(z) Re f − Im(z) Im f dµ + i ∫ Re(z) Im f + Im(z) Re f dµ
= Re(z) ∫ Re f dµ − Im(z) ∫ Im f dµ + i (Re(z) ∫ Im f dµ + Im(z) ∫ Re f dµ)
= (Re(z) + i Im(z)) (∫ Re f dµ + i ∫ Im f dµ) = z ∫ f dµ,
as desired. Next, we show that for f, g ∈ LC, f + g ∈ LC and ∫ f + g dµ = ∫ f dµ + ∫ g dµ. First, as |f + g| ≤ |f| + |g|, it follows that f + g ∈ LC. In order to obtain the desired identity, it suffices to note that Re(f + g) = Re f + Re g and Im(f + g) = Im f + Im g, so that the identity follows from additivity of the integral for real-valued mappings.
Lemma 3.4.4. Let (E, E, µ) be a measure space and let f ∈ LC(E, E, µ). Then |∫ f dµ| ≤ ∫ |f| dµ.

Proof. Recall that for z ∈ C, there exists θ ∈ R such that z = |z|e^{iθ}. Applying this to the integral ∫ f dµ, we obtain |∫ f dµ| = e^{−iθ} ∫ f dµ = ∫ e^{−iθ} f dµ by Lemma 3.4.3. As the left-hand side is real, the right-hand side must be real as well. Hence, we obtain

|∫ f dµ| = Re(e^{−iθ} ∫ f dµ) = ∫ Re(e^{−iθ} f) dµ ≤ ∫ |Re(e^{−iθ} f)| dµ ≤ ∫ |e^{−iθ} f| dµ = ∫ |f| dµ,

as desired.
Next, we state versions of the dominated convergence theorem and Fubini’s theorem for
complex mappings.
Theorem 3.4.5. Let (E, E, µ) be a measure space, and let (fn ) be a sequence of measurable
mappings from E to C. Assume that the sequence (fn ) converges µ-almost everywhere to
some mapping f . Assume that there exists a measurable, integrable mapping g : E → [0, ∞)
such that |fn | ≤ g µ-almost everywhere for all n. Then fn is integrable for all n ≥ 1, f is
measurable and integrable, and
limn→∞ ∫ fn dµ = ∫ limn→∞ fn dµ.
Proof. As fn converges µ-almost everywhere to f, we find that Re fn converges µ-almost everywhere to Re f and Im fn converges µ-almost everywhere to Im f. Furthermore, we have |Re fn| ≤ g and |Im fn| ≤ g µ-almost everywhere. Therefore, the dominated convergence theorem for real-valued mappings yields

limn→∞ ∫ fn dµ = limn→∞ (∫ Re fn dµ + i ∫ Im fn dµ) = ∫ limn→∞ Re fn dµ + i ∫ limn→∞ Im fn dµ = ∫ limn→∞ fn dµ,
as desired.
Theorem 3.4.6. Let (E, E, µ) and (F, F, ν) be two σ-finite measure spaces, and assume that
f : E × F → C is E ⊗ F measurable and µ ⊗ ν integrable. Then y 7→ f (x, y) is integrable with
respect to ν for µ-almost all x, the set where this is the case is measurable, and it holds that
∫ f(x, y) d(µ ⊗ ν)(x, y) = ∫ ∫ f(x, y) dν(y) dµ(x).

Proof. The result follows by applying Fubini's theorem for real-valued mappings to the real and imaginary parts of f, as was to be proven.
We are now ready to introduce the characteristic function of a probability measure on (R, B).
Definition 3.4.7. Let µ be a probability measure on (R, B). The characteristic function for µ is the function ϕ : R → C defined by ϕ(θ) = ∫ e^{iθx} dµ(x).

Since |e^{iθx}| = 1 for all values of θ and x, the integral ∫ e^{iθx} dµ(x) in Definition 3.4.7 is always well-defined. For a random variable X with distribution µ, we also introduce the characteristic function ϕ of X as the characteristic function of µ. The characteristic function ϕ of X may then also be expressed as

ϕ(θ) = ∫ e^{iθx} dµ(x) = ∫ e^{iθx} dX(P)(x) = ∫ e^{iθX(ω)} dP(ω) = E e^{iθX}.
Lemma 3.4.8. Let µ be a probability measure on (R, B) and assume that ϕ is the characteristic function of µ. The mapping ϕ has the following properties.

(1). ϕ(0) = 1.
(2). |ϕ(θ)| ≤ 1 for all θ ∈ R.
(3). ϕ(−θ) = ϕ̄(θ) for all θ ∈ R, where ϕ̄(θ) denotes the complex conjugate of ϕ(θ).
(4). ϕ is uniformly continuous.
(5). If ∫ |x|^n dµ(x) is finite, then ϕ is n times continuously differentiable, and with ϕ^(n) denoting the n'th derivative, we have ϕ^(n)(θ) = i^n ∫ x^n e^{iθx} dµ(x). In particular, ϕ^(n)(0) = i^n ∫ x^n dµ(x).
Proof. The first claim follows as ϕ(0) = ∫ e^{i0x} dµ(x) = µ(R) = 1, and the second claim follows as |ϕ(θ)| = |∫ e^{iθx} dµ(x)| ≤ ∫ |e^{iθx}| dµ(x) = 1. Also, the third claim follows since

ϕ(−θ) = ∫ e^{i(−θ)x} dµ(x) = ∫ cos(−θx) + i sin(−θx) dµ(x)
= ∫ cos(−θx) dµ(x) + i ∫ sin(−θx) dµ(x)
= ∫ cos(θx) dµ(x) − i ∫ sin(θx) dµ(x) = ϕ̄(θ).
To obtain the fourth claim, let θ ∈ R and let h > 0. We then have

|ϕ(θ + h) − ϕ(θ)| = |∫ e^{i(θ+h)x} − e^{iθx} dµ(x)| = |∫ e^{iθx}(e^{ihx} − 1) dµ(x)|
≤ ∫ |e^{iθx}||e^{ihx} − 1| dµ(x) = ∫ |e^{ihx} − 1| dµ(x),

where limh→0 ∫ |e^{ihx} − 1| dµ(x) = 0 by the dominated convergence theorem. In order to use this to obtain uniform continuity, let ε > 0. Choose δ > 0 so that for 0 ≤ h ≤ δ, ∫ |e^{ihx} − 1| dµ(x) ≤ ε. We then find for any x, y ∈ R with x < y and |x − y| ≤ δ that |ϕ(y) − ϕ(x)| = |ϕ(x + (y − x)) − ϕ(x)| ≤ ε, and as a consequence, we find that ϕ is uniformly continuous.
Next, we prove the results on the derivative. We apply an induction argument, and wish to show for n ≥ 0 that if ∫ |x|^n dµ(x) is finite, then ϕ is n times continuously differentiable with ϕ^(n)(θ) = i^n ∫ x^n e^{iθx} dµ(x). Noting that the induction start holds, it suffices to prove the induction step. Assume that the result holds for n; we wish to prove it for n + 1. Assume that ∫ |x|^{n+1} dµ(x) is finite. Fix θ ∈ R and h > 0. We then have

(1/h)(ϕ^(n)(θ + h) − ϕ^(n)(θ)) = (1/h)(i^n ∫ x^n e^{i(θ+h)x} dµ(x) − i^n ∫ x^n e^{iθx} dµ(x)) = ∫ i^n x^n e^{iθx} (e^{ihx} − 1)/h dµ(x). (3.9)

We wish to apply the dominated convergence theorem to calculate the limit of the above as h tends to zero. First note that by l'Hôpital's rule, we have limh→0 (e^{ihx} − 1)/h = ix, so the integrand in the final expression of (3.9) converges pointwise to i^{n+1} x^{n+1} e^{iθx}. Note furthermore that since |cos x − 1| = |∫_0^x sin y dy| ≤ |x| and |sin x| = |∫_0^x cos y dy| ≤ |x| for all x ∈ R, we have

|i^n x^n e^{iθx} (e^{ihx} − 1)/h| ≤ |x|^n (|cos hx − 1| + |sin hx|)/h ≤ 2|x|^{n+1},

which is integrable with respect to µ. The dominated convergence theorem therefore yields

limh→0 (1/h)(ϕ^(n)(θ + h) − ϕ^(n)(θ)) = ∫ i^{n+1} x^{n+1} e^{iθx} dµ(x),

as desired. This proves that ϕ is n + 1 times differentiable, and yields the desired expression for ϕ^(n+1). By another application of the dominated convergence theorem, we also obtain that ϕ^(n+1) is continuous. This completes the induction proof. As a consequence of this latter result, it also follows that when ∫ |x|^n dµ(x) is finite, ϕ^(n)(0) = i^n ∫ x^n dµ(x). This proves the lemma.
Lemma 3.4.9. Assume that X is a random variable with characteristic function ϕ, and let α, β ∈ R. The variable α + βX has characteristic function φ given by φ(θ) = e^{iθα} ϕ(βθ).

Proof. Noting that φ(θ) = E e^{iθ(α+βX)} = e^{iθα} E e^{iβθX} = e^{iθα} ϕ(βθ), the result follows.
Next, we show by example how to calculate the characteristic functions of a few distributions.
Example 3.4.10. Let ϕ be the characteristic function of the standard normal distribution; we wish to obtain a closed-form expression for ϕ. We will do this by proving that ϕ satisfies a particular differential equation. To this end, let f be the density of the standard normal distribution, f(x) = (1/√(2π)) exp(−x²/2). Note that for any θ ∈ R, we have by Lemma 3.4.8 that

ϕ(θ) = ∫_{−∞}^∞ e^{iθx} f(x) dx = ∫_{−∞}^∞ e^{−iθx} f(−x) dx = ∫_{−∞}^∞ e^{−iθx} f(x) dx = ϕ(−θ) = ϕ̄(θ).

As a consequence, Im ϕ(θ) = 0, so ϕ(θ) = ∫_{−∞}^∞ cos(θx) f(x) dx. Next, note that

|d/dθ cos(θx) f(x)| = |−x sin(θx) f(x)| ≤ |x| f(x),
which is integrable with respect to the Lebesgue measure. Therefore, ϕ(θ) is differentiable
for all θ ∈ R, and the derivative may be calculated by an exchange of limits. Recalling that
f′(x) = −x f(x), we obtain

ϕ′(θ) = d/dθ ∫_{−∞}^∞ cos(θx) f(x) dx = ∫_{−∞}^∞ d/dθ cos(θx) f(x) dx
= −∫_{−∞}^∞ x sin(θx) f(x) dx = ∫_{−∞}^∞ sin(θx) f′(x) dx.

Integrating by parts, we find

∫_{−∞}^∞ sin(θx) f′(x) dx = limM→∞ [sin(θx) f(x)]_{−M}^M − θ ∫_{−∞}^∞ cos(θx) f(x) dx = −θϕ(θ),

since limM→∞ f(M) = limM→∞ f(−M) = 0. Thus, ϕ satisfies ϕ′(θ) = −θϕ(θ). All the solutions to this differential equation are of the form θ ↦ c exp(−θ²/2) for some c ∈ R, so we conclude that there exists c ∈ R such that ϕ(θ) = c exp(−θ²/2) for all θ ∈ R. As ϕ(0) = 1, this implies ϕ(θ) = exp(−θ²/2).
By Lemma 3.4.9, we then also obtain as an immediate corollary that the characteristic function for the normal distribution with mean ξ and variance σ², where σ > 0, is given by θ ↦ exp(iξθ − σ²θ²/2). ◦
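The differential-equation derivation above can be cross-checked numerically: integrating cos(θx)f(x) over a truncated grid (the truncation at ±12 and the step count are ad hoc choices in this informal sketch) reproduces exp(−θ²/2) to high accuracy.

```python
import numpy as np

def normal_cf_numeric(theta):
    """Trapezoidal approximation of the integral of cos(theta*x) f(x) for the standard normal density f."""
    x = np.linspace(-12.0, 12.0, 400_001)   # tails beyond +-12 are negligible
    f = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    vals = np.cos(theta * x) * f            # the imaginary part vanishes by symmetry
    return np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(x))

for theta in (0.0, 0.7, 2.0):
    assert abs(normal_cf_numeric(theta) - np.exp(-theta ** 2 / 2)) < 1e-6
```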
3.4 Weak convergence and characteristic functions 79
Example 3.4.11. In this example, we derive the characteristic function for the standard exponential distribution. Let ϕ denote the characteristic function; we then have

ϕ(θ) = ∫_0^∞ cos(θx) e^{−x} dx + i ∫_0^∞ sin(θx) e^{−x} dx,

and we need to evaluate both of these integrals. In order to do so, fix a, b ∈ R and note that

d/dx (a cos(θx) e^{−x} + b sin(θx) e^{−x}) = (bθ − a) cos(θx) e^{−x} − (aθ + b) sin(θx) e^{−x}.

Next, note that the pair of equations bθ − a = c and −(aθ + b) = d has a unique solution in a and b, given by a = (−c − dθ)/(1 + θ²) and b = (cθ − d)/(1 + θ²), such that we obtain

d/dx ( ((−c − dθ)/(1 + θ²)) cos(θx) e^{−x} + ((cθ − d)/(1 + θ²)) sin(θx) e^{−x} ) = c cos(θx) e^{−x} + d sin(θx) e^{−x}. (3.10)

Using (3.10) with c = 1 and d = 0, we conclude that

∫_0^∞ cos(θx) e^{−x} dx = limM→∞ [ (−cos(θx)/(1 + θ²)) e^{−x} + (θ sin(θx)/(1 + θ²)) e^{−x} ]_0^M = 1/(1 + θ²), (3.11)

and likewise, using (3.10) with c = 0 and d = 1, we find

∫_0^∞ sin(θx) e^{−x} dx = limM→∞ [ (−θ cos(θx)/(1 + θ²)) e^{−x} + (−sin(θx)/(1 + θ²)) e^{−x} ]_0^M = θ/(1 + θ²). (3.12)

Combining (3.11) and (3.12), we conclude

ϕ(θ) = 1/(1 + θ²) + i θ/(1 + θ²) = (1 + iθ)/(1 + θ²) = (1 + iθ)/((1 + iθ)(1 − iθ)) = 1/(1 − iθ).

By Lemma 3.4.9, we then also obtain that the exponential distribution with mean λ for λ > 0 has characteristic function θ ↦ 1/(1 − iλθ). ◦
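As with the normal case, the closed form 1/(1 − iθ) is easy to confirm numerically; the sketch below (an informal check, with the truncation point 60 and the step count chosen ad hoc) approximates the complex integral defining the characteristic function of the standard exponential distribution.

```python
import numpy as np

def exp_cf_numeric(theta, upper=60.0, steps=1_200_001):
    """Trapezoidal approximation of the integral of e^{i*theta*x} e^{-x} over [0, upper]."""
    x = np.linspace(0.0, upper, steps)   # the tail beyond `upper` is negligible
    vals = np.exp(1j * theta * x) * np.exp(-x)
    return np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(x))

for theta in (0.0, 1.0, -2.5):
    assert abs(exp_cf_numeric(theta) - 1 / (1 - 1j * theta)) < 1e-6
```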
Example 3.4.12. We wish to derive the characteristic function for the Laplace distribution. Denote by ϕ this characteristic function. Using the relationships sin(−θx) = −sin(θx) and cos(−θx) = cos(θx) and recalling (3.10), we obtain

ϕ(θ) = ∫_{−∞}^∞ cos(θx) (1/2) e^{−|x|} dx + i ∫_{−∞}^∞ sin(θx) (1/2) e^{−|x|} dx
= ∫_{−∞}^∞ cos(θx) (1/2) e^{−|x|} dx = ∫_0^∞ cos(θx) e^{−x} dx
= limM→∞ [ (−cos(θx)/(1 + θ²)) e^{−x} + (θ sin(θx)/(1 + θ²)) e^{−x} ]_0^M = 1/(1 + θ²). ◦
Next, we introduce the convolution of two probability measures and argue that characteristic
functions interact in a simple manner with convolutions.
Definition 3.4.13. Let µ and ν be two probability measures on (R, B). The convolution
µ ∗ ν of µ and ν is the probability measure h(µ ⊗ ν) on (R, B), where h : R2 → R is given by
h(x, y) = x + y.
The following lemma gives an important interpretation of the convolution of two probability
measures.
Lemma 3.4.14. Let X and Y be two independent random variables defined on the same probability space. Assume that X has distribution µ and that Y has distribution ν. Then X + Y has distribution µ ∗ ν.

Proof. By independence, the distribution of (X, Y) is µ ⊗ ν. With h(x, y) = x + y, the distribution of X + Y is therefore h((X, Y)(P)) = h(µ ⊗ ν) = µ ∗ ν, so µ ∗ ν is the distribution of X + Y.
Lemma 3.4.15. Let µ and ν be probability measures on (R, B) with characteristic functions ϕ and φ. Then µ ∗ ν has characteristic function θ ↦ ϕ(θ)φ(θ).

Proof. Take independent random variables X and Y with distributions µ and ν. By Lemma 3.4.14, X + Y has distribution µ ∗ ν, and by independence, E e^{iθ(X+Y)} = E e^{iθX} E e^{iθY} = ϕ(θ)φ(θ), proving the claim.
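Lemma 3.4.15 can be illustrated by simulation (a rough Monte Carlo sketch with arbitrarily chosen µ and ν, not a proof): the empirical characteristic function of X + Y, for independent X standard exponential and Y standard normal, should be close to the product of the two characteristic functions derived in Examples 3.4.10 and 3.4.11.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
x = rng.exponential(1.0, n)      # distribution mu, characteristic function 1/(1 - i*theta)
y = rng.standard_normal(n)       # distribution nu, characteristic function exp(-theta^2/2)

theta = 0.8
empirical = np.mean(np.exp(1j * theta * (x + y)))   # empirical cf of mu * nu at theta
product = (1 / (1 - 1j * theta)) * np.exp(-theta ** 2 / 2)
assert abs(empirical - product) < 0.02
```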
As mentioned earlier, two of our main objectives in this section are to prove that probability measures are uniquely determined by their characteristic functions, and to prove that weak convergence is equivalent to pointwise convergence of characteristic functions. To show these results, we will employ a method based on convolutions with normal distributions.
We will need three technical lemmas. Lemma 3.4.16 shows that convoluting a probability
measure with a normal distribution approximates the original probability measure when the
mean in the normal distribution is zero and the variance is small. Lemma 3.4.17 will show
that if we wish to prove weak convergence of some sequence (µn ) to some µ, it suffices to
prove weak convergence when both the sequence and the limit are convoluted with a normal
distribution with mean zero and small variance. Intuitively, this is not a surprising result. Its
usefulness is clarified by Lemma 3.4.18, which states that the convolution of any probability
measure µ with a particular normal distribution has density with respect to the Lebesgue
measure, and the density can be obtained in closed form in terms of the characteristic function
of the measure µ. This is a frequently seen feature of convolutions: The convolution of two
probability measures in general inherits the regularity properties of each of the convoluted
measures, in this particular case, the regularity property of having a density with respect
to the Lebesgue measure. Summing up, Lemma 3.4.16 shows that convolutions with small
normal distributions are close to the original probability measure, Lemma 3.4.17 shows that
in order to prove weak convergence, it suffices to consider probability measures convoluted
with normal distributions, and Lemma 3.4.18 shows that such convolutions possess good
regularity properties.
Lemma 3.4.16. Let µ be a probability measure on (R, B). Let ξk be the normal distribution with mean zero and variance 1/k. Then µ ∗ ξk →wk µ.
Proof. Consider a probability space endowed with two independent random variables X and Y, where X has distribution µ and Y follows a standard normal distribution. Define Yk = (1/√k) Y; then Yk is independent of X and has distribution ξk. As a consequence, we also obtain P(|Yk| > δ) ≤ δ^{−2} E|Yk|² = δ^{−2}/k by Lemma 1.2.7, so Yk converges in probability to 0. Therefore, Lemma 3.3.2 yields X + Yk →D X. However, by Lemma 3.4.14, X + Yk has distribution µ ∗ ξk. Thus, we conclude that µ ∗ ξk →wk µ.
Lemma 3.4.17. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure. Let ξk be the normal distribution with mean zero and variance 1/k. If it holds for all k ≥ 1 that µn ∗ ξk →wk µ ∗ ξk, then µn →wk µ as well.
Proof. According to Theorem 3.1.7, it suffices to show that limn→∞ ∫ f dµn = ∫ f dµ for f ∈ Cbu(R). To do so, let f ∈ Cbu(R). Fix n, k ≥ 1. For convenience, assume given a probability space with independent random variables Xn, Yk and X, such that Xn has distribution µn, Yk has distribution ξk and X has distribution µ. By the triangle inequality, we then have

|∫ f dµn − ∫ f dµ| ≤ |Ef(Xn) − Ef(Xn + Yk)| + |Ef(Xn + Yk) − Ef(X + Yk)|
                    + |Ef(X + Yk) − Ef(X)|.    (3.13)
We will prove that the limit superior of the left-hand side is zero by bounding the limit superior of each of the three terms on the right-hand side. First note that by our assumptions, limn→∞ |Ef(Xn + Yk) − Ef(X + Yk)| = limn→∞ |∫ f d(µn ∗ ξk) − ∫ f d(µ ∗ ξk)| = 0. Now consider some ε > 0. Pick δ > 0 corresponding to ε for the uniform continuity of f. We then obtain |Ef(Xn) − Ef(Xn + Yk)| ≤ ε + 2‖f‖∞P(|Yk| > δ), and similarly, |Ef(X) − Ef(X + Yk)| ≤ ε + 2‖f‖∞P(|Yk| > δ) as well. Combining these observations with (3.13), we get lim supn→∞ |∫ f dµn − ∫ f dµ| ≤ 2ε + 4‖f‖∞P(|Yk| > δ). By Lemma 1.2.7, P(|Yk| > δ) ≤ δ⁻²E|Yk|² = δ⁻²/k, so limk→∞ P(|Yk| > δ) = 0. All in all, this yields lim supn→∞ |∫ f dµn − ∫ f dµ| ≤ 2ε. As ε > 0 was arbitrary, this proves limn→∞ ∫ f dµn = ∫ f dµ, and thus µn −wk→ µ by Theorem 3.1.7.
Lemma 3.4.18. Let µ be some probability measure, and let ξk be the normal distribution with mean zero and variance 1/k. Let ϕ be the characteristic function for µ. The probability measure µ ∗ ξk then has density f with respect to the Lebesgue measure, and the density is given by

f(u) = (1/2π) ∫ ϕ(x) exp(−x²/(2k)) e^{−iux} dx.
Proof. As ξk has density x ↦ (√k/√2π) exp(−kx²/2) with respect to the Lebesgue measure, the convolution µ ∗ ξk has density with respect to the Lebesgue measure as well, and the density f is given by f(u) = (√k/√2π) ∫ exp(−(k/2)(y − u)²) dµ(y). By Example 3.4.10, exp(−(k/2)(y − u)²) is the characteristic function of the normal distribution with mean zero and variance k, evaluated in y − u. Therefore, we have

exp(−(k/2)(y − u)²) = ∫ e^{i(y−u)x} (1/√(2πk)) exp(−x²/(2k)) dx.

Substituting this in our expression for the density and applying Fubini's theorem, we obtain

f(u) = (√k/√2π) ∫ ∫ e^{i(y−u)x} (1/√(2πk)) exp(−x²/(2k)) dx dµ(y)
     = (1/2π) ∫ ∫ e^{i(y−u)x} exp(−x²/(2k)) dx dµ(y)
     = (1/2π) ∫ (∫ e^{iyx} dµ(y)) exp(−x²/(2k)) e^{−iux} dx
     = (1/2π) ∫ ϕ(x) exp(−x²/(2k)) e^{−iux} dx,

proving the lemma.
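The inversion formula of Lemma 3.4.18 can be checked numerically. The following Python sketch (an illustration only; the integration grid and the choice µ = δ0 are assumptions made for the example) approximates the integral by a Riemann sum. For µ = δ0, the point mass at zero, we have ϕ ≡ 1 and δ0 ∗ ξ1 = N(0, 1), so the formula should reproduce the standard normal density.

```python
import numpy as np

def smoothed_density(phi, u, k):
    # f(u) = (1/(2*pi)) * integral of phi(x) * exp(-x^2/(2k)) * e^{-iux} dx  (Lemma 3.4.18)
    x = np.linspace(-50.0, 50.0, 200001)
    dx = x[1] - x[0]
    vals = phi(x) * np.exp(-x**2 / (2.0 * k)) * np.exp(-1j * x * u)
    return float((vals.sum() * dx).real) / (2.0 * np.pi)

# mu = delta_0 has characteristic function identically 1, and delta_0 * xi_1 = N(0, 1)
phi_dirac = lambda x: np.ones_like(x, dtype=complex)
f = smoothed_density(phi_dirac, 0.7, 1)
# f agrees with the standard normal density evaluated at 0.7
```

The Gaussian factor exp(−x²/(2k)) makes the integrand decay rapidly, which is why a truncated grid suffices.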
With Lemma 3.4.17 and Lemma 3.4.18 in hand, our main results on characteristic functions
now follow without much difficulty.
Theorem 3.4.19. Let µ and ν be probability measures on (R, B) with characteristic functions
ϕ and φ, respectively. Then µ and ν are equal if and only if ϕ and φ are equal.
Proof. First assume that µ and ν are equal. As the characteristic function is defined as an integral with respect to the corresponding measure, ϕ and φ are then equal as well. Conversely, assume that ϕ and φ are equal. Let ξk be the normal distribution with mean zero and variance 1/k. By Lemma 3.4.18, µ ∗ ξk and ν ∗ ξk both have
densities with respect to the Lebesgue measure, and the densities fk and gk are given by
fk(u) = (1/2π) ∫ ϕ(x) exp(−x²/(2k)) e^{−iux} dx and
gk(u) = (1/2π) ∫ φ(x) exp(−x²/(2k)) e^{−iux} dx,
respectively. As ϕ and φ are equal, fk and gk are equal, and so µ ∗ ξk and ν ∗ ξk are equal.
By Lemma 3.4.16, µ ∗ ξk −wk→ µ and ν ∗ ξk −wk→ ν. As µ ∗ ξk and ν ∗ ξk are equal by our above observations, we find that µ ∗ ξk −wk→ µ and µ ∗ ξk −wk→ ν, yielding µ = ν by Lemma 3.1.5.
Theorem 3.4.20. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure. Assume that µn has characteristic function ϕn and that µ has characteristic function ϕ. Then µn −wk→ µ if and only if limn→∞ ϕn(θ) = ϕ(θ) for all θ ∈ R.
Proof. First assume that µn −wk→ µ. Fix θ ∈ R. Since x ↦ cos(θx) and x ↦ sin(θx) are in Cb(R), we obtain

limn→∞ ϕn(θ) = limn→∞ (∫ cos(θx) dµn(x) + i ∫ sin(θx) dµn(x))
             = ∫ cos(θx) dµ(x) + i ∫ sin(θx) dµ(x) = ϕ(θ),
as desired. This proves one implication. It remains to prove that if the characteristic functions
converge, the probability measures converge weakly.
In order to do so, assume that limn→∞ ϕn(θ) = ϕ(θ) for all θ ∈ R. We will use Lemma 3.4.17 and Lemma 3.4.18 to prove the result. Let ξk be the normal distribution with mean zero and variance 1/k. By Lemma 3.4.18, µn ∗ ξk and µ ∗ ξk both have densities with respect to the Lebesgue measure, and the densities fnk and fk are given by

fnk(u) = (1/2π) ∫ ϕn(x) exp(−x²/(2k)) e^{−iux} dx and
fk(u) = (1/2π) ∫ ϕ(x) exp(−x²/(2k)) e^{−iux} dx,

respectively. Since |ϕn| and |ϕ| are bounded by one, the dominated convergence theorem yields limn→∞ fnk(u) = fk(u) for all k ≥ 1 and u ∈ R. By Lemma 3.1.9, we may then conclude µn ∗ ξk −wk→ µ ∗ ξk for all k ≥ 1, and Lemma 3.4.17 then shows that µn −wk→ µ, as desired.
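Theorem 3.4.20 can be illustrated numerically. In the following sketch (the parameter choices λ = 2 and θ = 1.3 are arbitrary assumptions made for the example), we evaluate the characteristic functions of the binomial distributions with success probability λ/n and of the Poisson distribution with parameter λ; the pointwise convergence exhibited corresponds, by Theorem 3.4.20, to weak convergence, compare Exercise 3.4.

```python
import cmath

def phi_binomial(theta, n, p):
    # characteristic function of the binomial distribution with length n, success probability p
    return (1 - p + p * cmath.exp(1j * theta)) ** n

def phi_poisson(theta, lam):
    # characteristic function of the Poisson distribution with parameter lam
    return cmath.exp(lam * (cmath.exp(1j * theta) - 1))

lam, theta = 2.0, 1.3
errs = [abs(phi_binomial(theta, n, lam / n) - phi_poisson(theta, lam))
        for n in (10, 100, 1000)]
# errs shrinks toward zero as n grows
```

The error decreases at rate roughly 1/n, reflecting the quality of the approximation (1 + z/n)ⁿ ≈ eᶻ.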
For the following corollary, we introduce Cb∞ (R) as the set of continuous, bounded functions
f : R → R which are differentiable infinitely often with bounded derivatives.
Corollary 3.4.21. Let (µn) be a sequence of probability measures on (R, B), and let µ be some other probability measure. Then µn −wk→ µ if and only if limn→∞ ∫ f dµn = ∫ f dµ for f ∈ Cb∞(R).
Proof. As Cb∞(R) ⊆ Cb(R), it is immediate that if µn −wk→ µ, then limn→∞ ∫ f dµn = ∫ f dµ for f ∈ Cb∞(R). To show the converse implication, assume that limn→∞ ∫ f dµn = ∫ f dµ for f ∈ Cb∞(R). In particular, it holds for θ ∈ R that limn→∞ ∫ sin(θx) dµn(x) = ∫ sin(θx) dµ(x) and limn→∞ ∫ cos(θx) dµn(x) = ∫ cos(θx) dµ(x). Letting ϕn and ϕ denote the characteristic functions for µn and µ, respectively, we therefore obtain

limn→∞ ϕn(θ) = limn→∞ (∫ cos(θx) dµn(x) + i ∫ sin(θx) dµn(x))
             = ∫ cos(θx) dµ(x) + i ∫ sin(θx) dµ(x) = ϕ(θ)

for all θ ∈ R, so that Theorem 3.4.20 yields µn −wk→ µ, as desired.
3.5 Central limit theorems

In this section, we use our results from Section 3.4 to prove Lindeberg's central limit theorem, which gives sufficient conditions for a normalized sum of independent variables to be approximated by a normal distribution in the sense of weak convergence. This is one of the main classical results in the theory of weak convergence.
The proof relies on proving pointwise convergence of characteristic functions and applying
Theorem 3.4.20. In order to prove such pointwise convergence, we will be utilizing some
finer properties of the complex exponential, as well as a particular inequality for complex
numbers. We begin by proving these auxiliary results, after which we prove the central limit
theorem for the case of independent and identically distributed random variables. This result
is weaker than the Lindeberg central limit theorem to be proven later, but the arguments
applied illustrate well the techniques to be used in the more difficult proof of Lindeberg’s
central limit theorem, which is given afterwards.
Lemma 3.5.1. Let z1, . . . , zn and w1, . . . , wn be complex numbers with |zi| ≤ 1 and |wi| ≤ 1 for all i = 1, . . . , n. It then holds that |∏_{i=1}^n zi − ∏_{i=1}^n wi| ≤ ∑_{i=1}^n |zi − wi|.
Proof. By the triangle inequality, |∏_{i=1}^n zi − ∏_{i=1}^n wi| ≤ |z1 − w1||∏_{i=2}^n zi| + |w1||∏_{i=2}^n zi − ∏_{i=2}^n wi| ≤ |z1 − w1| + |∏_{i=2}^n zi − ∏_{i=2}^n wi|, since |zi| ≤ 1 and |w1| ≤ 1. The result then follows by induction on n.

Lemma 3.5.2. For x ≤ 0, it holds that |exp(x) − (1 + x)| ≤ x²/2. Furthermore, for all x ∈ R, it holds that |e^{ix} − (1 + ix − x²/2)| ≤ (1/3)|x|³ and |e^{ix} − (1 + ix − x²/2)| ≤ (3/2)x².

Proof. To prove the first inequality, we apply a first-order Taylor expansion of the exponential mapping around zero. Fix x ∈ R; by Taylor's theorem we then find that there exists ξ(x) between zero and x such that exp(x) = 1 + x + (1/2) exp(ξ(x))x², which for x ≤ 0 yields |exp(x) − (1 + x)| ≤ (1/2)|exp(ξ(x))x²| ≤ (1/2)x². This proves the first inequality.

Considering the remaining inequalities, recall that e^{ix} = cos x + i sin x. We therefore obtain

|e^{ix} − (1 + ix − x²/2)| ≤ |cos x − (1 − x²/2)| + |sin x − x|.

Recalling that cos′ = − sin, cos″ = − cos, sin′ = cos and sin″ = − sin, Taylor expansions around zero yield the existence of ξ*(x), ξ**(x), η*(x) and η**(x) between zero and x such that

cos x = 1 − (1/2) cos(ξ*(x))x² and sin x = x − (1/2) sin(ξ**(x))x²,
cos x = 1 − (1/2)x² + (1/6) sin(η*(x))x³ and sin x = x − (1/6) cos(η**(x))x³,

allowing us to obtain

|cos x − (1 − x²/2)| + |sin x − x| ≤ (1/6)|x|³ + (1/6)|x|³ = (1/3)|x|³ as well as
|cos x − (1 − x²/2)| + |sin x − x| ≤ x² + (1/2)x² = (3/2)x²,

proving the remaining inequalities.
The combination of Lemma 3.5.1, Lemma 3.5.2 and Theorem 3.4.20 is sufficient to obtain
the following central limit theorem for independent and identically distributed variables.
Theorem 3.5.3 (Classical central limit theorem). Let (Xn) be a sequence of independent and identically distributed random variables with mean ξ and variance σ², where σ > 0. It then holds that

(1/√n) ∑_{k=1}^n (Xk − ξ)/σ −D→ N(0, 1),

where N(0, 1) denotes the standard normal distribution.
Proof. It suffices to consider the case where ξ = 0 and σ² = 1. In this case, we have to argue that (1/√n) ∑_{k=1}^n Xk −D→ N(0, 1). Denote by ϕ the common characteristic function of Xn for n ≥ 1, and denote by ϕn the characteristic function of (1/√n) ∑_{k=1}^n Xk. Lemma 3.4.15 and Lemma 3.4.9 show that ϕn(θ) = ϕ(θ/√n)ⁿ. Recalling from Example 3.4.10 that the standard normal distribution has characteristic function θ ↦ exp(−θ²/2), Theorem 3.4.20 yields that in order to prove the result, it suffices to show for all θ ∈ R that

limn→∞ ϕ(θ/√n)ⁿ = e^{−θ²/2}.    (3.14)
To do so, first note that by Lemma 3.4.8 and Lemma 3.5.1 we obtain

|ϕ(θ/√n)ⁿ − exp(−θ²/2)| = |ϕ(θ/√n)ⁿ − exp(−θ²/(2n))ⁿ|
                        ≤ n|ϕ(θ/√n) − exp(−θ²/(2n))|.    (3.15)

Now, as the variables (Xn) have second moment, we have from Lemma 3.4.8 that ϕ is two times continuously differentiable with ϕ(0) = 1, ϕ′(0) = 0 and ϕ″(0) = −1. Therefore, a first-order Taylor expansion shows that for each θ ∈ R, there exists ξ(θ) between 0 and θ such that ϕ(θ) = ϕ(0) + ϕ′(0)θ + (1/2)ϕ″(ξ(θ))θ² = 1 + (1/2)θ²ϕ″(ξ(θ)). In particular, this yields

ϕ(θ/√n) = 1 + (θ²/(2n))ϕ″(ξ(θ/√n))
         = 1 − θ²/(2n) + (θ²/(2n))(1 + ϕ″(ξ(θ/√n))).    (3.16)
Combining (3.15) and (3.16) and applying the first inequality of Lemma 3.5.2, we obtain

|ϕ(θ/√n)ⁿ − exp(−θ²/2)| ≤ n|1 − θ²/(2n) + (θ²/(2n))(1 + ϕ″(ξ(θ/√n))) − exp(−θ²/(2n))|
                        ≤ n|1 − θ²/(2n) − exp(−θ²/(2n))| + (θ²/2)|1 + ϕ″(ξ(θ/√n))|
                        ≤ (n/2)(θ²/(2n))² + (θ²/2)|1 + ϕ″(ξ(θ/√n))|
                        = θ⁴/(8n) + (θ²/2)|1 + ϕ″(ξ(θ/√n))|.    (3.17)
Now, as n tends to infinity, θ/√n tends to zero, and so ξ(θ/√n) tends to zero. As ϕ″ by Lemma 3.4.8 is continuous with ϕ″(0) = −1, this implies limn→∞ ϕ″(ξ(θ/√n)) = −1. Therefore, we obtain from (3.17) that lim supn→∞ |ϕ(θ/√n)ⁿ − exp(−θ²/2)| = 0, proving (3.14). As a consequence, Theorem 3.4.20 yields (1/√n) ∑_{k=1}^n Xk −D→ N(0, 1). This concludes the proof.
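Theorem 3.5.3 lends itself to a simple Monte Carlo illustration. The following sketch (the exponential distribution, sample sizes and seed are assumptions made for the example) standardizes sums of exponential variables, for which ξ = σ = 1, and checks that the resulting sample behaves approximately like a standard normal sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 4000
# rows of iid exponential(1) variables, for which xi = sigma = 1
samples = rng.exponential(1.0, size=(reps, n))
# standardized sums (1/sqrt(n)) * sum_k (X_k - xi)/sigma
z = (samples - 1.0).sum(axis=1) / np.sqrt(n)
frac_nonpos = float(np.mean(z <= 0))   # should be near Phi(0) = 1/2
var_z = float(z.var())                 # should be near 1
```

The residual asymmetry of the exponential distribution appears only as an O(1/√n) perturbation of these quantities.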
Theorem 3.5.3 and its proof demonstrate that in spite of the apparently deep nature of the central limit theorem, the essential ingredients in its proof are simply first-order Taylor expansions, bounds on the exponential function and Theorem 3.4.20. Next, we will show how to extend Theorem 3.5.3 to the case where the random variables are not necessarily identically distributed. The suitable framework for the statement of such more general results is that of triangular arrays.
Definition 3.5.4. A triangular array is a double sequence (Xnk )n≥k≥1 of random variables.
Let (Xnk )n≥k≥1 be a triangular array. We think of (Xnk )n≥k≥1 as ordered in the shape of a
triangle as follows:
X11
X21 X22
X31 X32 X33
.. .. .. ..
. . . .
We may then define the row sums by putting Sn = ∑_{k=1}^n Xnk, and we wish to establish
conditions under which Sn converges in distribution to a normally distributed limit. In
general, we will consider the case where (Xnk )k≤n is independent for each n ≥ 1, where
EXnk = 0 for all n ≥ k ≥ 1 and where limn→∞ V Sn = 1. In this case, it is natural to hope
that under suitable regularity conditions, Sn converges in distribution to a standard normal
distribution. The following example shows how the case considered in Theorem 3.5.3 can be
put in terms of a triangular array.
Example 3.5.5. Let (Xn) be a sequence of independent and identically distributed random variables with mean ξ and variance σ², where σ > 0. Define a triangular array by putting Xnk = (1/√n)(Xk − ξ)/σ for n ≥ k ≥ 1. For each n, the family (Xnk)k≤n is then independent with EXnk = 0, and the row sums satisfy Sn = ∑_{k=1}^n Xnk = (1/√n) ∑_{k=1}^n (Xk − ξ)/σ and V Sn = 1. The statement of Theorem 3.5.3 is then precisely that Sn converges in distribution to the standard normal distribution. ◦

Theorem 3.5.6 (Lindeberg's central limit theorem). Let (Xnk)n≥k≥1 be a triangular array of variables with second moment. Assume that for each n ≥ 1, the family (Xnk)k≤n is independent, and assume that EXnk = 0 for all n ≥ k ≥ 1. With Sn = ∑_{k=1}^n Xnk, assume that limn→∞ V Sn = 1. Finally, assume that for all c > 0,

limn→∞ ∑_{k=1}^n E1(|Xnk|>c)Xnk² = 0.    (3.18)

It then holds that Sn −D→ N(0, 1), where N(0, 1) denotes the standard normal distribution.
Proof. We define σnk² = V Xnk and ηn² = ∑_{k=1}^n σnk². Our strategy for the proof will be similar to that for the proof of Theorem 3.5.3. Let ϕnk be the characteristic function of Xnk, and let ϕn be the characteristic function of Sn. As (Xnk)k≤n is independent for each n ≥ 1, Lemma 3.4.15 shows that ϕn(θ) = ∏_{k=1}^n ϕnk(θ). Recalling Example 3.4.10, we find that by Theorem 3.4.20, in order to prove the theorem, it suffices to show for all θ ∈ R that

limn→∞ ∏_{k=1}^n ϕnk(θ) = exp(−θ²/2).    (3.19)
First note that by the triangle inequality, Lemma 3.4.8 and Lemma 3.5.1, we obtain

|∏_{k=1}^n ϕnk(θ) − exp(−θ²/2)| ≤ |exp(−ηn²θ²/2) − exp(−θ²/2)| + |∏_{k=1}^n ϕnk(θ) − exp(−ηn²θ²/2)|
                                ≤ |exp(−ηn²θ²/2) − exp(−θ²/2)| + ∑_{k=1}^n |ϕnk(θ) − exp(−σnk²θ²/2)|,

where the former term tends to zero, since limn→∞ ηn = 1 by our assumptions. We wish to show that the latter term also tends to zero. By Lemma 3.5.2, we have

|ϕnk(θ) − exp(−σnk²θ²/2)| ≤ |ϕnk(θ) − (1 − σnk²θ²/2)| + |exp(−σnk²θ²/2) − (1 − σnk²θ²/2)|
                          ≤ |ϕnk(θ) − (1 − σnk²θ²/2)| + (1/2)(σnk²θ²/2)²,
such that

∑_{k=1}^n |ϕnk(θ) − exp(−σnk²θ²/2)| ≤ ∑_{k=1}^n |ϕnk(θ) − (1 − σnk²θ²/2)| + (1/2) ∑_{k=1}^n (σnk²θ²/2)²
                                    = ∑_{k=1}^n |ϕnk(θ) − (1 − σnk²θ²/2)| + (θ⁴/8) ∑_{k=1}^n σnk⁴.

Combining our conclusions, we find that (3.19) follows if only we can show

limn→∞ ∑_{k=1}^n |ϕnk(θ) − (1 − σnk²θ²/2)| = 0 and limn→∞ ∑_{k=1}^n σnk⁴ = 0.    (3.20)
Consider the first limit in (3.20). Fix c > 0. As EXnk = 0 and EXnk² = σnk², we may apply the two final inequalities of Lemma 3.5.2 to obtain

∑_{k=1}^n |ϕnk(θ) − (1 − σnk²θ²/2)| = ∑_{k=1}^n |Ee^{iθXnk} − 1 − iθEXnk + (θ²/2)EXnk²|
                                    ≤ ∑_{k=1}^n E|e^{iθXnk} − 1 − iθXnk + (θ²/2)Xnk²|
                                    ≤ ∑_{k=1}^n (E1(|Xnk|≤c)(1/3)|θXnk|³ + E1(|Xnk|>c)(3/2)|θXnk|²)
                                    ≤ (c|θ|³/3)ηn² + (3θ²/2) ∑_{k=1}^n E1(|Xnk|>c)Xnk².    (3.21)
Now, by our assumption (3.18), limn→∞ ∑_{k=1}^n E1(|Xnk|>c)Xnk² = 0, while we also have limn→∞ (c|θ|³/3)ηn² = c|θ|³/3. Applying these results with the bound (3.21), we obtain

lim supn→∞ ∑_{k=1}^n |ϕnk(θ) − (1 − σnk²θ²/2)| ≤ c|θ|³/3,

and as c > 0 was arbitrary, this yields limn→∞ ∑_{k=1}^n |ϕnk(θ) − (1 − σnk²θ²/2)| = 0, as desired.
For the second limit in (3.20), we note that for all c > 0, it holds that

∑_{k=1}^n σnk⁴ ≤ (max_{k≤n} σnk²) ∑_{k=1}^n σnk² = ηn² max_{k≤n} EXnk² ≤ ηn²(c² + ∑_{k=1}^n E1(|Xnk|>c)Xnk²),

since EXnk² ≤ c² + E1(|Xnk|>c)Xnk² for each k ≤ n. By (3.18) and as limn→∞ ηn = 1, this yields lim supn→∞ ∑_{k=1}^n σnk⁴ ≤ c², and as c > 0 was arbitrary, we conclude limn→∞ ∑_{k=1}^n σnk⁴ = 0, completing the proof.
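Lindeberg's condition (3.18) is straightforward to evaluate in concrete cases. As an assumed example, consider the triangular array Xnk = Uk/√n with Uk uniform on (−√3, √3), so that EXnk = 0 and V Sn = 1; since |Xnk| ≤ √(3/n), the Lindeberg sum vanishes identically once √(3/n) ≤ c. The sketch below computes the sum by numerical integration.

```python
import numpy as np

def lindeberg_sum(n, c):
    # sum_{k=1}^n E 1(|X_nk| > c) X_nk^2 for X_nk = U_k / sqrt(n),
    # U_k uniform on (-sqrt(3), sqrt(3)); computed by a Riemann sum
    a = np.sqrt(3.0)
    u = np.linspace(-a, a, 100001)
    du = u[1] - u[0]
    x = u / np.sqrt(n)
    integrand = np.where(np.abs(x) > c, x**2, 0.0) / (2.0 * a)  # uniform density 1/(2a)
    return n * float(integrand.sum() * du)

small_n, large_n = lindeberg_sum(10, 0.1), lindeberg_sum(400, 0.1)
# positive for small n, exactly zero once sqrt(3/n) <= c
```

Bounded arrays with vanishing individual variances satisfy (3.18) for this reason: the indicator eventually cuts off the entire support.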
The conditions given in Theorem 3.5.6 are in many cases sufficient to obtain convergence in distribution to the standard normal distribution. The main condition, (3.18), is known as Lindeberg's condition. The condition, however, is not always easy to check. The following result yields a central limit theorem whose conditions are less difficult to verify. Here, the condition (3.22) is known as Lyapounov's condition.
Theorem 3.5.7 (Lyapounov's central limit theorem). Let (Xnk)n≥k≥1 be a triangular array of variables with third moment. Assume that for each n ≥ 1, the family (Xnk)k≤n is independent, and assume that EXnk = 0 for all n ≥ k ≥ 1. With Sn = ∑_{k=1}^n Xnk, assume that limn→∞ V Sn = 1. Finally, assume that there is δ > 0 such that

limn→∞ ∑_{k=1}^n E|Xnk|^{2+δ} = 0.    (3.22)

It then holds that Sn −D→ N(0, 1), where N(0, 1) denotes the standard normal distribution.
Proof. We note that for c > 0, it holds that |Xnk| > c implies 1 ≤ |Xnk|^δ/c^δ, and so

∑_{k=1}^n E1(|Xnk|>c)Xnk² ≤ ∑_{k=1}^n E1(|Xnk|>c)(1/c^δ)|Xnk|^{2+δ} ≤ (1/c^δ) ∑_{k=1}^n E|Xnk|^{2+δ},

so Lyapounov's condition (3.22) implies Lindeberg's condition (3.18). Therefore, the result follows from Theorem 3.5.6.
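The inequality in the proof holds pointwise: for |x| > c we have x² ≤ |x|^{2+δ}/c^δ. A small numerical sanity check of the resulting domination (the sampling distribution and the choices of c and δ are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)          # stand-in sample for a row of the array
c, delta = 0.5, 1.0
lhs = float(np.mean(np.where(np.abs(x) > c, x**2, 0.0)))   # Lindeberg-type term
rhs = float(np.mean(np.abs(x)**(2 + delta))) / c**delta    # Lyapounov-type bound
# lhs <= rhs, since x^2 <= |x|^(2+delta)/c^delta whenever |x| > c
```

Since the inequality holds for every sample point, it holds for the sample averages as well, mirroring the expectation inequality in the proof.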
In order to apply Theorem 3.5.7, we require that the random variables in the triangular array
have third moments. In many cases, this requirement is satisfied, and so Lyapounov’s central
limit theorem is frequently useful. However, the moment condition is too strong to obtain
the classical central limit theorem of Theorem 3.5.3 as a corollary. As the following example
shows, this theorem in fact does follow as a corollary from the stronger Lindeberg’s central
limit theorem.
Example 3.5.8. Let (Xn) be a sequence of independent and identically distributed random variables with mean ξ and variance σ², where σ > 0. As in Example 3.5.5, we define a triangular array by putting Xnk = (1/√n)(Xk − ξ)/σ for n ≥ k ≥ 1. The elements of each row are then independent, with EXnk = 0, and the row sums of the triangular array are given by Sn = ∑_{k=1}^n Xnk = (1/√n) ∑_{k=1}^n (Xk − ξ)/σ and satisfy V Sn = 1. We obtain for c > 0 that

limn→∞ ∑_{k=1}^n E1(|Xnk|>c)Xnk² = limn→∞ ∑_{k=1}^n E1(|Xk−ξ|>cσ√n)(Xk − ξ)²/(σ²n)
                                 = limn→∞ (1/σ²)E1(|X1−ξ|>cσ√n)(X1 − ξ)² = 0,

by the dominated convergence theorem, since X1 has second moment. Thus, we conclude that the triangular array satisfies Lindeberg's condition, and therefore, Theorem 3.5.6 applies and yields (1/√n) ∑_{k=1}^n (Xk − ξ)/σ = Sn −D→ N(0, 1), as in Theorem 3.5.3. ◦
3.6 Asymptotic normality

In Section 3.5, we saw examples of particular normalized sums of random variables converging to a standard normal distribution. The intuitive interpretation of these results is that the non-normalized sums approximate normal distributions with nonstandard parameters. In order to work conveniently with this idea, we introduce in this section the notion of asymptotic normality.
Definition 3.6.1. Let (Xn) be a sequence of random variables, and let ξ and σ be real with σ > 0. We say that Xn is asymptotically normal with mean ξ and variance σ²/n if it holds that √n(Xn − ξ) −D→ N(0, σ²), where N(0, σ²) denotes the normal distribution with mean zero and variance σ². If this is the case, we write

Xn ∼as N(ξ, σ²/n).    (3.23)
The results of Theorem 3.5.3 can be restated in terms of asymptotic normality as follows. Assume that (Xn) is a sequence of independent and identically distributed random variables with mean ξ and variance σ², where σ > 0. Theorem 3.5.3 then states that (1/√n) ∑_{k=1}^n (Xk − ξ)/σ −D→ N(0, 1). By Lemma 3.1.8, this implies (1/√n) ∑_{k=1}^n (Xk − ξ) −D→ N(0, σ²), and so

√n((1/n) ∑_{k=1}^n Xk − ξ) = (1/√n) ∑_{k=1}^n (Xk − ξ) −D→ N(0, σ²),

which by Definition 3.6.1 corresponds to (1/n) ∑_{k=1}^n Xk ∼as N(ξ, σ²/n). The intuitive content of this statement is that as n tends to infinity, the average (1/n) ∑_{k=1}^n Xk is approximated by a normal distribution with the same mean and variance as the empirical average, namely ξ and σ²/n.
We next show two properties of asymptotic normality: first, that an asymptotically normal sequence (Xn) converges in probability to its mean, and second, that asymptotic normality is preserved under transformation by certain mappings. These results are of considerable practical importance when analyzing the asymptotic properties of estimators based on independent and identically distributed samples.
Lemma 3.6.2. Let (Xn) be a sequence of random variables, and let ξ and σ be real with σ > 0. Assume that Xn is asymptotically normal with mean ξ and variance σ²/n. It then holds that Xn −P→ ξ.
Proof. Fix ε > 0. As Xn is asymptotically normal, √n(Xn − ξ) −D→ N(0, σ²), so Lemma 3.1.6 yields limM→∞ supn≥1 P(√n|Xn − ξ| ≥ M) = 0. Now let M > 0; we then have

lim supn→∞ P(|Xn − ξ| ≥ ε) = lim supn→∞ P(√n|Xn − ξ| ≥ √nε)
                           ≤ lim supn→∞ P(√n|Xn − ξ| ≥ M)
                           ≤ supn≥1 P(√n|Xn − ξ| ≥ M),

and as M > 0 was arbitrary, this implies lim supn→∞ P(|Xn − ξ| ≥ ε) = 0. As ε > 0 was arbitrary, we obtain Xn −P→ ξ.
Theorem 3.6.3 (The delta method). Let (Xn) be a sequence of random variables, and let ξ and σ be real with σ > 0. Assume that Xn is asymptotically normal with mean ξ and variance σ²/n. Let f : R → R be measurable and differentiable in ξ. Then f(Xn) is asymptotically normal with mean f(ξ) and variance σ²f′(ξ)²/n.
Proof. By our assumptions, √n(Xn − ξ) −D→ N(0, σ²). Our objective is to demonstrate that √n(f(Xn) − f(ξ)) −D→ N(0, σ²f′(ξ)²). Note that when defining R : R → R by putting R(x) = f(x) − f(ξ) − f′(ξ)(x − ξ), we obtain f(x) = f(ξ) + f′(ξ)(x − ξ) + R(x), and in particular

√n(f(Xn) − f(ξ)) = √n(f′(ξ)(Xn − ξ) + R(Xn))
                 = f′(ξ)√n(Xn − ξ) + √nR(Xn).    (3.24)

As √n(Xn − ξ) −D→ N(0, σ²), Lemma 3.1.8 shows that f′(ξ)√n(Xn − ξ) −D→ N(0, σ²f′(ξ)²). Therefore, by Lemma 3.3.2, the result will follow if we can prove √nR(Xn) −P→ 0. To this end, let ε > 0. Note that as f is differentiable at ξ, we have

limx→ξ R(x)/(x − ξ) = limx→ξ ((f(x) − f(ξ))/(x − ξ) − f′(ξ)) = 0.

Defining r(x) = R(x)/(x − ξ) when x ≠ ξ and r(ξ) = 0, we then find that r is measurable and continuous at ξ, and R(x) = (x − ξ)r(x). In particular, there exists δ > 0 such that whenever |x − ξ| < δ, we have |r(x)| < ε. It then also holds that if |r(x)| ≥ ε, we have |x − ξ| ≥ δ. From this and Lemma 3.6.2, we get lim supn→∞ P(|r(Xn)| ≥ ε) ≤ lim supn→∞ P(|Xn − ξ| ≥ δ) = 0, so r(Xn) −P→ 0. As the multiplication mapping (x, y) ↦ xy is continuous, we obtain by Theorem 3.3.3 that √nR(Xn) = √n(Xn − ξ)r(Xn) −D→ 0, and so by Lemma 3.3.1, we get √nR(Xn) −P→ 0. Combining our conclusions with (3.24), Lemma 3.3.2 now shows that √n(f(Xn) − f(ξ)) −D→ N(0, σ²f′(ξ)²), completing the proof.
Using the preceding results, we may now give an example of a practical application of the central limit theorem and asymptotic normality.
Example 3.6.4. As in Example 1.5.4, consider a measurable space (Ω, F) endowed with a
sequence of random variables (Xn ). Assume given for each ξ ∈ R a probability measure Pξ
such that for the probability space (Ω, F, Pξ ), (Xn ) consists of independent and identically
distributed variables with mean ξ and unit variance. We may then define an estimator of the mean by putting ξ̂n = (1/n) ∑_{k=1}^n Xk. As the variables have second moment, Theorem 3.5.3 shows that ξ̂n is asymptotically normal with mean ξ and variance 1/n.
This intuitively gives us some information about the distribution of ξˆn for large n ≥ 1. In
order to make practical use of this, let 0 < γ < 1. We consider the problem of obtaining a
confidence interval for the parameter ξ with confidence level approximating γ as n tends to
infinity. The statement that ξˆn is asymptotically normal with the given parameters means
that √n(ξ̂n − ξ) −D→ N(0, 1). With Φ denoting the cumulative distribution function for the standard normal distribution, we obtain limn→∞ Pξ(√n(ξ̂n − ξ) ≤ x) = Φ(x) for all x ∈ R by Lemma 3.2.1. Now let zγ be such that Φ(−zγ) = (1 − γ)/2, meaning that we have zγ = −Φ⁻¹((1 − γ)/2). As (1 − γ)/2 < 1/2, zγ > 0. Also, Φ(zγ) = 1 − Φ(−zγ) = 1 − (1 − γ)/2, and so we obtain

limn→∞ Pξ(−zγ ≤ √n(ξ̂n − ξ) ≤ zγ) = Φ(zγ) − Φ(−zγ) = γ.
Furthermore,

Pξ(−zγ ≤ √n(ξ̂n − ξ) ≤ zγ) = Pξ(−zγ/√n ≤ ξ̂n − ξ ≤ zγ/√n)
                           = Pξ(ξ̂n − zγ/√n ≤ ξ ≤ ξ̂n + zγ/√n),

so if we define Iγ = (ξ̂n − zγ/√n, ξ̂n + zγ/√n), we have limn→∞ Pξ(ξ ∈ Iγ) = γ for all ξ ∈ R. This means that, asymptotically speaking, there is probability γ that Iγ contains ξ. In particular, as Φ(−1.96) ≈ 2.5%, we find that (ξ̂n − 1.96/√n, ξ̂n + 1.96/√n) is a confidence interval with a confidence level approaching a number close to 95% as n tends to infinity. ◦
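The construction in Example 3.6.4 is straightforward to implement. The following sketch (using the Python standard library's NormalDist; the function names are our own) computes zγ = −Φ⁻¹((1 − γ)/2) and the interval (ξ̂n − zγ/√n, ξ̂n + zγ/√n) for unit-variance data.

```python
from statistics import NormalDist

def z_value(gamma):
    # z_gamma = -Phi^{-1}((1 - gamma)/2), so that Phi(z_gamma) - Phi(-z_gamma) = gamma
    return -NormalDist().inv_cdf((1.0 - gamma) / 2.0)

def confidence_interval(xs, gamma):
    # I_gamma = (xi_hat - z_gamma/sqrt(n), xi_hat + z_gamma/sqrt(n)); unit variance assumed
    n = len(xs)
    xi_hat = sum(xs) / n
    half = z_value(gamma) / n**0.5
    return xi_hat - half, xi_hat + half

z95 = z_value(0.95)   # approximately 1.96, as in the example
```

For γ = 0.95 this recovers the familiar value zγ ≈ 1.96 used in the example.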
3.7 Higher dimensions
Throughout this chapter, we have worked with weak convergence of random variables with
values in R, as well as probability measures on (R, B). Among our most important results
are the results that weak convergence is equivalent to convergence of characteristic functions,
the interplay between convergence in distribution and convergence in probability, the central
limit theorems and our results on asymptotic normality. The theory of weak convergence
and all of its major results can be extended to the more general context of random variables
with values in Rd and probability measures on (Rd , Bd ) for d ≥ 1, and to a large degree, it
is these multidimensional results which are most useful in practice. In this section, we state
the main results from the multidimensional theory of weak convergence without proof.
Definition 3.7.1. Let (µn) be a sequence of probability measures on (Rd, Bd), and let µ be another probability measure. We say that µn converges weakly to µ and write µn −wk→ µ if it holds for all f ∈ Cb(Rd) that limn→∞ ∫ f dµn = ∫ f dµ.
As in the univariate case, the limit measure is determined uniquely. Also, we say that a
sequence of random variables (Xn ) with values in Rd converges in distribution to a random
variable X with values in Rd or a probability measure µ on (Rd , Bd ) if the distributions
converge weakly. The following analogue of Lemma 3.1.8 then holds.
Lemma 3.7.2. Let (µn) be a sequence of probability measures on (Rd, Bd), and let µ be another probability measure. Let h : Rd → Rp be some continuous mapping. If it holds that µn −wk→ µ, then it also holds that h(µn) −wk→ h(µ).
Theorem 3.7.3 (Cramér–Wold's device). Let (Xn) be a sequence of random variables with values in Rd, and let X be some other such variable. Then Xn −D→ X if and only if it holds for all θ ∈ Rd that θᵗXn −D→ θᵗX.
Letting (Xn)n≥1 be a sequence of random variables with values in Rd and letting X be some other such variable, we may define a multidimensional analogue of convergence in probability by saying that Xn converges in probability to X, and writing Xn −P→ X, when limn→∞ P(‖Xn − X‖ > ε) = 0 for all ε > 0.
We may also define characteristic functions in the multidimensional setting. Let µ be a probability measure on (Rd, Bd). We define the characteristic function for µ to be the mapping ϕ : Rd → C defined by ϕ(θ) = ∫ e^{iθᵗx} dµ(x). As in the one-dimensional case, the characteristic function determines the probability measure uniquely, and weak convergence is equivalent to pointwise convergence of characteristic functions.
where N (0, Σ) denotes the normal distribution with mean zero and variance matrix Σ.
Note that Theorem 3.7.6 reduces to Theorem 3.6.3 for d = p = 1, and in the one-dimensional
case, the products in the expression for the asymptotic variance commute, leading to a simpler
expression in the one-dimensional case than in the multidimensional case.
To show the strength of the multidimensional theory, we give the following example, extending Example 3.6.4.
Example 3.7.7. As in Example 3.6.4, consider a measurable space (Ω, F) endowed with a
sequence of random variables (Xn ). Let Θ = R × (0, ∞). Assume for each θ = (ξ, σ 2 ) that
we are given a probability measure Pθ such that for the probability space (Ω, F, Pθ ), (Xn )
consists of independent and identically distributed variables with fourth moment, and with
mean ξ and variance σ 2 . As in Example 1.5.4, we may then define estimators of the mean
and variance based on n samples by putting

ξ̂n = (1/n) ∑_{k=1}^n Xk and σ̂n² = (1/n) ∑_{k=1}^n Xk² − ((1/n) ∑_{k=1}^n Xk)².
Now note that the variables (Xn, Xn²) also are independent and identically distributed, and with ρ denoting Cov(Xn, Xn²) and η² denoting V Xn², we have

E(Xn, Xn²)ᵗ = (ξ, σ² + ξ²)ᵗ and V(Xn, Xn²)ᵗ = [ σ²  ρ ; ρ  η² ].    (3.25)

Let µ and Σ denote the mean and variance, respectively, in (3.25). By X̄n and X̄²n, we denote (1/n) ∑_{k=1}^n Xk and (1/n) ∑_{k=1}^n Xk², respectively. Using Theorem 3.7.5, we then obtain that (X̄n, X̄²n) is asymptotically normal with parameters (µ, (1/n)Σ).
We will use this multidimensional relationship to find the asymptotic distributions of ξ̂n and σ̂n², and we will do so by applying Theorem 3.7.6. To this end, we first consider the mapping f : R² → R given by f(x, y) = x. Note that we have Df(x, y) = (1 0). As ξ̂n = f(X̄n, X̄²n), Theorem 3.7.6 yields that ξ̂n is asymptotically normal with mean f(µ) = ξ and variance

(1/n)Df(µ)ΣDf(µ)ᵗ = (1/n)(1 0)[ σ²  ρ ; ρ  η² ](1 0)ᵗ = σ²/n,

in accordance with what we would have obtained by direct application of Theorem 3.5.3.
Next, we consider the variance estimator. Define g : R² → R by putting g(x, y) = y − x². We then have Dg(x, y) = (−2x 1). As σ̂n² = g(X̄n, X̄²n), Theorem 3.7.6 shows that σ̂n² is asymptotically normal with mean g(µ) = σ² and variance

(1/n)Dg(µ)ΣDg(µ)ᵗ = (1/n)(−2ξ 1)[ σ²  ρ ; ρ  η² ](−2ξ 1)ᵗ = (1/n)(4ξ²σ² − 4ξρ + η²).

Thus, applying Theorem 3.7.5 and Theorem 3.7.6, we have proven that both ξ̂n and σ̂n² are asymptotically normal, and we have identified the asymptotic parameters.
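The asymptotic variance of σ̂n² is a simple quadratic form and can be evaluated directly. In the sketch below (NumPy; the Gaussian moment formulas ρ = 2ξσ² and η² = 4ξ²σ² + 2σ⁴ are our own computation for normally distributed data, an assumption not made in the example), the general expression 4ξ²σ² − 4ξρ + η² collapses to the familiar value 2σ⁴.

```python
import numpy as np

def sigma2_hat_avar(xi, sigma2, rho, eta2):
    # Dg(mu) Sigma Dg(mu)^t with g(x, y) = y - x^2, so Dg(mu) = (-2*xi, 1);
    # the factor 1/n is left out
    Dg = np.array([-2.0 * xi, 1.0])
    Sigma = np.array([[sigma2, rho], [rho, eta2]])
    return float(Dg @ Sigma @ Dg)

# for N(xi, sigma^2) data (assumed): rho = 2*xi*sigma^2, eta^2 = 4*xi^2*sigma^2 + 2*sigma^4
xi, s2 = 1.5, 2.0
avar = sigma2_hat_avar(xi, s2, 2.0 * xi * s2, 4.0 * xi**2 * s2 + 2.0 * s2**2)
# avar = 4*xi^2*sigma^2 - 4*xi*rho + eta^2, which reduces to 2*sigma^4 here
```

For non-Gaussian data the same quadratic form applies, with ρ and η² replaced by the corresponding moments.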
Next, consider some 0 < γ < 1. We will show how to construct a confidence interval for ξ which has a confidence level approximating γ as n tends to infinity. Note that this was already accomplished in Example 3.6.4 in the case where the variance was known and equal to one. In this case, we have no such assumptions. Now, we already know that ξ̂n is asymptotically normal with parameters (ξ, σ²/n), meaning that √n(ξ̂n − ξ) −D→ N(0, σ²). Next, note that as σ̂n² is asymptotically normal with mean σ², Lemma 3.6.2 shows that σ̂n² −P→ σ². Therefore, using Theorem 3.3.3, we find that √n(ξ̂n − ξ)/√(σ̂n²) −D→ N(0, 1). We may now proceed as in Example 3.6.4 and note that with Φ denoting the cumulative distribution function for the standard normal distribution, limn→∞ Pθ(√n(ξ̂n − ξ)/√(σ̂n²) ≤ x) = Φ(x) for all x ∈ R by Lemma 3.2.1. Putting zγ = −Φ⁻¹((1 − γ)/2), we then obtain zγ > 0 and Φ(zγ) − Φ(−zγ) = γ, and if we define Iγ = (ξ̂n − zγ√(σ̂n²)/√n, ξ̂n + zγ√(σ̂n²)/√n), we then obtain

limn→∞ Pθ(ξ ∈ Iγ) = limn→∞ Pθ(ξ̂n − √(σ̂n²)zγ/√n ≤ ξ ≤ ξ̂n + √(σ̂n²)zγ/√n)
                  = limn→∞ Pθ(−√(σ̂n²)zγ/√n ≤ ξ̂n − ξ ≤ √(σ̂n²)zγ/√n)
                  = limn→∞ Pθ(−zγ ≤ √n(ξ̂n − ξ)/√(σ̂n²) ≤ zγ)
                  = Φ(zγ) − Φ(−zγ) = γ,

so that Iγ is a confidence interval with the desired asymptotic confidence level. ◦
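The studentized interval above translates directly into code. A minimal sketch (NumPy and the standard library's NormalDist; the function name is our own):

```python
import numpy as np
from statistics import NormalDist

def studentized_ci(xs, gamma=0.95):
    # I_gamma = (xi_hat - z*sqrt(s2_hat/n), xi_hat + z*sqrt(s2_hat/n))
    xs = np.asarray(xs, dtype=float)
    n = xs.size
    xi_hat = float(xs.mean())
    s2_hat = float((xs**2).mean() - xi_hat**2)   # the variance estimator sigma_hat_n^2
    z = -NormalDist().inv_cdf((1.0 - gamma) / 2.0)
    half = z * (s2_hat / n) ** 0.5
    return xi_hat - half, xi_hat + half

lo, hi = studentized_ci([0.0, 1.0] * 100)
# midpoint is the sample mean 0.5; half-width is approximately 1.96 * sqrt(0.25/200)
```

Unlike the interval of Example 3.6.4, no knowledge of the variance is required here; the estimate σ̂n² takes its place.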
3.8 Exercises
Exercise 3.1. Let (θn ) be a sequence of positive numbers. Let µn denote the uniform
distribution on [0, θn ]. Show that µn converges weakly if and only if θn is convergent. In the
affirmative case, identify the limiting distribution. ◦
Exercise 3.2. Let (µn ) be a sequence of probability measures concentrated on N0 , and let
wk
µ be another such probability measure. Show that µn −→ µ if and only if it holds that
limn→∞ µn ({k}) = µ({k}) for all k ≥ 0. ◦
Exercise 3.3. Let µn denote the Student's t-distribution with n degrees of freedom, that is, the distribution with density fn given by fn(x) = (Γ(n + 1/2)/(√(2nπ)Γ(n)))(1 + x²/(2n))^{−(n+1/2)}. Show that µn converges weakly to the standard normal distribution. ◦
Exercise 3.4. Let (pn ) be a sequence in (0, 1), and let µn be the binomial distribution with
success probability pn and length n. Assume that limn→∞ npn = λ for some λ ≥ 0. Show
that if λ > 0, then µn converges weakly to the Poisson distribution with parameter λ. Show
that if λ = 0, then µn converges weakly to the Dirac measure at zero. ◦
Exercise 3.5. Let Xn be a random variable which is Beta distributed with shape parameters (n, n). Define Yn = √(8n)(Xn − 1/2). Show that Yn has density with respect to the Lebesgue
measure. Show that the densities converge pointwise to the density of the standard normal
distribution. Argue that Yn converges in distribution to the standard normal distribution. ◦
Exercise 3.6. Let µ be a probability measure on (R, B) with cumulative distribution function
F . Let q : (0, 1) → R be a quantile function for µ, meaning that for all 0 < p < 1, it holds
that F(q(p)−) ≤ p ≤ F(q(p)). Let µn be the probability measure on (R, B) given by putting µn(B) = (1/n) ∑_{k=1}^n 1(q(k/(n+1))∈B) for B ∈ B. Show that µn converges weakly to µ. ◦
Exercise 3.7. Let (ξn ) and (σn ) be sequences in R, where σn > 0. Let µn denote the normal
distribution with mean ξn and variance σn2 . Show that µn converges weakly if and only if ξn
and σn both converge. In the affirmative case, identify the limiting distribution. ◦
Exercise 3.8. Let (µn ) be a sequence of probability measures on (R, B) such that µn has
cumulative distribution function Fn . Let µ be some other probability measure with cumu-
lative distribution function F . Assume that F is continuous and assume that µn converges
weakly to µ. Let (xn ) be a sequence of real numbers converging to some point x. Show that
limn→∞ Fn (xn ) = F (x). ◦
Exercise 3.9. Let µn be the measure on (R, B) concentrated on {k/n | k ≥ 1} such that µn({k/n}) = (1/n)(1 − 1/n)^{k−1} for each k ∈ N. Show that µn is a probability measure and that µn converges weakly to the standard exponential distribution. ◦
Exercise 3.10. Calculate the characteristic function of the binomial distribution with suc-
cess parameter p and length n. ◦
Exercise 3.11. Calculate an explicit expression for the characteristic function of the Poisson
distribution with parameter λ. ◦
Exercise 3.12. Consider a probability space endowed with two independent variables X
and Y with distributions µ and ν, respectively, where µ has characteristic function ϕ and ν
has characteristic function φ. Show that the variable XY has characteristic function ψ given by ψ(θ) = ∫ ϕ(θy) dν(y). ◦
Exercise 3.13. Consider a probability space endowed with four independent variables X,
Y , Z and W , all standard normally distributed. Calculate the characteristic function of
XY − ZW and argue that XY − ZW follows a Laplace distribution. ◦
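The key computation behind Exercise 3.13 is E e^{itXY} = E e^{−t²Y²/2} = 1/√(1 + t²) (conditioning on Y and using the normal characteristic function), so that XY − ZW, being the sum of two independent copies of ±XY, has characteristic function 1/(1 + t²), the standard Laplace characteristic function. A numerical sketch (assuming stdlib-only Riemann integration, not part of the text):

```python
import math

def cf_XY(t, h=0.001, L=10.0):
    """Characteristic function of XY for independent standard normals X, Y:
    E e^{itXY} = E e^{-t^2 Y^2 / 2}, computed by a Riemann sum over the
    standard normal density (the result is real by symmetry)."""
    total, y = 0.0, -L
    norm = 1 / math.sqrt(2 * math.pi)
    while y < L:
        total += math.exp(-t * t * y * y / 2) * math.exp(-y * y / 2) * norm * h
        y += h
    return total

t = 1.3
# Closed form 1/sqrt(1 + t^2); since XY and -ZW are i.i.d., XY - ZW has
# characteristic function 1/(1 + t^2), the standard Laplace c.f.
print(abs(cf_XY(t) - 1 / math.sqrt(1 + t * t)))
```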
Exercise 3.14. Assume (Xn) is a sequence of independent random variables. Assume that there exists β > 0 such that |Xn| ≤ β for all n ≥ 1. Define Sn = ∑_{k=1}^n Xk. Prove that if ∑_{n=1}^∞ V Xn is infinite, then (Sn − ESn)/√(V Sn) converges in distribution to the standard normal distribution. ◦
Exercise 3.15. Let (Xn) be a sequence of independent random variables. Let ε > 0. Show that if ∑_{k=1}^n Xk converges almost surely as n tends to infinity, then the following three series are convergent: ∑_{n=1}^∞ P(|Xn| > ε), ∑_{n=1}^∞ E Xn 1_(|Xn| ≤ ε) and ∑_{n=1}^∞ V Xn 1_(|Xn| ≤ ε). ◦
Exercise 3.16. Consider a measurable space (Ω, F) endowed with a sequence (Xn ) of
random variables as well as a family of probability measures (Pλ )λ>0 such that under Pλ ,
(Xn ) consists of independent and identically distributed variables such that Xn follows a
Poisson distribution with mean λ for some λ > 0. Let X̄n = (1/n) ∑_{k=1}^n Xk. Find a mapping f : (0, ∞) → (0, ∞) such that for each λ > 0, it holds that under Pλ, f(X̄n) is asymptotically normal with mean f(λ) and variance 1/n. ◦
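By the delta method, f(X̄n) is asymptotically normal with variance f′(λ)²λ/n, so the exercise amounts to solving f′(λ)²λ = 1; a natural candidate (stated here as a worked answer, which the exercise asks the reader to derive) is f(x) = 2√x. A quick numerical verification:

```python
import math

def f(x):
    """Candidate variance-stabilising transform for the Poisson mean: f(x) = 2*sqrt(x)."""
    return 2 * math.sqrt(x)

def num_derivative(g, x, h=1e-6):
    """Central difference approximation of g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)

# Delta method: f(X_bar_n) has asymptotic variance f'(lambda)^2 * lambda / n,
# so we need f'(lambda)^2 * lambda = 1 for every lambda > 0.
for lam in [0.5, 1.0, 4.0, 9.0]:
    print(num_derivative(f, lam) ** 2 * lam)
```

Each printed value is 1 up to numerical error, confirming the variance-stabilising property.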
Exercise 3.17. Let (Xn) be a sequence of independent random variables such that Xn has mean ξ and unit variance. Put Sn = ∑_{k=1}^n Xk. Let α > 0. Show that (Sn − nξ)/n^α converges in probability to 0 if and only if α > 1/2. ◦
Exercise 3.18. Let θ > 0 and let (Xn) be a sequence of independent and identically distributed random variables such that Xn follows a normal distribution with mean θ and variance θ. The maximum likelihood estimator for estimation of θ based on n samples is θ̂n = −1/2 + (1/4 + (1/n) ∑_{k=1}^n Xk²)^{1/2}. Show that θ̂n is asymptotically normal with mean θ and variance (1/n) · (4θ³ + 2θ²)/(4θ² + 4θ + 1). ◦
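The asymptotic variance comes from the delta method applied to g(t) = −1/2 + √(1/4 + t) at t = E X² = θ + θ², where g′(t) = 1/(2θ + 1) and V(X²) = 4θ³ + 2θ². A sketch verifying the arithmetic with exact rational numbers (not part of the exercise):

```python
from fractions import Fraction

def var_X2(theta):
    """V(X^2) for X ~ N(theta, theta): E X^4 - (E X^2)^2 = 4*theta^3 + 2*theta^2."""
    EX2 = theta + theta**2
    EX4 = theta**4 + 6 * theta**3 + 3 * theta**2  # normal fourth moment
    return EX4 - EX2**2

def asymptotic_variance(theta):
    """Delta method factor for g(t) = -1/2 + sqrt(1/4 + t) at t = theta + theta^2,
    where g'(t) = 1/(2*sqrt(1/4 + t)) = 1/(2*theta + 1) exactly,
    since 1/4 + theta + theta^2 = (theta + 1/2)^2."""
    gprime = Fraction(1, 2 * theta + 1)
    return gprime**2 * var_X2(theta)

for theta in [1, 2, 5]:
    lhs = asymptotic_variance(theta)
    rhs = Fraction(4 * theta**3 + 2 * theta**2, 4 * theta**2 + 4 * theta + 1)
    print(lhs == rhs)
```

All comparisons are exact, confirming the stated formula (which also simplifies to 2θ²/(2θ + 1)).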
Exercise 3.19. Let µ > 0 and let (Xn) be a sequence of independent and identically distributed random variables such that Xn follows an exponential distribution with mean 1/µ. Let X̄n = (1/n) ∑_{k=1}^n Xk. Show that X̄n and X̄n⁻¹ are asymptotically normal and identify the asymptotic parameters. Define Yn = (1/log n) ∑_{k=1}^n Xk/k. Show that Yn converges in probability to 1/µ. ◦
Exercise 3.20. Let θ > 0 and let (Xn) be a sequence of independent and identically distributed random variables such that Xn follows a uniform distribution on [0, θ]. Define X̄n = (1/n) ∑_{k=1}^n Xk. Show that X̄n is asymptotically normal with mean θ/2 and variance (1/n) · θ²/12. Next, put Yn = (4/n²) ∑_{k=1}^n kXk. Demonstrate that Yn converges in probability to θ. Use Lyapounov's central limit theorem to show that (Yn − θ)/√(4θ²/(9n)) converges to a standard normal distribution. ◦
Exercise 3.21. Let (Xn, Yn) be a sequence of independent and identically distributed variables such that for each n ≥ 1, Xn and Yn are independent, where Xn follows a standard normal distribution and Yn follows an exponential distribution with mean α for some α > 0. Define Sn = (1/n) ∑_{k=1}^n (Xk + Yk) and Tn = (1/n) ∑_{k=1}^n Xk². Show that (Sn, Tn) is asymptotically normally distributed and identify the asymptotic parameters. Show that Sn/√Tn is asymptotically normally distributed and identify the asymptotic parameters. ◦
Exercise 3.22. Let (Xn) be a sequence of independent and identically distributed variables such that Xn follows a normal distribution with mean µ and variance σ² for some σ > 0. Assume that µ ≠ 0. Define X̄n = (1/n) ∑_{k=1}^n Xk and Sn² = (1/n) ∑_{k=1}^n (Xk − X̄n)². Show that Sn/X̄n is asymptotically normally distributed and identify the asymptotic parameters. ◦
Chapter 4

Signed measures and conditioning

In this chapter, we will consider two important but also very distinct topics: decompositions of signed measures and conditional expectations. The topics are only related by virtue of the fact that we will use results from the first section to prove the existence of the conditional expectations to be defined in the following section. In the first section, the framework is a measurable space that will be equipped with a so–called signed measure. In the rest of the chapter, the setting will be a probability space endowed with a random variable.
4.1 Decomposition of signed measures

Throughout this section, we let (Ω, F) be a measurable space. Recall that µ : F → [0, ∞] is a measure on (Ω, F) if µ(∅) = 0 and for all sequences F1, F2, . . . of disjoint sets in F it holds that
µ(⋃_{n=1}^∞ Fn) = ∑_{n=1}^∞ µ(Fn).
If µ(Ω) < ∞ we say that µ is a finite measure. However, in the context of this section, we shall most often use the name bounded, positive measure. A natural generalisation of a bounded, positive measure is to allow negative values. Hence we consider the following definition:

Definition 4.1.1. A map ν : F → R is a bounded, signed measure on (Ω, F) if
(1) sup{|ν(F)| : F ∈ F} < ∞,
(2) ν(⋃_{n=1}^∞ Fn) = ∑_{n=1}^∞ ν(Fn) for all sequences F1, F2, . . . of disjoint sets in F.

Note that condition (2) is similar to the σ–additivity condition for positive measures. Condition (1) ensures that ν is bounded.
A bounded, signed measure has further properties that resemble properties of positive mea-
sures:
Theorem 4.1.2. Assume that ν is a bounded, signed measure on (Ω, F). Then
(1) ν(∅) = 0.
(2) ν(⋃_{n=1}^N Fn) = ∑_{n=1}^N ν(Fn) for all N ∈ N and all disjoint F1, . . . , FN ∈ F.
(3) If F1 ⊆ F2 ⊆ · · · with F = ⋃_{n=1}^∞ Fn, then ν(Fn) → ν(F).
Proof. To prove (1), let F1 = F2 = · · · = ∅ in the σ–additivity condition. Then we can utilize the simple fact ∅ = ⋃_{n=1}^∞ ∅, such that
ν(∅) = ν(⋃_{n=1}^∞ ∅) = ∑_{n=1}^∞ ν(∅),
which is only possible if ν(∅) = 0.
Considering the second result, let F_{N+1} = F_{N+2} = · · · = ∅ and apply the σ–additivity again, such that
ν(⋃_{n=1}^N Fn) = ν(⋃_{n=1}^N Fn ∪ ⋃_{n=N+1}^∞ ∅) = ∑_{n=1}^N ν(Fn) + ∑_{n=N+1}^∞ 0 = ∑_{n=1}^N ν(Fn).
For the third result, define G1 = F1 and Gn = Fn \ F_{n−1} for n ≥ 2, such that G1, G2, . . . are disjoint with ⋃_{n=1}^N Gn = FN and ⋃_{n=1}^∞ Gn = F. Then
ν(FN) = ν(⋃_{n=1}^N Gn) = ∑_{n=1}^N ν(Gn) → ∑_{n=1}^∞ ν(Gn) = ν(F) as N → ∞.
From the definition of a bounded, signed measure and Theorem 4.1.2 we almost immediately
see that bounded, signed measures with non–negative values are in fact bounded, positive
measures.
Theorem 4.1.3. Assume that ν is a bounded, signed measure on (Ω, F). If ν only has values
in [0, ∞), then ν is a bounded, positive measure.
Proof. That ν is a measure in the classical sense follows since it satisfies the σ–additivity
condition, and we furthermore have ν(∅) = 0 according to (1) in Theorem 4.1.2. That
ν(Ω) < ∞ is obviously a consequence of (1) in Definition 4.1.1.
Example 4.1.4. Let Ω = {1, 2, 3, 4} and assume that ν is a bounded, signed measure on Ω given by
ν({1}) = 2, ν({2}) = −1, ν({3}) = 4, ν({4}) = −2.
Then e.g.
ν({1, 2}) = 1, ν({3, 4}) = 2 and ν(Ω) = 3,
so we see that although {3} ⊊ Ω, it is possible that ν({3}) > ν(Ω). Hence condition (1) in the definition is indeed meaningful: only demanding that ν(Ω) < ∞, as for positive measures, would not ensure that ν is bounded on all sets, and in particular ν(Ω) is not necessarily an upper bound for ν. ◦
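The arithmetic of Example 4.1.4 can be sketched directly (a toy illustration, not part of the text): on a finite set, a signed measure is determined by its values on the atoms, and ν(F) is the sum of the atom values over F.

```python
# The signed measure from Example 4.1.4, represented by its values on atoms.
nu_atoms = {1: 2, 2: -1, 3: 4, 4: -2}

def nu(F):
    """nu(F) = sum of atom values over F (sigma-additivity on a finite space)."""
    return sum(nu_atoms[w] for w in F)

print(nu({1, 2}))                   # 1
print(nu({3, 4}))                   # 2
print(nu({1, 2, 3, 4}))             # 3
print(nu({3}) > nu({1, 2, 3, 4}))   # True: a strict subset with larger measure
```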
If ν is a bounded, signed measure and F1, F2 ∈ F with F1 ⊆ F2, then it need not hold that ν(F1) ≤ ν(F2): by finite additivity we have ν(F2) = ν(F1) + ν(F2 \ F1), but ν(F2 \ F1) ≥ 0 and ν(F2 \ F1) < 0 are both possible.
Recall from classical measure theory that new positive measures can be constructed by in-
tegrating non–negative functions with respect to other positive measures. Similarly, we can
construct a bounded, signed measure by integrating an integrable function with respect to a
bounded, positive measure.
Theorem 4.1.5. Let µ be a bounded, positive measure on (Ω, F) and let f : (Ω, F) → (R, B) be a µ-integrable function, i.e. ∫ |f| dµ < ∞. Then
ν(F) = ∫_F f dµ (F ∈ F) (4.1)
defines a bounded, signed measure on (Ω, F). Furthermore it holds that ν is a bounded, positive measure if and only if f ≥ 0 µ-a.e. (almost everywhere).
Proof. For any F ∈ F we have |ν(F)| = |∫_F f dµ| ≤ ∫ |f| dµ < ∞, which gives (1). To obtain that (2) is satisfied, let F1, F2, . . . ∈ F be disjoint and define F = ⋃_{n=1}^∞ Fn. Observe that |1_{⋃_{n=1}^N Fn} f| ≤ |f|, so by dominated convergence
ν(F) = ∫ 1_F f dµ = lim_{N→∞} ∫ 1_{⋃_{n=1}^N Fn} f dµ = lim_{N→∞} ∑_{n=1}^N ν(Fn) = ∑_{n=1}^∞ ν(Fn).
The last statement follows from Theorem 4.1.3, since f ≥ 0 µ–a.e. implies that ν(F ) ≥ 0 for
all F ∈ F.
In the following definition we introduce two possible relations between a signed measure
and a positive measure. A main result in this chapter will be that the two definitions are
equivalent.
Definition 4.1.6. Assume that µ is a bounded, positive measure, and that ν is a bounded, signed measure on (Ω, F).
(1) ν is absolutely continuous with respect to µ (we write ν ≪ µ) if for all F ∈ F it holds that µ(F) = 0 implies ν(F) = 0.
(2) ν has density with respect to µ if there exists a µ-integrable function f (the density), such that (4.1) holds. If ν has density with respect to µ we write ν = f · µ and f = dν/dµ; f is called the Radon-Nikodym derivative of ν with respect to µ.
Lemma 4.1.7. Assume that µ is a bounded, positive measure on (Ω, F) and that ν is a bounded, signed measure on (Ω, F). If ν = f · µ, then ν ≪ µ.
Definition 4.1.8. Assume that ν, ν1, and ν2 are bounded, signed measures on (Ω, F).
(1) ν is concentrated on F ∈ F if ν(G) = 0 for all G ∈ F with G ⊆ Fᶜ.
(2) ν1 and ν2 are singular (we write ν1 ⊥ ν2), if there exist disjoint sets F1, F2 ∈ F such that ν1 is concentrated on F1 and ν2 is concentrated on F2.
Example 4.1.9. Let µ be a bounded, positive measure on (Ω, F), and assume that f is µ-integrable. Define the bounded, signed measure ν by ν = f · µ. Then ν is concentrated on (f ≠ 0): take G ⊆ (f ≠ 0)ᶜ, or equivalently G ⊆ (f = 0). Then 1_G f = 0, so ν(G) = ∫ 1_G f dµ = ∫ 0 dµ = 0, and we have the result.
Now assume that both f1 and f2 are µ-integrable and define ν1 = f1 · µ and ν2 = f2 · µ. Then it holds that ν1 ⊥ ν2 if (f1 ≠ 0) ∩ (f2 ≠ 0) = ∅. In fact, the result would even be true if we only have µ((f1 ≠ 0) ∩ (f2 ≠ 0)) = 0 (why?). ◦
Lemma 4.1.10. Let ν be a bounded, signed measure and let µ be a bounded, positive measure on (Ω, F). If ν is concentrated on F ∈ F, then ν is concentrated on every G ∈ F with G ⊇ F. Furthermore, µ is concentrated on F if and only if µ(Fᶜ) = 0.

Proof. To show the first statement, assume that ν is concentrated on F ∈ F, and let G ∈ F satisfy G ⊇ F. For any set G′ ⊆ Gᶜ we have that G′ ⊆ Fᶜ, so by the definition we have ν(G′) = 0.
For the second result, we only need to show that if µ(Fᶜ) = 0, then µ is concentrated on F. So assume that G ⊆ Fᶜ. Then, since µ is assumed to be a positive measure, we have
0 ≤ µ(G) ≤ µ(Fᶜ) = 0,
so µ(G) = 0.
The following theorem is a deep result from classical measure theory, stating that any bounded, signed measure can be constructed as the difference between two bounded, positive measures.

Theorem (Jordan–Hahn decomposition). Let ν be a bounded, signed measure on (Ω, F). Then there exist bounded, positive measures ν⁺ and ν⁻ on (Ω, F) with ν⁺ ⊥ ν⁻ such that ν = ν⁺ − ν⁻.

Proof. The existence: Define λ = inf{ν(F) : F ∈ F}. Then −∞ < λ ≤ 0, and for all n ∈ N there exists Fn ∈ F with
ν(Fn) ≤ λ + 1/2ⁿ.
We first show that with G = (Fn eventually) = ⋃_{n=1}^∞ ⋂_{k=n}^∞ Fk it holds that
ν(G) = λ. (4.2)
Note that
⋂_{k=n}^∞ Fk ↑ ⋃_{n=1}^∞ ⋂_{k=n}^∞ Fk = G as n → ∞.
Similarly we have ⋂_{k=n}^N Fk ↓ ⋂_{k=n}^∞ Fk (as N → ∞), so
ν(G) = lim_{n→∞} lim_{N→∞} ν(⋂_{k=n}^N Fk).
So we have that ν(G) = λ, if we can show that
ν(⋂_{k=n}^N Fk) ≤ λ + ∑_{k=n}^N 1/2ᵏ. (4.3)
This is shown by induction for all N ≥ n. If N = n the result is trivial from the choice of Fn:
ν(⋂_{k=n}^N Fk) = ν(Fn) ≤ λ + 1/2ⁿ = λ + ∑_{k=n}^N 1/2ᵏ.
For the induction step, note that finite additivity yields ν(A ∩ B) = ν(A) + ν(B) − ν(A ∪ B), so
ν(⋂_{k=n}^N Fk) = ν(⋂_{k=n}^{N−1} Fk) + ν(FN) − ν((⋂_{k=n}^{N−1} Fk) ∪ FN)
≤ (λ + ∑_{k=n}^{N−1} 1/2ᵏ) + (λ + 1/2^N) − λ = λ + ∑_{k=n}^N 1/2ᵏ.
In the inequality we have used (4.3) for N − 1, the definition of FN, and that ν(F) ≥ λ for all F ∈ F.
Now define ν⁺(F) = ν(F ∩ Gᶜ) and ν⁻(F) = −ν(F ∩ G) for F ∈ F. Obviously e.g.
sup{|ν⁻(F)| : F ∈ F} ≤ sup{|ν(F)| : F ∈ F} < ∞
and
ν⁻(⋃_{n=1}^∞ Fn) = −ν((⋃_{n=1}^∞ Fn) ∩ G) = −ν(⋃_{n=1}^∞ (Fn ∩ G)) = −∑_{n=1}^∞ ν(Fn ∩ G) = ∑_{n=1}^∞ ν⁻(Fn)
for F1, F2, . . . ∈ F disjoint sets, so ν⁺ and ν⁻ are bounded, signed measures. It is easily seen (since G and Gᶜ are disjoint) that ν = ν⁺ − ν⁻. We furthermore have that ν⁻ is concentrated on G, since for F ⊆ Gᶜ,
ν⁻(F) = −ν(F ∩ G) = −ν(∅) = 0.
Similarly ν⁺ is concentrated on Gᶜ, so we must have ν⁻ ⊥ ν⁺.
The existence part of the proof can now be completed by showing that ν⁺ ≥ 0 and ν⁻ ≥ 0. For F ∈ F we have F ∩ G = G \ (Fᶜ ∩ G), so
ν⁻(F) = −ν(F ∩ G) = −(ν(G) − ν(Fᶜ ∩ G)) = −λ + ν(Fᶜ ∩ G) ≥ 0
and similarly
ν⁺(F) = ν(F ∩ Gᶜ) = ν((F ∩ Gᶜ) ∪ G) − ν(G) = ν((F ∩ Gᶜ) ∪ G) − λ ≥ 0
for all F ∈ F.
Example. Let µ be a bounded, positive measure and let f be µ-integrable. For ν = f · µ, the Jordan–Hahn decomposition is given by
ν⁺ = f⁺ · µ, ν⁻ = f⁻ · µ,
where f⁺ = f ∨ 0 and f⁻ = −(f ∧ 0) denote the positive and the negative part of f, respectively. The argument is by inspection: It is clear that ν = ν⁺ − ν⁻, ν⁺ ≥ 0, ν⁻ ≥ 0 and moreover ν⁺ ⊥ ν⁻ since ν⁺ is concentrated on (f ≥ 0) and ν⁻ is concentrated on (f < 0). ◦
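The decomposition via positive and negative parts of the density can be sketched on a finite space (a toy illustration with hypothetical numbers, not from the text): take µ uniform on {0, . . . , 4} and a sign-changing density f, and check ν = ν⁺ − ν⁻ with ν⁺, ν⁻ carried by disjoint sets.

```python
from fractions import Fraction

omega = range(5)
mu = {w: Fraction(1, 5) for w in omega}   # uniform reference measure
f = {0: 3, 1: -1, 2: 0, 3: -4, 4: 2}      # sign-changing density

def integrate(g, measure, F):
    """Integral of g over F with respect to a finite measure."""
    return sum(g[w] * measure[w] for w in F)

def nu(F):       return integrate(f, mu, F)
def nu_plus(F):  return integrate({w: max(f[w], 0) for w in omega}, mu, F)
def nu_minus(F): return integrate({w: max(-f[w], 0) for w in omega}, mu, F)

F = {1, 3, 4}
print(nu(F), nu_plus(F) - nu_minus(F))        # equal: nu = nu+ - nu-
# nu+ lives on (f >= 0) = {0, 2, 4}, nu- on (f < 0) = {1, 3}: disjoint carriers.
print(nu_plus({1, 3}), nu_minus({0, 2, 4}))   # both 0
```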
Theorem (Lebesgue decomposition). Let µ be a bounded, positive measure and let ν be a bounded, signed measure on (Ω, F). Then there exist a µ-integrable function f and a bounded, signed measure νs with νs ⊥ µ such that
ν = f · µ + νs.
The decomposition is unique in the sense that if ν = f̃ · µ + ν̃s is another such decomposition, then f = f̃ µ-a.e. and ν̃s = νs.
Proof. We begin with the uniqueness part of the theorem: Assume that
ν = f · µ + νs = f̃ · µ + ν̃s,
where µ ⊥ νs and µ ⊥ ν̃s. Choose F0, F̃0 ∈ F such that νs is concentrated on F0, µ is concentrated on F0ᶜ, ν̃s is concentrated on F̃0 and µ is concentrated on F̃0ᶜ. Define G0 = F0 ∪ F̃0. According to Lemma 4.1.10 we only need to show that µ(G0) = 0 in order to conclude that µ is concentrated on G0ᶜ, since µ ≥ 0. This is true since µ(G0) ≤ µ(F0) + µ(F̃0) = 0. Both νs and ν̃s are concentrated on G0, so for all F ∈ F,
ν̃s(F) − νs(F) = ν̃s(F ∩ G0) − νs(F ∩ G0)
= (ν(F ∩ G0) − (f̃ · µ)(F ∩ G0)) − (ν(F ∩ G0) − (f · µ)(F ∩ G0))
= (ν(F ∩ G0) − 0) − (ν(F ∩ G0) − 0) = 0,
since µ(F ∩ G0) = 0. Hence ν̃s = νs, and then f · µ = f̃ · µ, which implies f = f̃ µ-a.e.
To prove existence, it suffices to consider the case ν ≥ 0. For a general ν we can find the Jordan–Hahn decomposition, ν = ν⁺ − ν⁻, and then apply the Lebesgue decomposition to ν⁺ and ν⁻ separately:
ν⁺ = f · µ + νs and ν⁻ = g · µ + κs,
where there exist F0 and F̃0 such that νs is concentrated on F0, µ is concentrated on F0ᶜ, κs is concentrated on F̃0, and µ is concentrated on F̃0ᶜ. Defining G0 = F0 ∪ F̃0 we can obtain, similarly to the argument above, that νs − κs is concentrated on G0 while µ is concentrated on G0ᶜ, so that ν = (f − g) · µ + (νs − κs) is a decomposition of the desired form.
So assume that ν ≥ 0. Let L(µ)⁺ denote the set of non–negative, µ–integrable functions and define
H = {g ∈ L(µ)⁺ | ν(F) ≥ ∫_F g dµ for all F ∈ F}.
Recall that ν ≥ 0, such that e.g. 0 ∈ H. Define furthermore
α = sup{∫ g dµ | g ∈ H}.
Since ∫_Ω g dµ ≤ ν(Ω) for all g ∈ H, we must have
0 ≤ α ≤ ν(Ω) < ∞.
We will show that there exists f ∈ H with ∫ f dµ = α. Given such an f, define the bounded, signed measure νs by
νs = ν − f · µ,
so that, since f ∈ H, νs(F) = ν(F) − ∫_F f dµ ≥ 0 for all F ∈ F.
What remains in the proof is showing that νs ⊥ µ. For all n ∈ N define the bounded, signed measure (see e.g. Exercise 4.1) λn by
λn = νs − (1/n)µ.
Let λn = λn⁺ − λn⁻ be the Jordan–Hahn decomposition of λn. Then we can find Fn ∈ F such that
λn⁻ is concentrated on Fn and λn⁺ is concentrated on Fnᶜ.
Define f̃n = f + (1/n)1_{Fnᶜ}. For F ∈ F we have λn(F ∩ Fnᶜ) = λn⁺(F ∩ Fnᶜ) ≥ 0, i.e. νs(F ∩ Fnᶜ) ≥ (1/n)µ(F ∩ Fnᶜ), and therefore
ν(F) = ∫_F f dµ + νs(F) ≥ ∫_F f dµ + (1/n)µ(F ∩ Fnᶜ) = ∫_F f̃n dµ,
so f̃n ∈ H. Hence
α ≥ ∫ f̃n dµ = ∫ f dµ + (1/n)µ(Fnᶜ) = α + (1/n)µ(Fnᶜ).
This implies that µ(Fnᶜ) = 0, leading to
µ(⋃_{n=1}^∞ Fnᶜ) = 0.
Define F0 = ⋂_{n=1}^∞ Fn, such that µ(F0ᶜ) = 0 and hence µ is concentrated on F0. Furthermore
0 ≤ νs(F0) ≤ νs(Fn) = (1/n)µ(Fn) + λn(Fn) = (1/n)µ(Fn) − λn⁻(Fn) ≤ (1/n)µ(Ω),
which for n → ∞ implies that νs(F0) = 0. Hence (since νs ≥ 0) νs is concentrated on F0ᶜ, and we conclude that νs ⊥ µ.
Theorem (Radon–Nikodym). Let µ be a bounded, positive measure and let ν be a bounded, signed measure on (Ω, F). Then ν ≪ µ if and only if there exists a µ-integrable function f such that ν = f · µ. In that case f is uniquely determined µ-a.e., and if ν ≥ 0 then f ≥ 0 µ-a.e.

Proof. That f is uniquely determined follows from the uniqueness in the Lebesgue decomposition. Also, ν ≥ 0 implies f ≥ 0 µ–a.e.
In the “if and only if” part it only remains to show that ν ≪ µ implies the existence of an F–measurable and µ–integrable function f with ν = f · µ. So assume that ν ≪ µ and consider the Lebesgue decomposition of ν,
ν = f · µ + νs.
Choose F0 such that νs is concentrated on F0 and µ is concentrated on F0ᶜ. For F ∈ F we then obtain that
νs(F) = νs(F ∩ F0) = ν(F ∩ F0) − (f · µ)(F ∩ F0) = 0,
since µ(F ∩ F0) = 0 and since ν ≪ µ implies that ν(F ∩ F0) = 0. Hence νs = 0 and the claim ν = f · µ follows.
4.2 Conditional Expectations given a σ-algebra

In this section we will return to considering a probability space (Ω, F, P) and real random variables defined on this space. We shall see how the existence of conditional expectations can be shown using a Radon-Nikodym derivative. In the course MI the existence is shown from L²-theory using projections on the subspace L²(Ω, D, P) of L²(Ω, F, P), when D ⊆ F is a sub σ-algebra.
Let X be a real random variable defined on (Ω, F, P ) with E|X| < ∞. A conditional
expectation of X (given something) can be interpreted as a guess on the value of X(ω) based
on varying amounts of information about which ω ∈ Ω has been drawn. If we know nothing
about ω, then it is not possible to say very much about the value of X(ω). Perhaps the best guess we can come up with is suggesting the value E(X) = ∫ X dP!
Now let D1 , . . . , Dn be a system of disjoint sets in F with ∪ni=1 Di = Ω, and assume that for
a given ω ∈ Ω, we know whether ω ∈ Di for each i = 1, . . . , n. Then we actually have some
information about the ω that has been drawn, and an educated guess on the value of X(ω)
may not be as simple as E(X) any more. Instead our guessing strategy will be
guess on X(ω) = (1/P(Di)) ∫_{Di} X dP if ω ∈ Di. (4.4)
We are still using an integral of X, but we only integrate over the set Di , where we know that
ω is an element. It may not be entirely clear, why this is a good strategy for our guess (that
will probably depend on the definition of a good guess), but at least it seems reasonable that
we give the same guess on X(ω) for all ω ∈ Di .
Example 4.2.1. Suppose Ω = {a, b, c, d} and that the probability measure P is given by
P({a}) = P({b}) = P({c}) = P({d}) = 1/4,
and furthermore that X : Ω → R is defined by
Similarly, if we know ω ∈ Dc = {c, d}, then the best guess would be 2.5. Given the knowledge
of whether ω ∈ D or ω ∈ Dc we can write the guess as a function of ω, namely
Note that the collection of all unions of sets from {D1, . . . , Dn} is a sub σ–algebra of F. The stability requirements for σ–algebras fit very well with the knowledge of whether ω ∈ Di for each i: if we know whether ω ∈ Di, then we also know whether ω ∈ Diᶜ, and we know whether ω ∈ ⋃Ai if we know whether ω ∈ Ai for all i.
The concept of conditional expectations takes the guessing strategy to a general level, where
the conditioning σ-algebra D is a general sub σ–algebra of F. The result will be a random
variable (as in Example 4.2.1) which we will denote E(X|D) and call the conditional expecta-
tion of X given D. We will show in Example 4.2.5 that when D has the form {D1 , . . . , Dn }
as above, then E(X|D) is given by (4.4).
Definition 4.2.2. Let X be a real random variable defined on (Ω, F, P) with E|X| < ∞. A conditional expectation of X given D is a D-measurable real random variable, denoted E(X|D), which satisfies
(1) E|E(X|D)| < ∞,
(2) ∫_D E(X|D) dP = ∫_D X dP for all D ∈ D.
Note that one cannot in general use E(X|D) = X (even though it satisfies (1) and (2)): X
is assumed to be F-measurable but need not be D-measurable.
Theorem 4.2.3. Let U and Ũ both be conditional expectations of X given D. Then U = Ũ a.s.

Proof. Consider, e.g., D = (U > Ũ); then D ∈ D, since U and Ũ are D-measurable. We have
∫_D (U − Ũ) dP = ∫_D U dP − ∫_D Ũ dP = ∫_D X dP − ∫_D X dP = 0
according to (2) in Definition 4.2.2. But U > Ũ on D, so therefore P(D) = P(U > Ũ) = 0. Similarly, P(U < Ũ) = 0.
Theorem 4.2.4. If X is a real random variable with E|X| < ∞, then there exists a condi-
tional expectation of X given D.
Proof. Define ν(D) = ∫_D X dP for D ∈ D. Then ν is a bounded, signed measure on (Ω, D). Let P′ denote the restriction of P to D: P′ is the probability measure on (Ω, D) given by
P′(D) = P(D) for D ∈ D.
If P′(D) = 0, then ν(D) = 0, so ν ≪ P′, and by the Radon–Nikodym theorem there exists a D-measurable and P′-integrable function f with ν = f · P′. Hence for all D ∈ D,
∫_D X dP = ν(D) = ∫_D f dP′ = ∫_D f dP, (4.5)
so f satisfies the conditions in Definition 4.2.2 and is a conditional expectation of X given D.
That the last equation in (4.5) is true is just basic measure theory: the integral of a function with respect to some measure does not change if the measure is extended to a larger σ–algebra; the function is also a measurable function on the larger measurable space.
A direct argument could be first looking at indicator functions. Let D ∈ D and note that 1_D is D–measurable. Then
∫ 1_D dP′ = P′(D) = P(D) = ∫ 1_D dP.
By linearity the result then holds if Y is a linear combination of indicator functions, and finally the result is shown to be true for general D–measurable functions Y by a standard approximation argument.
Example 4.2.5. Consider a probability space (Ω, F, P) and a real random variable X defined on Ω with E|X| < ∞. Assume that D1, . . . , Dn ∈ F form a partition of Ω: Di ∩ Dj = ∅ for i ≠ j and ⋃_{i=1}^n Di = Ω. Also assume (for convenience) that P(Di) > 0 for all i = 1, . . . , n. Let D be the σ–algebra generated by the Dj–sets. Then D ∈ D if and only if D is a union of some of the Dj's. Define U as in (4.4), i.e.
U = ∑_{i=1}^n (1/P(Di)) (∫_{Di} X dP) 1_{Di};
then U is D-measurable with E|U| < ∞. Writing D ∈ D as D = ⋃_{k=1}^m D_{i_k}, we have
∫_D X dP = ∑_{k=1}^m ∫_{D_{i_k}} X dP
so
∫_D U dP = ∑_{k=1}^m ∫_{D_{i_k}} ∑_{i=1}^n (1/P(Di)) (∫_{Di} X dP) 1_{Di} dP
= ∑_{k=1}^m (1/P(D_{i_k})) (∫_{D_{i_k}} X dP) ∫_{D_{i_k}} 1_{D_{i_k}} dP
= ∑_{k=1}^m ∫_{D_{i_k}} X dP = ∫_D X dP.
Hence U satisfies the conditions in Definition 4.2.2 and is therefore a conditional expectation
of X given D. ◦
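The averaging formula (4.4) can be sketched on a small finite space. The values below are hypothetical (they are chosen for illustration, since the table of Example 4.2.1 is not reproduced here); the point is that E(X|D) is constant on each cell and satisfies property (2) of Definition 4.2.2.

```python
from fractions import Fraction

# Omega = {a, b, c, d} with uniform P, partition D1 = {a, b}, D2 = {c, d}.
P = {w: Fraction(1, 4) for w in "abcd"}
X = {"a": 1, "b": 4, "c": 2, "d": 3}          # hypothetical values
partition = [{"a", "b"}, {"c", "d"}]

def cond_exp(X, P, partition):
    """E(X|D)(w) = (1/P(Di)) * integral of X over Di, for w in Di -- formula (4.4)."""
    U = {}
    for Di in partition:
        PDi = sum(P[w] for w in Di)
        value = sum(X[w] * P[w] for w in Di) / PDi
        for w in Di:
            U[w] = value
    return U

U = cond_exp(X, P, partition)
print(U["a"], U["c"])  # 5/2 and 5/2
# Defining property (2) of Definition 4.2.2, checked on D = D1:
print(sum(U[w] * P[w] for w in {"a", "b"}) == sum(X[w] * P[w] for w in {"a", "b"}))
```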
We shall now show a series of results concerning conditional expectations. The results and the
proofs are well-known from the course MI. In Theorem 4.2.6, X, Y and Xn are real random
variables, all of which are integrable.
Theorem 4.2.6. Let D and E be sub σ–algebras of F and let α, β, c ∈ R.
(1) If X = c a.s., then E(X|D) = c a.s.
(2) E(αX + βY |D) = αE(X|D) + βE(Y |D) a.s.
(3) If X ≥ 0 a.s. then E(X|D) ≥ 0 a.s. If Y ≥ X a.s. then E(Y |D) ≥ E(X|D) a.s.
(4) If D ⊆ E, then E(X|D) = E[E(X|E)|D] = E[E(X|D)|E] a.s.
(5) If X is independent of D, then
E(X|D) = EX a.s.
(6) If X is D–measurable, then
E(X|D) = X a.s.
(7) If it holds for all n ∈ N that Xn ≥ 0 a.s. and Xn+1 ≥ Xn a.s. with lim Xn = X a.s., then
lim_{n→∞} E(Xn|D) = E(X|D) a.s.
(8) If X is D–measurable with E|XY| < ∞, then
E(XY |D) = XE(Y |D) a.s.
(9) If φ : R → R is convex with E|φ(X)| < ∞, then
φ(E(X|D)) ≤ E(φ(X)|D) a.s.
Proof. (1) We show that the constant variable U given by U(ω) = c meets the conditions from Definition 4.2.2. Firstly, it is D–measurable, since for B ∈ B we have
U⁻¹(B) = Ω if c ∈ B and U⁻¹(B) = ∅ if c ∉ B,
which is a set in D in either case. Furthermore E|U| = |c| < ∞ and obviously
∫_D U dP = ∫_D c dP = ∫_D X dP.
(2) αE(X|D) + βE(Y |D) is D-measurable and integrable, so all we need to show is (see part (2) of Definition 4.2.2) that
∫_D (αE(X|D) + βE(Y |D)) dP = ∫_D (αX + βY) dP
for all D ∈ D. But here, the left hand side is
α ∫_D E(X|D) dP + β ∫_D E(Y |D) dP,
which is seen to equal the right hand side when we use Definition 4.2.2 on both terms.
(3) For the first claim define D = (E(X|D) < 0) = E(X|D)⁻¹((−∞, 0)) and note that D ∈ D since E(X|D) is D–measurable. Then
∫_D E(X|D) dP = ∫_D X dP ≥ 0,
since X ≥ 0 a.s. But the fact that E(X|D) < 0 on D makes ∫_D E(X|D) dP < 0 if P(D) > 0. Hence P(D) = 0 and E(X|D) ≥ 0 a.s.
For the second claim, just use the first result on Y − X and apply (2).
(4) Firstly, we show that E(X|D) = E[E(X|E)|D] a.s. By definition we have that E[E(X|E)|D] is D–measurable with finite expectation, and for D ∈ D,
∫_D E[E(X|E)|D] dP = ∫_D E(X|E) dP = ∫_D X dP.
Hence we have the result from Definition 4.2.2. In the first equality, it is used that E[E(X|E)|D] is a conditional expectation of E(X|E) given D. In the second equality Definition 4.2.2 is applied to E(X|E), using that D ∈ D ⊆ E.
Secondly, we prove that E(X|D) = E[E(X|D)|E] a.s. by showing that E(X|D) is a conditional expectation of E(X|D) given E. But that follows directly from (6), since E(X|D) is E–measurable.
(5) As in (1), the constant map ω ↦ EX is D–measurable and has finite expectation, so it remains to show that for D ∈ D,
∫_D EX dP = ∫_D X dP.
The left hand side is EX · P(D). For the right hand side we obtain the following, using that 1_D and X are independent:
∫_D X dP = ∫ 1_D · X dP = ∫ 1_D dP · ∫ X dP = P(D) · EX.
(6) Trivial.
(7) According to (3) we have for all n ∈ N that E(Xn+1|D) ≥ E(Xn|D) a.s., so with Fn = (E(Xn+1|D) ≥ E(Xn|D)) we have P(Fn) = 1. Let F0 = (E(X1|D) ≥ 0), such that P(F0) = 1. With the definition F = ⋂_{n=0}^∞ Fn we have F ∈ D and P(F) = 1. For ω ∈ F it holds that the sequence (E(Xn|D)(ω))_{n∈N} is increasing and E(X1|D)(ω) ≥ 0. Hence for ω ∈ F the number
Y(ω) = lim_{n→∞} E(Xn|D)(ω)
is well–defined in [0, ∞]. Defining e.g. Y(ω) = 0 for ω ∈ Fᶜ makes Y a D–measurable random variable (since F is D–measurable, and Y is the point–wise limit of 1_F E(Xn|D), which are all D–measurable variables) with values in [0, ∞]. Thus the integral ∫_G Y dP makes sense for all G ∈ F.
In particular we obtain the following for D ∈ D, using monotone convergence in the third and the sixth equality:
∫_D Y dP = ∫_{D∩F} Y dP = ∫_{D∩F} lim_{n→∞} E(Xn|D) dP = lim_{n→∞} ∫_{D∩F} E(Xn|D) dP
= lim_{n→∞} ∫_D E(Xn|D) dP = lim_{n→∞} ∫_D Xn dP = ∫_D lim_{n→∞} Xn dP = ∫_D X dP.
In particular EY = EX < ∞, so Y satisfies the conditions in Definition 4.2.2 and lim_{n→∞} E(Xn|D) = Y = E(X|D) a.s.
(8) Since XE(Y |D) is obviously D–measurable, it only remains to show that E|XE(Y |D)| < ∞ and that
∫_D XE(Y |D) dP = ∫_D XY dP (4.6)
for all D ∈ D.
We now prove the result for all X ≥ 0 and Y ≥ 0. For G ∈ D and D ∈ D we have
∫_D 1_G E(Y |D) dP = ∫_{D∩G} E(Y |D) dP = ∫_{D∩G} Y dP = ∫_D 1_G Y dP,
and by linearity the same holds with 1_G replaced by any non–negative simple D–measurable variable. In particular
∫_D Xn E(Y |D) dP = ∫_D Xn Y dP (4.7)
for all n ∈ N, where
Xn = ∑_{k=1}^{n2ⁿ} ((k − 1)/2ⁿ) 1_((k−1)/2ⁿ ≤ X < k/2ⁿ),
such that X = lim_{n→∞} Xn. Furthermore the construction of (Xn)_{n∈N} makes it non–negative and increasing.
Since we have assumed that Y ≥ 0 we must have XE(Y |D) ≥ 0 a.s. Thereby the integral ∫_D XE(Y |D) dP is defined for all D ∈ D (but it may be +∞). Since the sequence (Xn E(Y |D))_{n∈N} is almost surely increasing (increasing for all ω with E(Y |D)(ω) ≥ 0) we obtain
∫_D XE(Y |D) dP = ∫_D lim_{n→∞} Xn E(Y |D) dP = lim_{n→∞} ∫_D Xn E(Y |D) dP
= lim_{n→∞} ∫_D Xn Y dP = ∫_D XY dP,
where the second and the fourth equality follow from monotone convergence, and the third
equality is a result of (4.7). From this (since E|XY | < ∞) we in particular see, that
E|XE(Y |D)| = E(XE(Y |D)) = ∫_Ω XE(Y |D) dP = ∫_Ω XY dP = E(XY) < ∞.
Hence (8) is shown in the case where X ≥ 0 and Y ≥ 0. That (8) holds in general then easily follows by splitting X and Y into their positive and negative parts, X = X⁺ − X⁻ and Y = Y⁺ − Y⁻: applying the version of (8) that deals with positive X and Y to each of the terms and multiplying out the brackets, we obtain the desired result.
(9) The full proof is given in the lecture notes of the course MI.
4.3 Conditional expectations given a random variable

In this section we will consider the special case where the conditioning σ–algebra D is generated by a random variable Y. So assume that Y : (Ω, F) → (E, E) is a random variable with values in the space E, which is not necessarily R. If D = σ(Y), i.e. the σ-algebra generated by Y, we write
E(X|Y)
rather than E(X|D), and the resulting random variable is referred to as the conditional expectation of X given Y. Recall that D ∈ σ(Y) is always of the form D = (Y ∈ A) for some A ∈ E. Then we immediately have the following characterization of E(X|Y):
Theorem 4.3.1. Let X be a real random variable with E|X| < ∞, and assume that Y is a random variable with values in (E, E). Then the conditional expectation E(X|Y) of X given Y is characterised by being σ(Y)-measurable and satisfying E|E(X|Y)| < ∞ and
∫_{(Y∈A)} E(X|Y) dP = ∫_{(Y∈A)} X dP for all A ∈ E.
Note that if σ(Y) = σ(Ỹ), then E(X|Y) = E(X|Ỹ) a.s. If e.g. Y takes values in the real numbers and ψ : R → R is a bijective and bimeasurable map (ψ and ψ⁻¹ are both measurable), then
E(X|Y) = E(X|ψ(Y)) a.s.
The following lemma will be extremely useful in the comprehension of conditional expecta-
tions given random variables.
Lemma 4.3.2. A real random variable Z is σ(Y )-measurable if and only if there exists a
measurable map φ : (E, E) → (R, B) such that
Z = φ ◦ Y.
Proof. Let H denote the set of real random variables Z for which there exists a measurable map φ : (E, E) → (R, B) with Z = φ ∘ Y. Since every σ(Y)–measurable random variable is a pointwise limit of linear combinations of indicators 1_D with D ∈ σ(Y), the argument will be complete if we can show that H has the following properties:
(i) 1_D ∈ H for all D ∈ σ(Y)
(ii) a1 Z1 + · · · + an Zn ∈ H if Z1, . . . , Zn ∈ H and a1, . . . , an ∈ R
(iii) Z ∈ H if Zn ∈ H for all n ∈ N and Z = lim_{n→∞} Zn pointwise
(i): Assume that D ∈ σ(Y). Then there exists a set A ∈ E such that D = (Y ∈ A) (simply from the definition of σ(Y)). But then 1_D = 1_A ∘ Y, since
1_D(ω) = 1 ⇔ ω ∈ D = (Y ∈ A) ⇔ Y(ω) ∈ A ⇔ (1_A ∘ Y)(ω) = 1
for all ω ∈ Ω, so 1_D ∈ H. (ii) is obvious, since a1(φ1 ∘ Y) + · · · + an(φn ∘ Y) = (a1φ1 + · · · + anφn) ∘ Y.
(iii): Write Zn = φn ∘ Y with φn : (E, E) → (R, B) measurable, such that Z(ω) = lim_{n→∞} φn(Y(ω)) for all ω ∈ Ω. In particular the limit lim_{n→∞} φn(y) exists for all y ∈ Y(Ω) = {Y(ω) : ω ∈ Ω}. Define φ : E → R by
φ(y) = lim_{n→∞} φn(y) if the limit exists, and φ(y) = 0 otherwise;
then φ is E − B–measurable, since F = (limn φn exists) ∈ E and φ = limn (1F φn ) with each
1F φn being E − B–measurable. Furthermore note that Z(ω) = φ(Y (ω)), so Z = φ ◦ Y . Hence
(iii) is shown.
Now we return to the discussion of the σ(Y)-measurable random variable E(X|Y). By the lemma, we have that
E(X|Y) = φ ∘ Y
for some measurable map φ : (E, E) → (R, B). For y ∈ E we interpret φ(y) as a conditional expectation of X given Y = y and write E(X|Y = y) = φ(y).
Theorem. Let X be a real random variable with E|X| < ∞, let Y have values in (E, E), and let φ : (E, E) → (R, B) be measurable. Then φ(y) defines a conditional expectation of X given Y = y for all y if and only if φ is Y(P)–integrable and
∫_B φ(y) dY(P)(y) = ∫_{(Y∈B)} X dP for all B ∈ E.

Proof. Firstly, assume that φ defines a conditional expectation of X given Y = y for all y. Then we have E(X|Y) = φ ∘ Y, so
∫_E |φ(y)| dY(P)(y) = ∫_Ω |φ ∘ Y| dP = ∫_Ω |E(X|Y)| dP = E|E(X|Y)| < ∞,
and we have shown that φ is Y(P)–integrable. Above, the first equality is a result of the change–of–variable formula. Similarly we obtain for all B ∈ E
∫_B φ(y) dY(P)(y) = ∫_{(Y∈B)} E(X|Y) dP = ∫_{(Y∈B)} X dP.
Conversely, assume that φ is Y(P)–integrable with ∫_B φ(y) dY(P)(y) = ∫_{(Y∈B)} X dP for all B ∈ E.
Firstly, we note that φ ∘ Y is σ(Y)–measurable (as a result of the trivial implication in Lemma 4.3.2). Furthermore we have
∫_Ω |φ ∘ Y| dP = ∫_E |φ(y)| dY(P)(y) < ∞,
using the change–of–variable formula again (simply the argument from above backwards).
Finally, for D ∈ σ(Y) we have B ∈ E with D = (Y ∈ B), so
∫_D φ ∘ Y dP = ∫_B φ(y) dY(P)(y) = ∫_D X dP,
where we have used the assumption. This shows that φ ∘ Y is a conditional expectation of X given Y, so φ ∘ Y = E(X|Y). From that we have by definition that φ(y) is a conditional expectation of X given Y = y.
4.4 Exercises
Exercise 4.1. Assume that ν1 and ν2 are bounded, signed measures. Show that αν1 + βν2
is a bounded, signed measure as well, when α, β ∈ R are real–valued constants, using the
(obvious) definition
(αν1 + βν2 )(A) = αν1 (A) + βν2 (A) .
Note that the definition ν ≪ µ also makes sense if µ is a positive measure (not necessarily bounded). ◦
Exercise 4.2. Let τ be the counting measure on N0 = N ∪ {0} (equipped with the σ–algebra P(N0) that contains all subsets). Let µ be the Poisson distribution with parameter λ:
µ({n}) = (λⁿ/n!) e^{−λ} for n ∈ N0.
Show that µ ≪ τ. Does µ have a density f with respect to τ? In that case find f.
Now let ν be the binomial distribution with parameters (N, p). Decide whether µ ≪ ν and/or ν ≪ µ. ◦
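The natural candidate density here is f(n) = λⁿe^{−λ}/n!, the Poisson point probabilities themselves. A quick numerical sketch (not part of the exercise) confirms that summing f over a set reproduces the measure of the set, and that f integrates to 1 over N0:

```python
import math

lam = 2.5

def f(n):
    """Candidate density of the Poisson(lam) distribution w.r.t. counting measure."""
    return lam**n * math.exp(-lam) / math.factorial(n)

# mu(F) = sum over F of f: integrating the density recovers the measure.
mu_F = sum(f(n) for n in range(0, 4))       # mu({0, 1, 2, 3})
total = sum(f(n) for n in range(0, 101))    # mu(N_0), truncated; tail is negligible
print(mu_F, total)
```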
Exercise 4.3. Assume that µ is a bounded, positive measure and that ν1, ν2 ≪ µ are bounded, signed measures. Show that αν1 + βν2 ≪ µ for α, β ∈ R, and that
d(αν1 + βν2)/dµ = α dν1/dµ + β dν2/dµ µ–a.e. ◦
Exercise 4.4. Assume that π, µ are bounded, positive measures and that ν is a bounded, signed measure, such that ν ≪ π ≪ µ. Show that
dν/dµ = (dν/dπ)(dπ/dµ) µ–a.e. ◦
Exercise 4.5. Assume that µ, ν are bounded, positive measures such that ν ≪ µ and µ ≪ ν. Show that
dν/dµ = (dµ/dν)⁻¹. ◦
Exercise 4.6. Assume that ν is a bounded, signed measure and that µ is a σ–finite measure with ν ≪ µ. Show that there exists f ∈ L(µ) such that ν = f · µ (meaning ν(F) = ∫_F f dµ). ◦
In the following exercises we assume that all random variables are defined on a probability
space (Ω, F, P ).
Exercise 4.7. Let X and Y be random variables with E|X| < ∞ and E|Y | < ∞ that are
both measurable with respect to some sub σ–algebra D. Assume furthermore that
∫_D X dP = ∫_D Y dP for all D ∈ D.
Show that X = Y a.s. ◦
Exercise 4.8. Assume that X1 and X2 are independent random variables satisfying X1 ∼
exp(β) and X2 ∼ N (0, 1). Define Y = X1 + X2 and the sub σ–algebra D by D = σ(X1 ).
Show that E(Y |D) = X1 a.s. ◦
Exercise 4.9. Assume that X is a real random variable with EX 2 < ∞ and that D is some
sub σ–algebra. Let Y = E(X|D). Show that
X(P ) = Y (P ) ⇔ X = Y a.s.
Exercise 4.10. Let X and Y be random variables with EX² < ∞ and EY² < ∞. The conditional variance of X given the sub σ–algebra D is defined by
V(X|D) = E[(X − E(X|D))²|D].
Show that
V X = E(V(X|D)) + V(E(X|D)). ◦
Exercise 4.11. Let X be a real random variable with E|X| < ∞. Let D be a sub σ–algebra.
Show without referring to (9) in Theorem 4.2.6 that
Exercise 4.12. Let X be a real random variable with E|X| < ∞. Let D be a sub σ–algebra.
Show without referring to (9) in Theorem 4.2.6 that
Exercise 4.13. Let (Ω, F, P ) = ((0, 1), B, λ) (where λ is the Lebesgue measure on (0, 1)).
Define the real random variable X by
X(ω) = ω
and
D = {D ⊆ (0, 1) | D or Dc is countable} .
Then D is a sub σ–algebra of B (you can show this if you want...). Find a version of E(X|D).
◦
Exercise 4.14. Let (Ω, F, P ) = ([0, 1], B, λ), where λ is the Lebesgue measure on [0, 1].
Consider the two real valued random variables
X1(ω) = 1 − ω and X2(ω) = ω².
Show that for any given real random variable Y it holds that E(Y |X1 ) = E(Y |X2 ).
Show by giving an example that E(Y |X1 = x) and E(Y |X2 = x) may be different on a set
of x’s with positive Lebesgue measure. ◦
Exercise 4.15. Assume that X is a real random variable with E|X| < ∞ and that D is a
sub σ–algebra of F. Assume that Y is a D–measurable real random variable with E|Y | < ∞
that satisfies
E(X) = E(Y )
and
∫_D Y dP = ∫_D X dP.
Exercise 4.16. Let X = (X1, X2, . . .) be a stochastic process, and assume that Y and Z are real random variables, such that (Z, Y) is independent of X. Assume that Y has finite expectation. Show that E(Y |Z, X1, X2, . . .) = E(Y |Z) a.s. ◦

Exercise 4.17. Let (Xn) be a sequence of independent and identically distributed random variables with E|X1| < ∞, and define Sn = X1 + · · · + Xn.
(1) Show that E(X1|Sn) = E(X1|Sn, Sn+1, Sn+2, . . .) a.s.
(2) Show that (1/n) Sn = E(X1|Sn, Sn+1, Sn+2, . . .) a.s. ◦
Exercise 4.18. Assume that (X, Y) follows the two–dimensional normal distribution with mean vector (µ1, µ2) and covariance matrix
( σ11 σ12 )
( σ21 σ22 )
where σ12 = σ21. Then X ∼ N(µ1, σ11), Y ∼ N(µ2, σ22) and Cov(X, Y) = σ12.
Show that
E(X|Y ) = µ1 + β(Y − µ2 ) ,
where β = σ12 /σ22 . ◦
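The reason this works is the projection property: with β = σ12/σ22, the residual X − µ1 − β(Y − µ2) is uncorrelated with Y and hence, by joint normality, independent of Y. A minimal arithmetic sketch (exact rationals, hypothetical covariance values):

```python
from fractions import Fraction

# Hypothetical covariance parameters for illustration.
sigma11, sigma12, sigma22 = Fraction(4), Fraction(3), Fraction(2)
beta = sigma12 / sigma22

# Cov(X - mu1 - beta*(Y - mu2), Y) = Cov(X, Y) - beta * Var(Y)
cov_residual_Y = sigma12 - beta * sigma22
print(cov_residual_Y)  # 0
```

Since β is chosen precisely to zero out this covariance, the cancellation holds for any choice of σ12 and σ22 > 0.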
Chapter 5
Martingales
In this chapter we will present the classical theory of martingales. Martingales are sequences
of real random variables, where the index set N (or [0, ∞)) is regarded as a time line, and
where – conditionally on the present level – the level of the sequence at a future time point
is expected to be the same as the current level. So martingales are sequences that evolve over time without a drift in any direction. Similarly, submartingales are expected to have the same or a higher level at future time points, conditioned on the present level.
In Section 5.1 we will give an introduction to the theory based on a motivating example from
gambling theory. The basic definitions will be presented in Section 5.2 together with results
on the behaviour of martingales observed at random time points. The following Section 5.3
will mainly address the very important martingale convergence theorem, giving conditions under which
martingales and submartingales converge. In Section 5.4 we shall introduce the concept of
uniform integrability and see how this interplays with martingales. Finally, in Section 5.5
we will prove a central limit theorem for martingales. That is a result that relaxes the
independence assumption from Section 3.5.
Let (Yn )n≥1 be a sequence of independent, identically distributed random variables with
P (Yn = 1) = p   and   P (Yn = −1) = 1 − p ,
where 0 < p < 1. We will think of Yn as the result of a game where the probability of winning
is p, and where, if you bet 1 dollar and win, you receive 1 dollar, and if you lose, you lose
the 1 dollar you bet. Then
EYn = p − (1 − p) = 2p − 1,
and the game is called favourable if p > 1/2, fair if p = 1/2 and unfavourable if p < 1/2,
corresponding to whether EYn is > 0, = 0 or < 0, respectively.
If the player in each game makes a bet of 1, his (signed) winnings after n games will be
Sn = Y1 + · · · + Yn . According to the strong law of large numbers,
(1/n) Sn → 2p − 1   a.s.,
so it follows that if the game is favourable, the player is certain to win in the long run (Sn > 0
eventually, almost surely), and if the game is unfavourable, the player is certain to lose in the
long run.
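This law of large numbers statement is easy to visualise by simulation; the sketch below uses the arbitrary choice p = 0.48, an unfavourable game, so Sn /n settles near 2p − 1 = −0.04.

```python
import numpy as np

# Simulation sketch (illustrative choice p = 0.48, an unfavourable game):
# by the strong law of large numbers, S_n / n approaches 2p - 1 = -0.04 < 0.
rng = np.random.default_rng(1)
p, n = 0.48, 1_000_000

Y = rng.choice([1, -1], size=n, p=[p, 1 - p])
S = np.cumsum(Y)
print("S_n / n at n = 10^6:", S[-1] / n)         # close to 2p - 1 = -0.04
print("fraction of times with S_n > 0:", (S > 0).mean())
```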
A strategy thus allows the player to take the preceding outcomes into account when he
makes his n’th bet.
5.1 Introduction to martingale theory 135
Note that φ1 is given by X0 alone, making it constant. Further note that it is possible to
let φn = 0, corresponding to the player not making a bet, for instance because he or she has
been winning up to this point and therefore wishes to stop.
Given the strategy (φn ) the (signed) winnings in the n’th game become
Zn = Yn φn (X0 , Y1 , . . . , Yn−1 )
For instance, if p < 1/2, the conditional expected value of the capital after the n + 1'st game is
at most Xn , so the game with strategy (φn ) is at best fair. But what if one simply chooses
to focus on the development of the capital at points in time that are advantageous for the
player, and where he or she can just decide to quit?
With 0 < p < 1, infinitely many wins, Yn = 1, and, of course, infinitely many losses, Yn = −1,
will occur with probability 1. Let τk be the time of the k’th win, i.e.
τ1 = inf{n : Yn = 1}
and for k ≥ 1,
τk+1 = inf{n > τk : Yn = 1}.
Each τk can be shown to be a stopping time (see Definition 5.2.6 below) and Theorem 5.2.12
provides conditions for when ’(Xn ) is a supermartingale (fair or unfavourable)’ implies that
’(Xτk ) is a supermartingale’. The conditions of Theorem 5.2.12 are for instance met if (Xn )
is a supermartingale and we require, not unrealistically, that Xn ≥ a always, where a is some
given constant. (The player has limited credit and any bet made must, even if the player
loses, leave a capital of at least a). It is this result we phrase by stating that it is not possible
to turn an unfavourable game into a favourable one.
Even worse, if p ≤ 1/2 and we require that Xn ≥ a, it can be shown that if there is a minimum
amount that one must bet (if one chooses to play) and the player keeps playing, he or she
will eventually be ruined! (If p > 1/2 there will still be a strictly positive probability of ruin,
but it is also possible that the capital will grow beyond any bound.)
The result just stated ’only’ holds under the assumption of, for instance, all Xn ≥ a. As we
shall see, it is in fact easy to specify strategies such that Xτk ↑ ∞ for k ↑ ∞. The problem
is that such strategies may well prove costly in the short run.
A classic strategy is to double the amount you bet every game until you win and then start
all over with a bet of, e.g., 1, i.e.
φ1 (X0 ) = 1,
φn (X0 , y1 , . . . , yn−1 ) = 2 φn−1 (X0 , y1 , . . . , yn−2 )   if yn−1 = −1,
φn (X0 , y1 , . . . , yn−1 ) = 1                                 if yn−1 = 1.
If, say, τ1 = n, the player loses Σ_{k=1}^{n−1} 2^{k−1} in the n − 1 first games and wins 2^{n−1} in the n'th
game, resulting in the total winnings of
− Σ_{k=1}^{n−1} 2^{k−1} + 2^{n−1} = 1.
Thus, at the random time τk the total amount won is k and the capital is
Xτk = X0 + k.
But if p is small, one may experience long strings of consecutive losses and Xn can become
very negative.
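A short simulation sketch of the doubling strategy illustrates both phenomena: the capital at the time of the k'th win is X0 + k, while the interim capital between wins can dip far below 0 (the choices X0 = 0 and p = 0.4 are illustrative assumptions).

```python
import numpy as np

# Sketch of the doubling strategy (illustrative assumptions: X0 = 0, base bet 1,
# p = 0.4). At the time tau_k of the k'th win the capital is X0 + k, but between
# wins the capital can become very negative.
rng = np.random.default_rng(2)
p, n_games = 0.4, 10_000

capital, bet = 0, 1
capital_at_wins, running_min = [], 0
for _ in range(n_games):
    if rng.random() < p:                 # win: receive the bet, then start over
        capital += bet
        capital_at_wins.append(capital)
        bet = 1
    else:                                # loss: pay the bet, then double it
        capital -= bet
        bet *= 2
    running_min = min(running_min, capital)

print("capital at the first wins:", capital_at_wins[:5])   # 1, 2, 3, 4, 5
print("worst interim capital:", running_min)
```

Each completed losing streak followed by a win nets exactly 1, as in the computation above, but the running minimum shows the cost of long losing streaks.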
In the next sections we shall, without referring to gambling, discuss sequences (Xn ) of
random variables for which the inequalities (1.1) hold. A main result is the martingale
convergence theorem (Theorem 5.3.2). The proof presented here is due to the American
probabilist J.L. Doob.
5.2 Martingales and stopping times 137
Let (Ω, F, P ) be a probability space and let (Fn )n≥1 be a sequence of sub σ-algebras of F
which is increasing: Fn ⊆ Fn+1 for all n. We say that (Ω, F, Fn , P ) is a filtered probability
space with filtration (Fn )n≥1 . The interpretation of a filtration is that we think of n as a
point in time and Fn as consisting of the events that are decided by what happens up to and
including time n.
Now let (Xn )n≥1 be a sequence of random variables defined on (Ω, F).
Definition 5.2.1. The sequence (Xn ) is adapted to (Fn ) if Xn is Fn -measurable for all n.
Instead of writing that (Xn ) is adapted to (Fn ) we often write that (Xn , Fn ) is adapted.
Example 5.2.2. Assume that (Xn ) is a sequence of random variables defined on (Ω, F, P ),
and define for each n ∈ N the σ–algebra Fn by
Fn = σ(X1 , . . . , Xn ) .
Then (Fn ) is a filtration, and (Xn ) is by construction adapted to (Fn ). ◦
Definition 5.2.3. An adapted sequence (Xn , Fn ) of real random variables is called a martingale if for all n ∈ N it holds that E|Xn | < ∞ and
E(Xn+1 | Fn ) = Xn   a.s.
If instead E(Xn+1 | Fn ) ≥ Xn a.s. for all n, the sequence is called a submartingale.
The following lemma will be very useful and has a corollary that gives an equivalent formu-
lation of the submartingale (martingale) property.
Lemma 5.2.4. Suppose that X and Y are real random variables with E|X| < ∞ and E|Y | < ∞, and let D be a sub σ–algebra of F such that X is D–measurable. Then
E(Y | D) ≥ (=) X   a.s.   (5.2)
if and only if
∫_D Y dP ≥ (=) ∫_D X dP   for all D ∈ D .
Proof. Since both E(Y |D) and X are D–measurable, we have that E(Y | D) ≥ (=) X a.s. if and only if
∫_D E(Y |D) dP ≥ (=) ∫_D X dP
for all D ∈ D. (The "if" implication should be obvious. For "only if", consider the integral
∫_D (E(Y |D) − X) dP , where D = (E(Y |D) < X).) The left integral above equals ∫_D Y dP
because of the definition of conditional expectations, so (5.2) holds if and only if
∫_D Y dP ≥ (=) ∫_D X dP
for all D ∈ D.
Corollary 5.2.5. Assume that (Xn , Fn ) is adapted with E|Xn | < ∞ for all n ∈ N. Then
(Xn , Fn ) is a submartingale (martingale) if and only if for all n ∈ N
∫_F Xn+1 dP ≥ (=) ∫_F Xn dP
for all F ∈ Fn .
When handling martingales and submartingales it is often fruitful to study how they behave
at random time points of a special type called stopping times.
Definition 5.2.6. A random variable τ with values in N ∪ {∞} is a stopping time with respect to the filtration (Fn ) if
(τ = n) ∈ Fn
for all n ∈ N.
Example 5.2.7. Let (Xn ) be a sequence of real random variables, and define the filtration
(Fn ) by Fn = σ(X1 , . . . , Xn ). Assume that τ is a stopping time with respect to this filtration,
and consider the set (τ = n) that belongs to Fn . Since Fn is generated by the vector
(X1 , . . . , Xn ), there exists a set B ∈ B^n such that
(τ = n) = ((X1 , . . . , Xn ) ∈ B) .
The implication of this is that we are able to read off from the values of X1 , . . . , Xn whether
τ = n or not. So by observing the sequence (Xn ) for some time, we know if τ has occurred
or not. ◦
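The point of the example can be made concrete with a hitting time: whether τ = n is decided by the first n values of the sequence alone. The helper below is a hypothetical illustration, not from the text.

```python
# Illustrative sketch (the helper is hypothetical, not from the text): the first
# hitting time tau = inf{n : X_n >= b} is a stopping time for the natural filtration
# F_n = sigma(X_1, ..., X_n), since whether tau = n is decided by X_1, ..., X_n alone.
def hitting_time(path, b):
    """Return inf{n >= 1 : path[n-1] >= b}, or None if the level is never reached."""
    for n, x in enumerate(path, start=1):
        if x >= b:
            return n
    return None

path = [0.2, -0.5, 1.3, 2.0, 0.1]
print(hitting_time(path, 1.0))          # 3
print(hitting_time(path[:3], 1.0))      # also 3: the first three values suffice
```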
Note also that τ is a stopping time if and only if
(τ ≤ n) ∈ Fn
for all n ∈ N. Indeed, if τ is a stopping time, then
(τ ≤ n) = ∪_{k=1}^n (τ = k) ∈ Fn .
Assume conversely that (τ ≤ n) ∈ Fn for all n. Then the stopping time property follows
from
(τ = n) = (τ ≤ n) \ (τ ≤ n − 1) ,
since both sets on the right hand side belong to Fn .
If σ and τ are stopping times, then σ ∧ τ, σ ∨ τ and σ + τ are also stopping times. E.g. for
σ ∧ τ write
(σ ∧ τ ≤ n) = (σ ≤ n) ∪ (τ ≤ n) ∈ Fn .
We now define a σ-algebra Fτ , which consists of all the events that are decided by what
happens up to and including the random time τ :
Fτ = {F ∈ F | F ∩ (τ = n) ∈ Fn for all n ∈ N} .
That Fτ is indeed a σ-algebra is seen e.g. from stability under complements: for F ∈ Fτ we
have
F^c ∩ (τ = n) = (τ = n) \ (F ∩ (τ = n)) ∈ Fn
for all n ∈ N.
With τ a finite stopping time, we consider the process (Xn ) at the random time τ and define
Xτ by
Xτ (ω) = X_{τ(ω)} (ω) .
Although the definition of Fτ may not seem very obvious, Theorem 5.2.11 below shows that both
Xτ and τ are Fτ –measurable. Hence certain events at time τ are Fτ –measurable, and the
intuitive interpretation of Fτ as consisting of all events up to time τ is still reasonable.
Theorem 5.2.11. If (Xn , Fn ) is adapted and τ is a finite stopping time, then both τ and
Xτ are Fτ -measurable.
For the second statement, let B ∈ B and realize that for all n,
(Xτ ∈ B) ∩ (τ = n) = (Xn ∈ B) ∩ (τ = n) ∈ Fn ,
so (Xτ ∈ B) ∈ Fτ by the definition of Fτ .
Theorem 5.2.12 (Optional sampling, first version). Let (Xn , Fn ) be a submartingale (martingale) and assume that σ and τ are finite stopping times with σ ≤ τ . If E|Xτ | < ∞,
E|Xσ | < ∞ and
lim inf_{N→∞} ∫_{(τ>N)} XN+ dP = 0
( lim inf_{N→∞} ∫_{(τ>N)} |XN | dP = 0 ),
then
E(Xτ |Fσ ) ≥ (=) Xσ .
Proof. According to Lemma 5.2.4 it suffices to show that
∫_A Xτ dP ≥ (=) ∫_A Xσ dP
for all A ∈ Fσ . So we fix A in the following and define Dj = A ∩ (σ = j). If we can show
∫_{Dj} Xτ dP ≥ (=) ∫_{Dj} Xσ dP   (5.4)
for all j ∈ N, the result will follow by summing over j.
In the two equalities we have used dominated convergence: e.g. for the first equality we have
the integrable upper bound |Xτ |, so
∫_A Xτ dP = ∫ Σ_{j=1}^∞ 1_{Dj} Xτ dP = lim_{M→∞} ∫ Σ_{j=1}^M 1_{Dj} Xτ dP
= lim_{M→∞} Σ_{j=1}^M ∫ 1_{Dj} Xτ dP = Σ_{j=1}^∞ ∫_{Dj} Xτ dP .
Hence the argument will be complete if we can show (5.4). For this, first define for N ≥ j
I_N = ∫_{Dj∩(τ≤N)} Xτ dP + ∫_{Dj∩(τ>N)} XN dP .
We claim that
I_j ≤ (=) I_{j+1} ≤ (=) I_{j+2} ≤ (=) . . .
To see this, note that Xτ = XN on (τ = N ) and that Dj ∩ (τ ≥ N ) = Dj ∩ (τ > N − 1)
belongs to F_{N−1}, so Corollary 5.2.5 applies to the second term below. We have
I_N = ∫_{Dj∩(τ<N)} Xτ dP + ∫_{Dj∩(τ≥N)} XN dP
≥ (=) ∫_{Dj∩(τ≤N−1)} Xτ dP + ∫_{Dj∩(τ>N−1)} X_{N−1} dP = I_{N−1} ,
and thereby the sequence (I_N )_{N≥j} is shown to be increasing. For the left hand side in (5.4)
this implies that
∫_{Dj} Xτ dP = ∫_{Dj∩(τ≤N)} Xτ dP + ∫_{Dj∩(τ>N)} Xτ dP
+ ∫_{Dj∩(τ>N)} XN dP − ∫_{Dj∩(τ>N)} XN dP
= I_N + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN dP
≥ (=) I_j + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN dP
≥ I_j + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN+ dP .   (5.5)
(In the martingale case we keep the equality
∫_{Dj} Xτ dP = I_j + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN dP .)
Since σ ≤ τ , we have
Dj ∩ (τ ≤ j) = A ∩ (σ = j) ∩ (τ ≤ j) = Dj ∩ (τ = j) ,
so
I_j = ∫_{Dj∩(τ≤j)} Xτ dP + ∫_{Dj∩(τ>j)} Xj dP
= ∫_{Dj∩(τ=j)} Xj dP + ∫_{Dj∩(τ>j)} Xj dP
= ∫_{Dj} Xj dP = ∫_{A∩(σ=j)} Xj dP = ∫_{Dj} Xσ dP .   (5.6)
Hence we have shown (5.4) if we can show that the two last terms in (5.5) can be ignored.
Since (τ > N ) ↓ ∅ for N → ∞ and Xτ is integrable, we have from dominated convergence
that
lim_{N→∞} ∫_{Dj∩(τ>N)} Xτ dP = lim_{N→∞} ∫ 1_{Dj∩(τ>N)} Xτ dP = ∫ lim_{N→∞} 1_{Dj∩(τ>N)} Xτ dP = 0 .   (5.7)
And because of the assumption from the theorem, we must have a subsequence of natural
numbers (N_k ) with
∫_{(τ>N_k)} X_{N_k}+ dP → 0   ( ∫_{(τ>N_k)} |X_{N_k}| dP → 0 ).
Since ∫_{Dj∩(τ>N)} XN+ dP ≤ ∫_{(τ>N)} XN+ dP , letting N → ∞ along (N_k ) in (5.5) and using
(5.6) and (5.7) yields
∫_{Dj} Xτ dP ≥ (=) I_j = ∫_{Dj} Xσ dP ,
which is (5.4).
Corollary 5.2.13 (Optional sampling, bounded case). Let (Xn , Fn ) be a submartingale
(martingale) and let σ ≤ τ be bounded stopping times. Then E|Xτ | < ∞, E|Xσ | < ∞ and
E(Xτ |Fσ ) ≥ (=) Xσ .
Proof. We show that the conditions from Theorem 5.2.12 are fulfilled. There exists K < ∞
such that supω∈Ω τ (ω) ≤ K. Then
E|Xτ | = ∫ | Σ_{k=1}^K 1_{(τ=k)} Xk | dP ≤ Σ_{k=1}^K ∫ 1_{(τ=k)} |Xk | dP ≤ Σ_{k=1}^K ∫ |Xk | dP = Σ_{k=1}^K E|Xk | < ∞ .
That E|Xσ | < ∞ follows similarly. Furthermore it must hold that (τ > N ) = ∅ for all
N ≥ K, so obviously
∫_{(τ>N)} XN+ dP = 0
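A Monte Carlo sketch of the corollary: for a fair ±1 random walk (a martingale) and the bounded stopping time τ = min(first visit to level 3, 50), optional sampling gives E(Sτ ) = E(S1 ) = 0. The walk and the concrete stopping time are illustrative choices, not from the text.

```python
import numpy as np

# Monte Carlo sketch of Corollary 5.2.13 (illustrative setting): a fair +/-1 walk
# S_n is a martingale, and tau = min(inf{n : S_n = 3}, 50) is a bounded stopping
# time, so optional sampling gives E(S_tau) = E(S_1) = 0.
rng = np.random.default_rng(3)
n_paths, horizon, level = 100_000, 50, 3

steps = rng.choice([1, -1], size=(n_paths, horizon))
S = np.cumsum(steps, axis=1)
hit = S >= level
# tau = first time the walk reaches the level, capped at the horizon (0-indexed)
tau = np.where(hit.any(axis=1), hit.argmax(axis=1), horizon - 1)
S_tau = S[np.arange(n_paths), tau]
print("E(S_tau) ~", S_tau.mean())   # close to 0
```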
We can also translate Theorem 5.2.12 into a result concerning the process considered at a
sequence of stopping times. Firstly, we need to specify the sequence of stopping times.
Definition 5.2.14. A sequence (τn )n≥1 of positive random variables is a sequence of sam-
pling times if it is increasing and each τn is a finite stopping time.
5.3 The martingale convergence theorem 145
With (τn ) a sequence of sampling times it holds, according to 1. in Theorem 5.2.10, that
Fτn ⊆ Fτn+1 for all n. If (Xn , Fn ) is adapted then, according to Theorem 5.2.11, Xτn is
Fτn -measurable. Hence, the sampled sequence (Xτn , Fτn ) is adapted.
Theorem 5.2.15. Let (Xn , Fn ) be a submartingale (martingale) and let (τk ) be a sequence
of sampling times. If
E|Xτk | < ∞ for all k,   (a)
lim inf_{N→∞} ∫_{(τk>N)} XN+ dP = 0 for all k,   (b)
( lim inf_{N→∞} ∫_{(τk>N)} |XN | dP = 0 for all k ),
then the sampled sequence (Xτk , Fτk ) is a submartingale (martingale).
Lemma 5.3.1. (1) If (Xn , Fn ) is a martingale, then both (|Xn |, Fn ) and (Xn+ , Fn ) are submartingales.
(2) If (Xn , Fn ) is a submartingale, then (Xn+ , Fn ) is a submartingale.
Proof. (1) Since (Xn , Fn ) is a martingale, we have E(Xn+1 | Fn ) = Xn a.s. Then
E((Xn+1 )+ | Fn ) ≥ E(Xn+1 | Fn )+ = Xn+   a.s. ,
where the inequality follows from (9) in Theorem 4.2.6, since the function x ↦ x+ is convex.
Similarly, since x ↦ |x| is convex, (9) in Theorem 4.2.6 gives that
E(|Xn+1 | | Fn ) ≥ |E(Xn+1 | Fn )| = |Xn |   a.s.
(2) Now let (Xn , Fn ) be a submartingale. Since x ↦ x+ is convex and increasing, we obtain
E((Xn+1 )+ | Fn ) ≥ E(Xn+1 | Fn )+ ≥ Xn+   a.s.
Theorem 5.3.2 (The martingale convergence theorem). Let (Xn , Fn ) be a submartingale
with sup_n EXn+ < ∞. Then X = lim_{n→∞} Xn exists almost surely and E|X| < ∞.
The proof is given below. Note that, cf. Lemma 5.3.1, the sequence (EXn+ ) is increasing, so
the assumption sup_n EXn+ < ∞ is equivalent to assuming lim_{n→∞} EXn+ < ∞.
The proof is based on a criterion for convergence of a sequence of real numbers, which we
shall now discuss. Let (xn )n≥1 be a sequence of real numbers and let a < b be given. Define
n1 = inf{n | xn ≥ b},   m1 = inf{n > n1 | xn ≤ a},
and recursively
n_{k+1} = inf{n > m_k | xn ≥ b},   m_{k+1} = inf{n > n_{k+1} | xn ≤ a},
always using the convention inf ∅ = ∞. We now define the number of down-crossings from
b to a for the sequence (xn ) as +∞ if all mk < ∞ and as k if mk < ∞ and mk+1 = ∞ (in
particular, 0 down-crossings in the case m1 = ∞). Note that n1 ≤ m1 ≤ n2 ≤ m2 ≤ . . . with
equality only possible if the common value is ∞.
Lemma 5.3.3. The limit x = lim_{n→∞} xn exists (as a limit in R̄ = [−∞, ∞]) if and only if for
all rational a < b it holds that the number of down-crossings from b to a for (xn ) is finite.
Proof. Firstly, consider the "only if" implication. So assume that the limit x = lim_{n→∞} xn
exists in R̄ and let a < b be given. We must have that either x > a or x < b (if not both of
them are true). If for instance x > a, then we must have some n0 such that xn > a for all
n ≥ n0 , since e.g. for finite x there exists n0 with |x − xn | < x − a for all n ≥ n0 . But then
we must have that mk = ∞ for k large enough, which makes the number of down–crossings
finite. The case x < b is treated analogously.
Now we consider the ”if” part of the result, so assume that the limit limn→∞ xn does not
exist. Then lim inf n→∞ xn < lim supn→∞ xn , so in particular we can find rational numbers
a < b such that
lim inf xn < a < b < lim sup xn .
n→∞ n→∞
This implies that xn ≤ a for infinitely many n and xn ≥ b for infinitely many n. In particular,
the number of down–crossings from b to a must be infinite.
In the following proofs we shall apply the result to a sequence (Xn ) of real random variables.
For this we will need some notation similar to the definition of nk and mk above. Define the
random times
σ1 = inf{n | Xn ≥ b}, τ1 = inf{n > σ1 | Xn ≤ a},
and recursively
σ_{k+1} = inf{n > τk | Xn ≥ b},   τ_{k+1} = inf{n > σ_{k+1} | Xn ≤ a}.
These random times are all stopping times with respect to (Fn ); the arguments are analogous,
so we only give the one for τk , assuming that σk is already known to be a stopping time. We
have
(τk = n) = ∪_{j=1}^{n−1} (σk = j) ∩ (Xj+1 > a, . . . , Xn−1 > a, Xn ≤ a) ,
which belongs to Fn , since (σk = j) ∈ Fj ⊆ Fn and all the X–variables involved are Fn–
measurable, implying that (Xi > a) ∈ Fn (when i < n) and (Xn ≤ a) ∈ Fn . Hence τk is a
stopping time as well.
Let βab (ω) be the number of down–crossings from b to a for the sequence (Xn (ω)) for each
ω ∈ Ω. Then we have (we suppress the ω)
βab = Σ_{k=1}^∞ 1_{(τk<∞)} ,
so we see that βab is an integer–valued random variable. With this notation, we can also
define the number of down–crossings βab^N from b to a in the time interval {1, . . . , N } as
βab^N = Σ_{k=1}^∞ 1_{(τk≤N)} = Σ_{k=1}^N 1_{(τk≤N)} .
Finally note that βab^N ↑ βab as N → ∞.
Proof of Theorem 5.3.2. In order to show that X = lim Xn exists as a limit in R̄ it is,
according to Lemma 5.3.3, sufficient to show that
P (βab < ∞ for all rational a < b) = 1 .
But it follows directly from the down–crossing lemma (Lemma 5.3.4) that for all rational
pairs a < b we have P (βab < ∞) = 1. Hence also
1 = P ( ∩_{a<b rational} (βab < ∞) ) = P (βab is finite for all rational a < b) .
We still need to show that E|X| < ∞. In the affirmative, we will also know that X is finite
almost surely. Otherwise we would have
P (|X| = ∞) = ε > 0 ,
such that
E|X| ≥ E(1_{(|X|=∞)} · ∞) = ∞ · ε = ∞ ,
which is a contradiction. But since |Xn | = 2Xn+ − Xn and EXn ≥ EX1 by the submartingale
property, Fatou's lemma yields
E|X| ≤ lim inf_{n→∞} E|Xn | ≤ 2 lim inf_{n→∞} EXn+ − EX1 = 2 lim_{n→∞} EXn+ − EX1 < ∞ ,
which completes the proof.
The most significant and most difficult part of the proof of Theorem 5.3.2 is contained in the
next result.
Lemma 5.3.4 (The down-crossing lemma). Let (Xn , Fn ) be a submartingale and let a < b.
Then
Eβab ≤ (1/(b − a)) lim_{n→∞} E(Xn − b)+ .   (5.9)
In particular, if sup_n EXn+ < ∞, then βab < ∞ almost surely.
Proof. The last claim follows directly from (5.9) and the inequality (Xn − b)+ ≤ Xn+ + |b|,
since it is assumed in the theorem that sup_n EXn+ = lim_n EXn+ < ∞.
Note that for all N, k ∈ N it holds that 1_{(σk≤N)} = 1_{(τk≤N)} + 1_{(σk≤N<τk)} . Then we can write
Σ_{k=1}^N (Xτk∧N − Xσk∧N ) 1_{(σk≤N)} = Σ_{k=1}^N (Xτk∧N − Xσk∧N )(1_{(τk≤N)} + 1_{(σk≤N<τk)})
= Σ_{k=1}^N (Xτk − Xσk ) 1_{(τk≤N)} + Σ_{k=1}^N (XN − Xσk ) 1_{(σk≤N<τk)}
≤ (a − b) Σ_{k=1}^N 1_{(τk≤N)} + (XN − b) Σ_{k=1}^N 1_{(σk≤N<τk)}
≤ (a − b) βab^N + (XN − b)+ .
In the first inequality we have used that if τk < ∞, then Xσk ≥ b and Xτk ≤ a, and also that
if only σk < ∞ it holds that Xσk ≥ b. By rearranging the terms above, we obtain the inequality
βab^N ≤ (XN − b)+ / (b − a) − (1/(b − a)) Σ_{k=1}^N (Xτk∧N − Xσk∧N ) 1_{(σk≤N)} .   (5.10)
Note that for each k we have that σk ∧ N ≤ τk ∧ N are bounded stopping times. Hence
Corollary 5.2.13 yields that E|Xσk ∧N | < ∞, E|Xτk ∧N | < ∞ and E(Xτk ∧N |Fσk ∧N ) ≥ Xσk ∧N .
Then according to Lemma 5.2.4 it holds that
∫_F Xτk∧N dP ≥ ∫_F Xσk∧N dP
for all F ∈ Fσk∧N . Since (σk ≤ N ) ∩ (σk ∧ N = n) ∈ Fn for all n ∈ N, we must have that
(σk ≤ N ) ∈ Fσk∧N , which implies that
E( (Xτk∧N − Xσk∧N ) 1_{(σk≤N)} ) = ∫_{(σk≤N)} Xτk∧N dP − ∫_{(σk≤N)} Xσk∧N dP ≥ 0 .
Combining this with (5.10) and taking expectations gives
Eβab^N ≤ E(XN − b)+ / (b − a) .
Finally note that (Xn − b, Fn ) is a submartingale, since (Xn , Fn ) is a submartingale. Hence
the sequence ((Xn − b)+ , Fn ) will be a submartingale as well according to 2. in Lemma
5.3.1, such that E(XN − b)+ is increasing and thereby convergent. So applying monotone
convergence to the left hand side (recall βab^N ↑ βab ) leads to
Eβab ≤ (1/(b − a)) lim_{n→∞} E(Xn − b)+ .
2. If (Xn , Fn ) is a martingale and there exists c ∈ R such that either Xn ≤ c a.s. for all n
or Xn ≥ c a.s. for all n, then X = lim Xn exists almost surely and E|X| < ∞.
5.4 Martingales and uniform integrability 151
If (Xn , Fn ) is a martingale then in particular EXn = EX1 for all n, but even though Xn → X
a.s. and E|X| < ∞, it is not necessarily true that EX = EX1 . We shall later obtain some results
where not only EX = EX1 , but where in addition the martingale (Xn , Fn ) has a number of
other attractive properties. Let (Xn )n≥1 , X be real random variables with E|Xn | < ∞ for all n
and E|X| < ∞. Recall that by Xn → X in L1 we mean that
lim_{n→∞} E|Xn − X| = 0 ,
in which case ∫_F Xn dP → ∫_F X dP for all F ∈ F, since | ∫_F Xn dP − ∫_F X dP | ≤ E|Xn − X|.
Definition 5.4.1. A family (Xi )i∈I of real random variables is uniformly integrable if
E|Xi | < ∞ for all i ∈ I and
lim_{x→∞} sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP = 0 .
Example 5.4.2. (1) The family {X} consisting of only one real random variable is uniformly
integrable if E|X| < ∞:
lim_{x→∞} ∫_{(|X|>x)} |X| dP = lim_{x→∞} ∫ 1_{(|X|>x)} |X| dP = 0
by dominated convergence, since 1_{(|X|>x)} |X| ≤ |X| and 1_{(|X|>x)} |X| → 0 pointwise as x → ∞.
(2) Now consider a finite family (Xi )_{i=1,...,n} of real random variables. This family is uniformly
integrable if each E|Xi | < ∞:
lim_{x→∞} sup_{i=1,...,n} ∫_{(|Xi|>x)} |Xi | dP = lim_{x→∞} max_{i=1,...,n} ∫_{(|Xi|>x)} |Xi | dP = 0 ,
since each of the finitely many terms converges to 0 by (1). ◦
Example 5.4.3. Let (Xi )i∈I be a family of real random variables. If sup_{i∈I} E|Xi |^r < ∞ for
some r > 1, then (Xi ) is uniformly integrable:
∫_{(|Xi|>x)} |Xi | dP ≤ ∫_{(|Xi|>x)} |Xi | (|Xi |/x)^{r−1} dP = (1/x^{r−1}) ∫_{(|Xi|>x)} |Xi |^r dP
≤ (1/x^{r−1}) E|Xi |^r ≤ (1/x^{r−1}) sup_{j∈I} E|Xj |^r ,
so we obtain
sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP ≤ (1/x^{r−1}) sup_{j∈I} E|Xj |^r ,
where the right hand side converges to 0 as x → ∞, since r > 1. ◦
Theorem 5.4.4. The family (Xi )i∈I is uniformly integrable if and only if
(1) sup_{i∈I} E|Xi | < ∞, and
(2) for all ε > 0 there exists δ > 0 such that sup_{i∈I} ∫_A |Xi | dP ≤ ε for all A ∈ F with P (A) < δ.
Proof. First we demonstrate the "only if" statement. So assume that (Xi ) is uniformly
integrable. For all x > 0 we have for all i ∈ I that
E|Xi | = ∫_Ω |Xi | dP = ∫_{(|Xi|≤x)} |Xi | dP + ∫_{(|Xi|>x)} |Xi | dP
≤ x P (|Xi | ≤ x) + ∫_{(|Xi|>x)} |Xi | dP ≤ x + ∫_{(|Xi|>x)} |Xi | dP ,
so also
sup_{i∈I} E|Xi | ≤ x + sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP ,
which is finite when x is chosen so large that the last term is, say, at most 1. This shows (1).
To show (2), let ε > 0 be given. Then for all A ∈ F we have (with a similar argument to the
one above)
∫_A |Xi | dP = ∫_{A∩(|Xi|≤x)} |Xi | dP + ∫_{A∩(|Xi|>x)} |Xi | dP
≤ x P (A ∩ (|Xi | ≤ x)) + ∫_{A∩(|Xi|>x)} |Xi | dP
≤ x P (A) + ∫_{(|Xi|>x)} |Xi | dP ,
so
sup_{i∈I} ∫_A |Xi | dP ≤ x P (A) + sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP .
Now choose x = x0 > 0 according to the assumption of uniform integrability such that
sup_{i∈I} ∫_{(|Xi|>x0)} |Xi | dP < ε/2 .
Then for A ∈ F with P (A) < ε/(2x0 ) we must have
sup_{i∈I} ∫_A |Xi | dP < x0 · ε/(2x0 ) + ε/2 = ε ,
so if we choose δ = ε/(2x0 ) we have shown (2).
For the "if" part of the theorem, assume that both (1) and (2) hold. Assume furthermore
for the moment that it has been shown that
lim_{x→∞} sup_{i∈I} P (|Xi | > x) = 0 .   (5.11)
In order to obtain that the definition of uniform integrability is fulfilled, let ε > 0 be given.
We want to find x0 > 0 such that
sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP ≤ ε
for x > x0 . Find the δ > 0 corresponding to ε according to (2), and then let x0 satisfy that
sup_{i∈I} P (|Xi | > x) < δ
for x > x0 ; this is possible because of (5.11). Now for all x > x0 and i ∈ I we have
P (|Xi | > x) < δ, such that (because of (2))
∫_{(|Xi|>x)} |Xi | dP ≤ sup_{j∈I} ∫_{(|Xi|>x)} |Xj | dP ≤ ε ,
so also
sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP ≤ ε .
Hence the proof is complete if we can show (5.11). But for x > 0 it is obtained from Markov's
inequality that
sup_{i∈I} P (|Xi | > x) ≤ (1/x) sup_{i∈I} E|Xi | ,
and the last term → 0 as x → ∞, since sup_{i∈I} E|Xi | is finite by the assumption (1).
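A standard example of what condition (1) alone does not rule out: on ((0, 1), B, λ), the family Xn = n·1_(0,1/n) has E|Xn | = 1 for all n, yet for every x all the mass sits in the tail, so the family is not uniformly integrable. The numerical sketch below is an illustration; the concrete family and cutoff are standard textbook choices, not from the text above.

```python
import numpy as np

# Numerical sketch (illustrative): on ((0,1), B, lambda) let X_n = n * 1_(0, 1/n).
# Then E|X_n| = 1 for all n, so (1) of Theorem 5.4.4 holds, but for every x the
# tail integral \int_{(|X_n| > x)} |X_n| dP equals 1 for n > x, so the family is
# not uniformly integrable.
rng = np.random.default_rng(4)
U = rng.random(1_000_000)                        # omega ~ lambda on (0,1)

for n in [10, 100, 1000]:
    Xn = n * (U < 1.0 / n)
    tail = Xn[Xn > 5.0].sum() / len(U)           # Monte Carlo tail integral at x = 5
    print(f"n={n}: E|X_n| ~ {Xn.mean():.3f}, tail integral ~ {tail:.3f}")
```

Note that Xn → 0 in probability but E|Xn − 0| = 1, which is the situation Theorem 5.4.5 below excludes under uniform integrability.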
The next result demonstrates the importance of uniform integrability if one aims to show L1
convergence.
Theorem 5.4.5. Let (Xn )n≥1 , X be real random variables with E|Xn | < ∞ for all n. Then
E|X| < ∞ and Xn → X in L1 if and only if (Xn )n≥1 is uniformly integrable and Xn → X
in probability.
Proof. Assume that E|X| < ∞ and Xn → X in L1 . Then in particular Xn → X in probability.
We will show that (Xn ) is uniformly integrable by showing that (1) and (2) from Theorem 5.4.4
are satisfied.
For each n we have E|Xn | ≤ E|X| + E|Xn − X|, so
sup_{n≥1} E|Xn | ≤ E|X| + sup_{n≥1} E|Xn − X| .
Since each E|Xn − X| ≤ E|Xn | + E|X| < ∞ and the sequence (E|Xn − X|)n∈N converges to
0 (according to the L1 –convergence), it must be bounded, so supn≥1 E|Xn − X| < ∞.
Now let ε > 0 be given and choose n0 so that E|Xn − X| < ε/2 for n > n0 . Furthermore (since
the one–member family {X} is uniformly integrable) we can choose δ1 > 0 such that
∫_A |X| dP < ε/2   if P (A) < δ1 .
Then we have
∫_A |Xn | dP ≤ E|Xn − X| + ∫_A |X| dP < ε
whenever n > n0 and P (A) < δ1 .
Now choose δ2 > 0 (since the finite family (Xn )_{1≤n≤n0} is uniformly integrable) such that
max_{1≤n≤n0} ∫_A |Xn | dP < ε
if P (A) < δ2 . We have altogether with δ = δ1 ∧ δ2 that for all n ∈ N it holds that ∫_A |Xn | dP < ε
if P (A) < δ, and this shows (2), since then
sup_{n≥1} ∫_A |Xn | dP ≤ ε   if P (A) < δ .
For the converse implication, assume that (Xn )n≥1 is uniformly integrable and Xn → X in
probability. Then we can choose (according to Theorem 1.2.13) a subsequence (nk )k≥1 such
that Xnk → X a.s. From Fatou's lemma and the fact that (1) is satisfied we obtain
E|X| = E lim_{k→∞} |Xnk | = E lim inf_{k→∞} |Xnk | ≤ lim inf_{k→∞} E|Xnk | ≤ sup_{k≥1} E|Xnk | ≤ sup_{n≥1} E|Xn | < ∞ .
In order to show that E|Xn − X| → 0, let ε > 0 be given. We obtain for all n ∈ N
E|Xn − X| = ∫_{(|Xn−X|≤ε/2)} |Xn − X| dP + ∫_{(|Xn−X|>ε/2)} |Xn − X| dP
≤ ε/2 + ∫_{(|Xn−X|>ε/2)} |Xn − X| dP
≤ ε/2 + ∫_{(|Xn−X|>ε/2)} |Xn | dP + ∫_{(|Xn−X|>ε/2)} |X| dP .
In accordance with (2) applied to the two families (Xn ) and (X), choose δ > 0 such that
sup_{m∈N} ∫_A |Xm | dP < ε/4   and   ∫_A |X| dP < ε/4
for all A ∈ F with P (A) < δ. Since Xn → X in probability we can find n0 ∈ N such that
P (|Xn − X| > ε/2) < δ for n ≥ n0 . Then for n ≥ n0 we have
∫_{(|Xn−X|>ε/2)} |Xn | dP ≤ sup_{m≥1} ∫_{(|Xn−X|>ε/2)} |Xm | dP < ε/4
and
∫_{(|Xn−X|>ε/2)} |X| dP < ε/4 ,
so E|Xn − X| < ε for all n ≥ n0 . Hence E|Xn − X| → 0.
After this digression into the standard theory of integration, we now return to adapted
sequences, martingales and submartingales. We say that a real random variable Y with
E|Y | < ∞ closes the submartingale (martingale) (Xn , Fn ) if
E(Y | Fn ) ≥ (=) Xn   a.s.
for all n ≥ 1.
Theorem 5.4.7. (1) Let (Xn , Fn ) be a submartingale. If (Xn+ )n≥1 is uniformly integrable,
then X = lim_{n→∞} Xn exists almost surely, X closes the submartingale and Xn+ → X + in L1 .
A sufficient condition for uniform integrability of (Xn+ ) is the existence of a random variable
Y that closes the submartingale.
(2) Let (Xn , Fn ) be a martingale. If (Xn ) is uniformly integrable, then X = lim_{n→∞} Xn exists
almost surely, X closes the martingale and Xn → X in L1 .
A sufficient condition for uniform integrability of (Xn ) is the existence of a random variable
Y that closes the martingale.
Proof. (1) We start with the final claim, so assume that there exists Y such that
E(Y | Fn ) ≥ Xn a.s. for all n. Then
Xn+ ≤ E(Y | Fn )+ ≤ E(Y + | Fn )   a.s. ,   (5.12)
and taking expectations on both sides yields EXn+ ≤ EY + , so also sup_n EXn+ ≤ EY + < ∞.
Then according to the martingale convergence theorem (Theorem 5.3.2) we have that X =
limn→∞ Xn exists almost surely. Since (Xn (ω))n→∞ is convergent for almost all ω ∈ Ω it is
especially bounded (not by the same constant for each ω), and in particular supn Xn+ (ω) < ∞
for almost all ω. We obtain that for all x > 0 and n ∈ N
∫_{(Xn+>x)} Xn+ dP ≤ ∫_{(Xn+>x)} E(Y + |Fn ) dP = ∫_{(Xn+>x)} Y + dP ≤ ∫_{(supk Xk+>x)} Y + dP ,
where the first inequality is due to (5.12) and the equality follows from the definition of
conditional expectations, since Xn is Fn –measurable such that (Xn+ > x) ∈ Fn . Since this
is true for all n ∈ N we have
sup_{n∈N} ∫_{(Xn+>x)} Xn+ dP ≤ ∫_{(supn Xn+>x)} Y + dP .
Furthermore we have that Y + is integrable with 1_{(supn Xn+>x)} Y + ≤ Y + for all x, and obviously,
since supn Xn+ < ∞ almost surely, we have 1_{(supn Xn+>x)} Y + → 0 a.s. when x → ∞. Then
from dominated convergence the right hand integral above will → 0 as x → ∞. Hence we
have shown that (Xn+ ) is uniformly integrable.
Now we return to the first statement, so assume that (Xn+ ) is uniformly integrable. In
particular (according to 1 in Theorem 5.4.4) we have supn EXn+ < ∞. Then The Martingale
convergence Theorem yields that X = limn→∞ Xn exists almost surely with E|X| < ∞.
Obviously we must also have Xn+ → X + a.s., and then Theorem 5.4.5 implies that Xn+ → X +
in L1 . To show that X closes the submartingale, i.e. that E(X|Fn ) ≥ Xn a.s. for all n ∈ N, it
is equivalent (according to Lemma 5.2.4) to show that for all n ∈ N it holds
∫_F Xn dP ≤ ∫_F X dP   (5.13)
for all F ∈ Fn .
for all F ∈ Fn . For F ∈ Fn and n ≤ N we have (according to Corollary 5.2.5, since (Xn , Fn )
is a submartingale)
∫_F Xn dP ≤ ∫_F XN dP = ∫_F XN+ dP − ∫_F XN− dP .
Since it has been shown that XN+ → X + in L1 , we have from the remark just before Definition 5.4.1 that
lim_{N→∞} ∫_F XN+ dP = ∫_F X + dP .
Furthermore, Fatou's lemma yields (when using that XN → X a.s., so XN− → X − a.s. and
thereby lim inf_{N→∞} XN− = X − a.s.)
lim sup_{N→∞} ( − ∫_F XN− dP ) = − lim inf_{N→∞} ∫_F XN− dP ≤ − ∫_F lim inf_{N→∞} XN− dP = − ∫_F X − dP .
When combining the obtained inequalities, we have for all n ∈ N and F ∈ Fn that
∫_F Xn dP ≤ lim sup_{N→∞} ( ∫_F XN+ dP − ∫_F XN− dP )
= lim_{N→∞} ∫_F XN+ dP + lim sup_{N→∞} ( − ∫_F XN− dP )
≤ ∫_F X + dP − ∫_F X − dP = ∫_F X dP .
(2) Once again we start by proving the last claim, so assume that Y closes the martingale, so
E(Y |Fn ) = Xn a.s. for all n ∈ N. From (10) in Theorem 4.2.6 we have
|Xn | = |E(Y |Fn )| ≤ E(|Y | | Fn )   a.s.
Similar to the argument in (1) we then have sup_{n≥1} E|Xn | ≤ E|Y | < ∞, so in particular
sup_{n≥1} EXn+ < ∞. Hence X = lim_{n→∞} Xn exists almost surely, leading to the fact that
sup_{n≥1} |Xn | < ∞ almost surely. Then for all n ∈ N and x > 0
∫_{(|Xn|>x)} |Xn | dP ≤ ∫_{(|Xn|>x)} E(|Y ||Fn ) dP = ∫_{(|Xn|>x)} |Y | dP ≤ ∫_{(supk |Xk|>x)} |Y | dP ,
where the right hand side tends to 0 as x → ∞, exactly as in the proof of (1). Hence (Xn )
is uniformly integrable.
Finally assume that (Xn ) is uniformly integrable. Then supn E|Xn | < ∞ (from Theorem
5.4.4) and in particular supn EXn+ < ∞. According to the martingale convergence theorem
we have Xn → X a.s. with E|X| < ∞. From Theorem 5.4.5 we have Xn → X in L1 , which
leads to (see the remark just before Definition 5.4.1)
lim_{N→∞} ∫_F XN dP = ∫_F X dP
for all F ∈ F. We also have from the martingale property of (Xn , Fn ) that for all n ≤ N
and F ∈ Fn
∫_F Xn dP = ∫_F XN dP ,
so we must have
∫_F Xn dP = ∫_F X dP
for all n ∈ N and F ∈ Fn . This shows that E(X|Fn ) = Xn a.s., so X closes the martingale.
An important warning is the following: Let (Xn , Fn ) be a submartingale and assume that
(Xn+ )n≥1 is uniformly integrable. As we have seen, we then have Xn → X a.s. and Xn+ → X +
in L1 , but we do not in general have Xn → X in L1 . If, e.g., (Xn− )n≥1 is also uniformly
integrable, then it is true that Xn → X in L1 , since Xn− → X − a.s. implies that Xn− → X −
in L1 by Theorem 5.4.5, and then E|Xn − X| ≤ E|Xn+ − X + | + E|Xn− − X − | → 0.
Theorem 5.4.8. Let (Fn )n∈N be a filtration on (Ω, F, P ) and let Y be a random variable
with E|Y | < ∞. Define for all n ∈ N
Xn = E(Y |Fn ) .
Then X = lim_{n→∞} Xn exists almost surely, Xn → X in L1 and
X = E(Y |F∞ ) ,
where F∞ = σ( ∪_{n=1}^∞ Fn ).
Proof. Since E(Xn+1 |Fn ) = E( E(Y |Fn+1 ) | Fn ) = E(Y |Fn ) = Xn a.s. by the tower property,
we see that (Xn , Fn ) is a martingale. It follows directly from the definition of Xn that Y
closes the martingale, so by (2) in Theorem 5.4.7 it holds that X = lim Xn exists a.s. and that
Xn → X in L1 .
It remains to show that X = E(Y |F∞ ). For this, first note that X = lim_{n→∞} Xn can be
chosen F∞–measurable, since
F = ( lim_{n→∞} Xn exists ) = ∩_{N=1}^∞ ∪_{k=1}^∞ ∩_{m,n=k}^∞ ( |Xn − Xm | ≤ 1/N ) ∈ F∞ .
It then suffices to show that ∫_F X dP = ∫_F Y dP for all F in the intersection–stable
generating system ∪_{n=1}^∞ Fn for F∞ . So let F ∈ Fn for some n. For N ≥ n we have
∫_F XN dP = ∫_F Y dP ,
since XN = E(Y |FN ) and F ∈ Fn ⊆ FN . Furthermore we have XN → X in L1 , so
∫_F X dP = lim_{N→∞} ∫_F XN dP = ∫_F Y dP ,
as desired.
In conclusion of this chapter, we shall discuss an extension of the optional sampling theorem
(Theorem 5.2.12).
Let (Xn , Fn )n≥1 be a (usual) submartingale or martingale, and let τ be an arbitrary stopping
time. If lim Xn = X exists a.s., we can define Xτ even if τ is not finite, as is assumed in
Theorem 5.2.12:
Xτ (ω) = X_{τ(ω)} (ω)   if τ (ω) < ∞,
Xτ (ω) = X(ω)           if τ (ω) = ∞.
With this, Xτ is an Fτ -measurable random variable, which is only defined if (Xn ) converges
a.s.
Theorem 5.4.9 (Optional sampling, full version). Let (Xn , Fn ) be a submartingale (mar-
tingale) and let σ ≤ τ be stopping times. If one of the three following conditions is satisfied,
then E|Xτ | < ∞ and
E(Xτ |Fσ ) ≥ Xσ .
(=)
(1) τ is bounded.
(2) τ < ∞ a.s., E|Xτ | < ∞ and
lim inf_{N→∞} ∫_{(τ>N)} XN+ dP = 0
( lim inf_{N→∞} ∫_{(τ>N)} |XN | dP = 0 ).
(3) (Xn+ )n≥1 is uniformly integrable ((Xn )n≥1 is uniformly integrable if (Xn , Fn ) is a
martingale).
If (3) holds, then lim Xn = X exists a.s. with E|X| < ∞, and for an arbitrary stopping time τ
it holds that
E|Xτ | ≤ 2EX + − EX1 .   (5.17)
Proof. That the conclusion is true under assumption (1) is simply Corollary 5.2.13. Com-
paring condition (2) with the conditions in Theorem 5.2.12 shows, that if (2) implies that
E|Xσ | < ∞, then the argument concerning condition (2) is complete. For this consider the
increasing sequence of bounded stopping times given by (σ ∧ n)n≥1 . For each n the pair of
stopping times σ ∧ n ≤ σ ∧ (n + 1) fulfils the conditions of Corollary 5.2.13, so E|Xσ∧n | < ∞,
E|Xσ∧(n+1) | < ∞, and
E(Xσ∧(n+1) |Fσ∧n ) ≥ Xσ∧n a.s. ,
(=)
which shows that the adapted sequence (Xσ∧n , Fσ∧n ) is a submartingale (martingale). We
have similarly that the pair of stopping times σ ∧ n ≤ τ fulfils the conditions from Theo-
rem 5.2.12 (because of the assumption (2) and that E|Xσ∧n | < ∞ as argued above). Hence
the theorem yields that for each n ∈ N we have
E(Xτ | Fσ∧n ) ≥ (=) Xσ∧n   a.s. ,
which shows that Xτ closes the submartingale (martingale) (Xσ∧n , Fσ∧n ). Hence according
to Theorem 5.4.7 it converges almost surely, where the limit is integrable. Since obviously,
a.s.
Xσ∧n −→ Xσ we conclude that E|Xσ | < ∞ as desired.
We finally show that (3) implies (5.15) if (Xn , Fn ) is a submartingale. That (3) implies
(5.16) if (Xn , Fn ) is a martingale, is then seen as follows: From the fact that (|Xn |) is uniformly
integrable we have that both (Xn+ ) and (Xn− ) are uniformly integrable. Since both (Xn , Fn )
and (−Xn , Fn ) are submartingales (with (Xn+ ) and ((−Xn )+ ) uniformly integrable), the result
for submartingales can be applied to obtain the desired conclusion.
So, assume that (Xn , Fn ) is a submartingale and that (Xn+ ) is uniformly integrable. From (1)
in Theorem 5.4.7 we have that X = limn→∞ Xn exists almost surely and that X closes the
submartingale (Xn , Fn ). Since (Xn+ , Fn ) is a submartingale as well we can apply Theorem
5.4.7 again and obtain that X + closes the submartingale (Xn+ , Fn ). Using the notation
X∞+ = X + , we get
EXτ+ = ∫ Σ_{1≤n≤∞} 1_{(τ=n)} Xn+ dP = Σ_{1≤n≤∞} ∫_{(τ=n)} Xn+ dP
≤ Σ_{1≤n≤∞} ∫_{(τ=n)} X + dP = ∫ Σ_{1≤n≤∞} 1_{(τ=n)} X + dP = ∫ X + dP = EX + .
At the inequality we have used Lemma 5.2.4 and (τ = n) ∈ Fn , since E(X + |Fn ) ≥ Xn+ . Let
N ∈ N. Then τ ∧ N is a bounded stopping time with τ ∧ N ↑ τ as N → ∞. Applying the
inequality above to τ ∧ N yields
E(Xτ∧N )+ ≤ EX + .   (5.18)
Furthermore we have according to part (1) in the theorem (since 1 ≤ τ ∧ N are bounded
stopping times), that
EX1 ≤ EXτ ∧N . (5.19)
Then combining (5.18) and (5.19) gives
E(Xτ∧N )− = E(Xτ∧N )+ − EXτ∧N ≤ EX + − EX1 ,
and since Xτ∧N → Xτ a.s., Fatou's lemma yields EXτ− ≤ EX + − EX1 . Thereby we have
E|Xτ | = EXτ+ + EXτ− ≤ 2EX + − EX1 ,
which in particular is finite.
It remains to show that
E(Xτ |Fσ ) ≥ (=) Xσ .
According to Lemma 5.2.4 it suffices to show that
∫_{F∩(σ=k)} Xσ dP ≤ ∫_{F∩(σ=k)} Xτ dP   (5.20)
for all F ∈ Fσ and k ∈ N, since in that case we can obtain (using dominated convergence,
since E|Xτ | < ∞ and E|Xσ | < ∞)
∫_F Xσ dP = Σ_{1≤k≤∞} ∫_{F∩(σ=k)} Xk dP ≤ Σ_{1≤k≤∞} ∫_{F∩(σ=k)} Xτ dP = ∫_F Xτ dP .
So we will show the inequality (5.20). Part (1) of the theorem yields (since σ∧N ≤ τ∧N are bounded stopping times) that E(X_{τ∧N} | F_{σ∧N}) ≥ X_{σ∧N} a.s. Now let Fk = F ∩ (σ = k) and assume that N ≥ k. Then Fk ∈ F_{σ∧N}:
\[
F_k \cap (\sigma \wedge N = n) =
\begin{cases}
F \cap (\sigma = n) \in \mathcal{F}_n & n = k \\
\emptyset \in \mathcal{F}_n & n \ne k
\end{cases}
\]
Hence, on Fk,
\[
\int_{F_k} X_k \, dP = \int_{F_k} X_{\sigma \wedge N} \, dP
\le \int_{F_k} X_{\tau \wedge N} \, dP
= \int_{F_k \cap (\tau \le N)} X_\tau \, dP + \int_{F_k \cap (\tau > N)} X_N \, dP .
\]
We will spend the rest of the proof on finding upper limits of the two terms as N → ∞. Considering the first term, we have from dominated convergence (since |1_{Fk∩(τ≤N)} Xτ| ≤ |Xτ|, Xτ is integrable, and (τ ≤ N) ↑ (τ < ∞)) that
\[ \lim_{N \to \infty} \int_{F_k \cap (\tau \le N)} X_\tau \, dP = \int_{F_k \cap (\tau < \infty)} X_\tau \, dP . \]
For the second term we will use that Fk ∈ Fσ∧N ⊆ FN and (τ > N ) = (τ ≤ N )c ∈ FN , such
that Fk ∩ (τ > N ) ∈ FN . Then, since X closes the submartingale (Xn , Fn ), we obtain
\[ \int_{F_k \cap (\tau > N)} X_N \, dP \le \int_{F_k \cap (\tau > N)} X \, dP . \]
The principle, that sums of independent random variables almost follow a normal distribution,
is sound. But it underestimates the power of central limit theorems: Sums of dependent
variables are very often approximately normal as well. Many common dependence structures are weak in the sense that each term may be strongly dependent on a few other variables, but almost independent of the major part of the variables. Hence the sum of such variables will
have a probabilistic structure resembling the sum of independent variables.
An important theme of probability theory in the 20th century has been the refinement of this loose reasoning. How should "weak dependence" be understood, and how is it possible to inspect the difference between the sum of interest and the corresponding sum of independent variables? Typically, smaller and rather specific classes of models have been studied, while a general framework was missing; a huge number of papers exists, each focusing on its own specific class of models.
The big turning point was reached around 1970 when a group of mathematicians, more or less independently of each other, succeeded in stating and proving central limit theorems in the framework of martingales. Almost all later work in the area has been based on the martingale results.
5.5 The martingale central limit theorem 165
In the following we will consider a filtered probability space (Ω, F, (Fn ), P ), and for notational
reasons we will often need a "time 0" σ-algebra F0. In the absence of any other natural choice, we will use the trivial σ-algebra
\[ \mathcal{F}_0 = \{ \emptyset, \Omega \} . \]
Definition 5.5.1. A real stochastic process (Xn)n≥1 is a martingale difference, relative to the filtration (Fn)n≥1, if each Xn is integrable and Fn-measurable with
\[ E(X_n \mid \mathcal{F}_{n-1}) = 0 \quad \text{a.s. for all } n \ge 1 . \]
If (Xn)n≥1 is a martingale difference, then the process (Sn)n≥1 of partial sums Sn = X1 + · · · + Xn is a martingale, relative to the same filtration, and all the variables in this martingale will have mean 0. Conversely, if (Sn)n≥1 is a martingale with all variables having mean 0, then the differences
\[ X_1 = S_1 , \qquad X_n = S_n - S_{n-1} \ \text{ for } n = 2, 3, \dots \]
form a martingale difference.
Example 5.5.2. If the variables (Xn )n≥1 are independent and all have mean 0, then the
sequence forms a martingale difference with respect to the natural filtration
Fn = σ(X1 , . . . , Xn ) ,
since
\[ E(X_n \mid \mathcal{F}_{n-1}) = E(X_n \mid X_1, \dots, X_{n-1}) = E(X_n) = 0 \quad \text{a.s. for all } n \ge 1 . \]
The martingale corresponding to this martingale difference is what we normally interpret as a random walk. ◦

We will typically be interested in square-integrable martingale differences, that is, martingale differences (Xn)n≥1 such that EXn² < ∞ for all n ≥ 1. For such a martingale difference, define
\[ W_n = \sum_{m=1}^{n} E(X_m^2 \mid \mathcal{F}_{m-1}) . \]
In the terminology of martingales, the process (Wn)n≥1 is often called the compensator of the martingale (Sn)n≥1. It is easily shown that
\[ S_n^2 - W_n \]
is a martingale. It should be noted that in the case of a random walk, where the X-variables are independent, the compensator is non-random; more precisely,
\[ W_n = \sum_{m=1}^{n} E X_m^2 . \]
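As an illustration of the non-random compensator (a simulation sketch of our own, not from the text): for a random walk with iid ±1 steps we have Wn = n, and since Sn² − Wn has mean zero, a Monte Carlo estimate of E Sn² should be close to n. The step distribution and sample sizes below are our own choices.

```python
import random

def simulate_sn_squared(n, reps, seed=0):
    """Monte Carlo estimate of E(S_n^2) for a random walk with iid +-1 steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        s = sum(rng.choice((-1, 1)) for _ in range(n))
        total += s * s
    return total / reps

# With E X_m = 0 and E X_m^2 = 1 the compensator is W_n = n, and since
# S_n^2 - W_n is a mean-zero martingale, E S_n^2 should be close to n.
estimate = simulate_sn_squared(50, reps=20000)
print(estimate)  # close to 50
```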
We shall study the so–called martingale difference arrays, abbreviated MDA’s. These are
triangular arrays (Xnm ) of real random variables,
\[
\begin{array}{cccc}
X_{11} & & & \\
X_{21} & X_{22} & & \\
X_{31} & X_{32} & X_{33} & \\
\vdots & \vdots & \vdots & \ddots
\end{array}
\]
To avoid heavy notation we assume that the same fixed filtration (Fn )n≥1 is used in all rows.
In principle, it would have been possible to use an entire triangular array of σ-algebras (Fnm), since we will not need anywhere in the arguments that the σ-algebras in different rows are related, but in practice the higher generality will not be useful at all.
Under these notation-dictated conditions, the assumption for being an MDA will be that each row is a martingale difference with respect to (Fn)n≥1:
\[ E(X_{nm} \mid \mathcal{F}_{m-1}) = 0 \quad \text{a.s. for all } n \ge 1,\ m = 1, \dots, n . \]
Usually we will assume that all the variables in the array have second order moments. Sim-
ilarly to the notation in Section 3.5 we introduce the cumulated sums within rows, defined
by
\[ S_{nm} = \sum_{k=1}^{m} X_{nk} \quad \text{for } n \ge 1,\ m = 1, \dots, n . \]
A central limit theorem in this framework will be a result concerning the full row sums Snn
converging in distribution towards a normal distribution as n → ∞.
In Section 3.5 we saw that under a condition of independence within rows, a central limit theorem is obtained by demanding that the variance of the row sums converges towards a fixed constant and that the terms in the sums are sufficiently small (Lyapounov's condition or Lindeberg's condition).
When generalizing to martingale difference arrays, it is still important to ensure that the
terms are small. But the condition concerning convergence of the variance of the row sums
is changed substantially. The new condition will be that the compensators of the rows
\[ \sum_{m=1}^{n} E(X_{nm}^2 \mid \mathcal{F}_{m-1}) , \tag{5.21} \]
(that are random variables) converge in probability towards a non–zero constant. This con-
stant will serve as the variance in the limit normal distribution. Without loss of generality,
we shall assume that this constant is 1.
In order to ease notation, we introduce the conditional variances of the variables in the array,
\[ V_{nm} = E(X_{nm}^2 \mid \mathcal{F}_{m-1}) \quad \text{for } n \ge 1,\ m = 1, \dots, n , \]
and the partial compensators within rows,
\[ W_{nm} = \sum_{k=1}^{m} V_{nk} . \]
Note that the Vnm's are all non-negative (almost surely), and that Wnm thereby grows as m increases. Furthermore note that Wnm is Fm−1-measurable.
We will use repeatedly that for a bounded sequence of random variables that converges in probability to a constant, the integrals will also converge:
Lemma 5.5.3. Let (Xn) be a sequence of real random variables such that |Xn| ≤ C for all n ≥ 1 and some constant C. If Xn → x in probability, then EXn → x as well.
Proof. Note that (Xn )n≥1 is uniformly integrable, and thereby it converges in L1 . Hence the
convergence of integrals follows.
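A tiny numerical illustration of the lemma (our own example, not from the text): the variables Xn = min(nU, 1), with U uniform on (0, 1), are bounded by 1 and converge in probability to 1, and their means, exactly 1 − 1/(2n), indeed converge to 1.

```python
import random

def mean_xn(n, reps, seed=1):
    """Monte Carlo estimate of E[min(n*U, 1)] for U uniform on (0,1).

    The variables are bounded by 1 and converge in probability to 1,
    so by the lemma their means must converge to 1 (exactly 1 - 1/(2n))."""
    rng = random.Random(seed)
    return sum(min(n * rng.random(), 1.0) for _ in range(reps)) / reps

means = [mean_xn(n, 50000) for n in (1, 10, 100)]
print(means)  # increasing towards 1
```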
Lemma 5.5.4. Let (Xnm) be a triangular array consisting of real random variables with third order moment. Assume that there exists a filtration (Fn)n≥1 making each row in the array a martingale difference. Assume furthermore that
\[ \sum_{m=1}^{n} E(X_{nm}^2 \mid \mathcal{F}_{m-1}) \xrightarrow{P} 1 \quad \text{as } n \to \infty . \]
If
\[ \sum_{m=1}^{n} E(X_{nm}^2 \mid \mathcal{F}_{m-1}) \le 2 \quad \text{a.s. for all } n \ge 1, \tag{5.22} \]
and the unconditional Lyapounov condition
\[ \sum_{m=1}^{n} E|X_{nm}|^3 \to 0 \quad \text{as } n \to \infty \tag{5.23} \]
holds, then
\[ S_{nn} \xrightarrow{wk} N(0,1) \quad \text{as } n \to \infty . \]
Note that it is not important which upper bound is used in (5.22): the number 2 could be replaced by any constant c > 1 without changing the proof and without changing the utility of the lemma.
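A minimal simulation sketch of the lemma (a toy example of our own, not from the text): for the array Xnm = εm/√n with εm iid ±1 we have Vnm = 1/n, so Wnn = 1 ≤ 2 and all conditions hold, and the full row sums should look standard normal.

```python
import math
import random

def row_sum(n, rng):
    """One full row sum S_nn of the array X_nm = eps_m / sqrt(n), with
    eps_m iid +-1; here V_nm = 1/n, so the compensator W_nn equals 1."""
    return sum(rng.choice((-1, 1)) for _ in range(n)) / math.sqrt(n)

rng = random.Random(2)
samples = [row_sum(200, rng) for _ in range(5000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # empirical mean near 0 and variance near 1
```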
Proof. We will show that
\[ \int e^{i S_{nn} t + W_{nn} t^2/2} \, dP \to 1 \quad \text{as } n \to \infty \tag{5.24} \]
for each t ∈ R. That this suffices is seen from the following: Let φn(t) be the characteristic function of Snn; then we have
\[
\varphi_n(t)\, e^{t^2/2} = \int e^{i S_{nn} t + t^2/2} \, dP
= \int e^{i S_{nn} t + W_{nn} t^2/2} \, dP
+ \int e^{i S_{nn} t} \bigl( e^{t^2/2} - e^{W_{nn} t^2/2} \bigr) \, dP .
\]
Since we have assumed that Wnn → 1 in probability and the function x ↦ exp(t²/2) − exp(xt²/2) is continuous at 1 (with the value 0 at 1), we must have that
\[ e^{t^2/2} - e^{W_{nn} t^2/2} \xrightarrow{P} 0 . \]
Then, for every ε > 0,
\[
P\bigl( \bigl| e^{i S_{nn} t} \bigl( e^{t^2/2} - e^{W_{nn} t^2/2} \bigr) \bigr| > \varepsilon \bigr)
= P\bigl( \bigl| e^{t^2/2} - e^{W_{nn} t^2/2} \bigr| > \varepsilon \bigr) \to 0 ,
\]
so the integrand in the last integral above converges to 0 in probability. Furthermore recall that Wnn is bounded by 2, so the integrand must be bounded by e^{t²}. Then the integral
converges to 0 as n → ∞ because of Lemma 5.5.3. So if (5.24) is shown, we will obtain that φn(t)e^{t²/2} → 1 as n → ∞, which is equivalent to having obtained
\[ \varphi_n(t) \to e^{-t^2/2} \quad \text{as } n \to \infty , \]
and according to Theorem 3.4.20 we have thereby shown that Snn converges weakly to N(0, 1).
Define Qn0 = 1 and, for m = 1, . . . , n,
\[ Q_{nm} = e^{i S_{nm} t + W_{nm} t^2/2} , \qquad \tilde{Q}_{nm} = e^{i S_{n(m-1)} t + W_{nm} t^2/2} , \]
and we observe that
\[ Q_{nm} = e^{i X_{nm} t}\, \tilde{Q}_{nm} , \qquad Q_{n(m-1)} = e^{-V_{nm} t^2/2}\, \tilde{Q}_{nm} , \]
such that
\[ Q_{nn} - 1 = \sum_{m=1}^{n} \bigl( e^{i X_{nm} t} - e^{-V_{nm} t^2/2} \bigr) \tilde{Q}_{nm} . \]
Recall from the definitions, that both Sn (m−1) and Wnm are Fm−1 –measurable, such that
Q̃nm is Fm−1 –measurable as well. Then we can write
\[
\begin{aligned}
E(Q_{nn} - 1) &= \sum_{m=1}^{n} E\Bigl( \bigl( e^{i X_{nm} t} - e^{-V_{nm} t^2/2} \bigr) \tilde{Q}_{nm} \Bigr) \\
&= \sum_{m=1}^{n} E\Bigl( E\bigl( ( e^{i X_{nm} t} - e^{-V_{nm} t^2/2} ) \tilde{Q}_{nm} \bigm| \mathcal{F}_{m-1} \bigr) \Bigr) \\
&= \sum_{m=1}^{n} E\Bigl( E\bigl( e^{i X_{nm} t} - e^{-V_{nm} t^2/2} \bigm| \mathcal{F}_{m-1} \bigr) \tilde{Q}_{nm} \Bigr) .
\end{aligned}
\]
And if we apply the upper bounds for the remainder terms, we obtain
\[
\bigl| E\bigl( e^{i X_{nm} t} - e^{-V_{nm} t^2/2} \bigm| \mathcal{F}_{m-1} \bigr) \bigr|
= \bigl| E\bigl( r_1(X_{nm} t) \bigm| \mathcal{F}_{m-1} \bigr) - r_2(V_{nm} t^2) \bigr| ,
\]
where r1 and r2 denote the remainder terms in the Taylor expansions used above.
That the first sum above converges to 0, is simply the Lyapounov condition that is assumed
to be true in the lemma. Hence the proof will be complete, if we can show that
\[ \sum_{m=1}^{n} E V_{nm}^2 \to 0 , \]
which (because of the inequality (5.26)) will be the case, if we can show
\[ \max_{k=1,\dots,n} V_{nk} \sum_{m=1}^{n} V_{nm} \xrightarrow{P} 0 . \]
Since we have assumed that Σⁿ_{m=1} Vnm → 1 in probability, it will according to Theorem 3.3.3 be enough to show that
\[ \max_{k=1,\dots,n} V_{nk} \xrightarrow{P} 0 . \tag{5.27} \]
In order to show (5.27) we will utilize the fact that for each c > 0 there exists a d > 0 such that x² ≤ c + d|x|³ for all x ∈ R. So let c > 0 be some arbitrary number and find the corresponding d > 0. Then
\[
V_{nm} = E(X_{nm}^2 \mid \mathcal{F}_{m-1}) \le E\bigl( c + d|X_{nm}|^3 \bigm| \mathcal{F}_{m-1} \bigr)
= c + d\, E(|X_{nm}|^3 \mid \mathcal{F}_{m-1}) \le c + d \sum_{m=1}^{n} E(|X_{nm}|^3 \mid \mathcal{F}_{m-1}) ,
\]
and since this upper bound does not depend on m, we have the inequality
\[ \max_{m=1,\dots,n} V_{nm} \le c + d \sum_{m=1}^{n} E(|X_{nm}|^3 \mid \mathcal{F}_{m-1}) . \]
Taking expectations and using the Lyapounov condition of the lemma then yields lim sup_{n→∞} E max_{m=1,...,n} Vnm ≤ c. Since c > 0 was chosen arbitrarily, and the left hand side is independent of c, we must have (recall that it is non-negative)
\[ \lim_{n \to \infty} E \max_{m=1,\dots,n} V_{nm} = 0 . \]
And since E max_{m=1,...,n} Vnm = E| max_{m=1,...,n} Vnm − 0|, we actually have that
\[ \max_{m=1,\dots,n} V_{nm} \xrightarrow{L^1} 0 , \]
which in particular implies (5.27).
Theorem 5.5.5 (Brown). Let (Xnm) be a triangular array of real random variables with third order moment. Assume that there exists a filtration (Fn)n≥1 that makes each row in the array a martingale difference. Assume furthermore that
\[ \sum_{m=1}^{n} E(X_{nm}^2 \mid \mathcal{F}_{m-1}) \xrightarrow{P} 1 \quad \text{as } n \to \infty \tag{5.28} \]
and that
\[ \sum_{m=1}^{n} E\bigl( |X_{nm}|^3 \bigm| \mathcal{F}_{m-1} \bigr) \xrightarrow{P} 0 \quad \text{as } n \to \infty . \]
Then
\[ S_{nn} \xrightarrow{D} N(0,1) \quad \text{as } n \to \infty . \]
Proof. Most of the work is already done in Lemma 5.5.4; we only need to use a little bit of martingale technology in order to reduce the general setting to the situation in the lemma. Analogous to the Wnm-variables from before we define the cumulated third order moments within each row,
\[ Z_{nm} = \sum_{k=1}^{m} E\bigl( |X_{nk}|^3 \bigm| \mathcal{F}_{k-1} \bigr) . \]
Now define the truncated array
\[ X_{nm}^{*} = X_{nm}\, 1_{(W_{nm} \le 2,\, Z_{nm} \le 1)} \quad \text{for } n \ge 1,\ m = 1, \dots, n . \]
It is not important exactly which upper limit is chosen above for the Z-variables: any strictly positive upper limit would give the same results as 1. The trick will be to see that the triangular array (X*nm) consisting of the star variables fulfils the conditions from Lemma 5.5.4.
Note that since both Wnm and Znm are Fm−1-measurable, the indicator function will be Fm−1-measurable. Hence each X*nm must be Fm-measurable. Furthermore (using that (Xnm) is a martingale difference array)
\[ E|X_{nm}^{*}| = E\bigl| X_{nm}\, 1_{(W_{nm} \le 2,\, Z_{nm} \le 1)} \bigr| \le E|X_{nm}| < \infty \]
and also
\[ E(X_{nm}^{*} \mid \mathcal{F}_{m-1}) = E(X_{nm} \mid \mathcal{F}_{m-1})\, 1_{(W_{nm} \le 2,\, Z_{nm} \le 1)} = 0 . \]
Altogether this shows that (X*nm) is a martingale difference array. We will define the variables V*nm and W*nm for (X*nm) similarly to the variables Vnm and Wnm for (Xnm):
\[ V_{nm}^{*} = E\bigl( (X_{nm}^{*})^2 \bigm| \mathcal{F}_{m-1} \bigr) , \qquad W_{nm}^{*} = \sum_{k=1}^{m} V_{nk}^{*} . \]
Then
\[ (W_{nn} \le 2,\, Z_{nn} \le 1) \subseteq (W_{nn}^{*} = W_{nn}) \tag{5.30} \]
because, since both Wnk and Znk are increasing in k, all the indicator functions 1_{(Wnk≤2, Znk≤1)} are 1 on the set (Wnn ≤ 2, Znn ≤ 1). Since Wnn → 1 and Znn → 0 in probability it holds that P(Wnn ≤ 2) = P(|Wnn − 1| ≤ 1) → 1 and P(Znn ≤ 1) = P(|Znn − 0| ≤ 1) → 1. Hence also P(Wnn ≤ 2, Znn ≤ 1) → 1. Combining this with (5.30) yields, for every ε > 0,
\[ 1 \ge P(|W_{nn}^{*} - W_{nn} - 0| \le \varepsilon) \ge P(W_{nn}^{*} = W_{nn}) \ge P(W_{nn} \le 2,\, Z_{nn} \le 1) \to 1 , \]
which shows that W*nn − Wnn → 0 in probability. Then
\[ W_{nn}^{*} = (W_{nn}^{*} - W_{nn}) + W_{nn} \xrightarrow{P} 0 + 1 = 1 . \]
To be able to apply Lemma 5.5.4 to the triangular array (X*nm) we still need to show that the array satisfies the unconditional Lyapounov condition (5.23). For this define
\[
Z_{nn}^{*} = \sum_{k=1}^{n} E\bigl( |X_{nk}^{*}|^3 \bigm| \mathcal{F}_{k-1} \bigr)
= \sum_{k=1}^{n} E\bigl( |X_{nk}|^3 \bigm| \mathcal{F}_{k-1} \bigr) 1_{(W_{nk} \le 2,\, Z_{nk} \le 1)} .
\]
It is obvious that Z*nn ≤ 1, and using that all terms in Znm are non-negative, such that Znm increases for m = 1, . . . , n, we also see (like above) that Z*nn ≤ Znn. The assumption Znn → 0 in probability then implies, for every ε > 0,
\[ P(|Z_{nn}^{*} - 0| > \varepsilon) = P(Z_{nn}^{*} > \varepsilon) \le P(Z_{nn} > \varepsilon) \to 0 , \]
so Z*nn → 0 in probability. The fact that 0 ≤ Z*nn ≤ 1 for all n makes (Z*nn) uniformly integrable: for x > 1,
\[ \sup_{n \in \mathbb{N}} \int_{(Z_{nn}^{*} \ge x)} Z_{nn}^{*} \, dP = \int_{\emptyset} Z_{nn}^{*} \, dP = 0 . \]
So Theorem 5.4.5 gives that Z*nn → 0 in L¹. Hence E(Z*nn) = E|Z*nn − 0| → 0, such that
\[
\sum_{k=1}^{n} E|X_{nk}^{*}|^3
= \sum_{k=1}^{n} E\Bigl( E\bigl( |X_{nk}^{*}|^3 \bigm| \mathcal{F}_{k-1} \bigr) \Bigr)
= E\Bigl( \sum_{k=1}^{n} E\bigl( |X_{nk}^{*}|^3 \bigm| \mathcal{F}_{k-1} \bigr) \Bigr)
= E(Z_{nn}^{*}) \to 0 .
\]
Summarising the results so far, we have shown that all conditions from Lemma 5.5.4 are
satisfied for the martingale difference array (X*nm). Then the lemma gives that
\[ \sum_{m=1}^{n} X_{nm}^{*} \xrightarrow{D} N(0,1) . \]
We have already argued that on the set (Wnn ≤ 2, Znn ≤ 1) all the indicator functions 1_{(Wnk≤2, Znk≤1)} are 1. Hence
\[ X_{nm}^{*} = X_{nm} \quad \text{for all } m = 1, \dots, n \ \text{ on } (W_{nn} \le 2,\, Z_{nn} \le 1) , \]
so also
\[ \sum_{m=1}^{n} X_{nm}^{*} = \sum_{m=1}^{n} X_{nm} \quad \text{on } (W_{nn} \le 2,\, Z_{nn} \le 1) , \]
so, since P(Wnn ≤ 2, Znn ≤ 1) → 1,
\[ \sum_{m=1}^{n} X_{nm} - \sum_{m=1}^{n} X_{nm}^{*} \xrightarrow{P} 0 . \]
According to Theorem 3.3.3 this gives that also Snn converges in distribution to N(0, 1), which completes the proof.
By some work it is possible to replace the third order conditions by some Lindeberg–inspired
conditions. It is sufficient that all the X–variables have second order moment, satisfy (5.28),
and fulfil
\[ \sum_{m=1}^{n} E\bigl( X_{nm}^2\, 1_{(|X_{nm}| > c)} \bigm| \mathcal{F}_{m-1} \bigr) \xrightarrow{P} 0 \quad \text{as } n \to \infty , \tag{5.31} \]
for all c > 0 in order for the conclusion in Brown’s Theorem to be maintained.
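Brown's theorem can be illustrated with a genuinely dependent array (a toy example of our own, not from the text): take Xnm = ε_{m−1}ε_m/√n with ε_m iid ±1 and ε_0 = 1. Each term depends on its neighbours, yet E(Xnm | Fm−1) = ε_{m−1}E(ε_m)/√n = 0 and the compensator is exactly 1, so the row sums should again look standard normal.

```python
import math
import random

def dependent_row_sum(n, rng):
    """S_nn for X_nm = eps_{m-1} * eps_m / sqrt(n), eps_m iid +-1, eps_0 = 1.

    E(X_nm | F_{m-1}) = eps_{m-1} * E(eps_m) / sqrt(n) = 0, and the
    compensator is W_nn = sum_m eps_{m-1}^2 / n = 1, so condition (5.28)
    holds exactly even though neighbouring terms are dependent."""
    prev, total = 1, 0
    for _ in range(n):
        eps = rng.choice((-1, 1))
        total += prev * eps
        prev = eps
    return total / math.sqrt(n)

rng = random.Random(3)
samples = [dependent_row_sum(400, rng) for _ in range(4000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # empirical mean near 0 and variance near 1
```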
5.6 Exercises
All random variables in the following exercises are assumed to be real valued.

Exercise 5.1. Characterise the mean function n ↦ E(Xn) if (Xn, Fn) is
(1) a martingale.
(2) a submartingale.
(3) a supermartingale.
Show that a submartingale is a martingale, if and only if the mean function is constant. ◦
Exercise 5.2. Let (Fn ) be a filtration on (Ω, F) and assume that τ and σ are stopping
times. Show that τ ∨ σ and τ + σ are stopping times. ◦
Exercise 5.3. Let X1 , X2 , . . . be independent and identically distributed real random vari-
ables such that EX1 = 0 and V X1 = σ 2 . Let (Fn ) be the filtration on (Ω, F) defined by
Fn = F(X1 , . . . , Xn ). Define
\[ Y_n = \Bigl( \sum_{k=1}^{n} X_k \Bigr)^{2} , \qquad Z_n = Y_n - n\sigma^2 . \]
Exercise 5.4. Assume that (Xn , Fn ) is an adapted sequence, where each Xn is a real valued
random variable. Let A ∈ B. Define τ : Ω → N ∪ {∞} by
Exercise 5.5. Let (Fn ) be a filtration on (Ω, F). Let τ : Ω → N∪{∞} be a random variable.
Show that τ is a stopping time if and only if there exists a sequence of sets (Fn ), such that
Fn ∈ Fn for all n ∈ N and
τ (ω) = inf{n ∈ N : ω ∈ Fn }
Exercise 5.6. Let (Fn ) be a filtration on (Ω, F) and consider a sequence of sets (Fn ) where
for each n Fn ∈ Fn . Let σ be a stopping time and define
Exercise 5.7.
(1) Assume that (Xn , Fn ) is an adapted sequence. Show that (Xn , Fn ) is a martingale if
and only if
E(Xk |Fτ ) = Xτ a.s.
E(Xτ ) = E(X1 )
◦
5.6 Exercises 177
(2) Define
τ = inf{n ∈ N : Sn ≥ 1}
Exercise 5.10. Assume that X1 , X2 , . . . are independent and identically distributed random
variables with
P (Xn = 1) = p P (Xn = −1) = 1 − p ,
Sn = X1 + · · · + Xn
(1) Let r = (1 − p)/p and show that E(Mn) = 1 for all n ≥ 1, where
\[ M_n = r^{S_n} . \]
τ = inf{n ∈ N | Sn = a or Sn = b}
(6) Show that P (τ < ∞) = 1, and realise that P (Sτ ∈ {a, b}) = 1.
Exercise 5.11. The purpose of this exercise is to show that for a random variable X with EX² < ∞ and a sub σ-algebra D of F we have the following version of Jensen's inequality for conditional expectations:
\[ E(X \mid \mathcal{D})^2 \le E(X^2 \mid \mathcal{D}) \quad \text{a.s.} \]
(1) Show that x2 − y 2 ≥ 2y(x − y) for all x, y ∈ R and show that for all n ∈ N it holds that
where Dn = (|E(X|D)| ≤ n). Show that both the left hand side and the right hand
side are integrable.
\[ E X_{\tau \wedge n}^2 \le E X_n^2 \]
Exercise 5.13 (Doob's Inequality). Assume that (Yn, Fn)n∈N is a submartingale. Let t > 0
be a given constant.
τ = inf{k ∈ N : Yk ≥ t}
Exercise 5.14.
(1) Assume that X1 , X2 , . . . are random variables with each Xn ≥ 0 and E|Xn | < ∞, such
that Xn → 0 a.s. and EXn → 0. Show that (Xn ) is uniformly integrable.
(2) Find a sequence X1 , X2 , . . . of random variables on (Ω, F, P ) = ([0, 1], B, λ) such that
Exercise 5.15. Let X be a random variable with E|X| < ∞. Let G be a collection of sub σ-algebras of F. In this exercise we shall show that the following family of random variables
(E(X|D))D∈G
is uniformly integrable.
Exercise 5.16. Assume that (Ω, F, (Fn ), P ) is a filtered probability space. Let τ be a
stopping time with Eτ < ∞. Assume that (Xn , Fn ) is a martingale.
(3) Assume that (Xτ ∧n )n∈N is uniformly integrable. Show that EXτ = EX1 .
(4) Assume that a random variable Y exists such that E|Y | < ∞ and |Xτ ∧n | ≤ |Y | a.s.
for all n ∈ N. Show that (Xτ ∧n ) is uniformly integrable.
In the rest of the exercise you can use without proof that
\[ |X_{\tau \wedge n}| \le |X_1| + \sum_{m=1}^{\infty} 1_{(\tau > m)} |X_{m+1} - X_m| \tag{5.32} \]
for all n ∈ N.
\[ E\bigl( |X_{n+1} - X_n| \bigm| \mathcal{F}_n \bigr) \le B \quad \text{a.s.} \]
where ξ = EY1 .
Let σ be a stopping time (with respect to the filtration (Gn )) such that Eσ < ∞.
Exercise 5.17. Assume that X1 , X2 , . . . are independent random variables such that for
each n it holds Xn ≥ 0 and EXn = 1. Define Fn = F(X1 , . . . , Xn ) and
\[ Y_n = \prod_{k=1}^{n} X_k . \]
(2) Show that Y = limn→∞ Yn exists almost surely with E|Y | < ∞.
Exercise 5.18. Let X1 , X2 , . . . be a sequence of real random variables with E|Xn | < ∞ for
all n ∈ N. Assume that X is another random variable with E|X| < ∞. The goal of this
exercise is to show
\[ X_n \xrightarrow{L^1} X \quad \text{if and only if} \quad E|X_n| \to E|X| \ \text{ and } \ X_n \xrightarrow{P} X . \]
(1) Assume that Xn → X in L¹. Show that E|Xn| → E|X| and Xn → X in probability.
Let U1, U2, . . . and V, V1, V2, . . . be two sequences of random variables such that E|V| < ∞ and, for all n ∈ N, E|Vn| < ∞ and |Un| ≤ Vn, and such that
\[ V_n \xrightarrow{a.s.} V \quad \text{and} \quad E V_n \to E V \quad \text{as } n \to \infty . \]
(2) Apply Fatou’s lemma on the sequence (Vn − |Un |) to show that
\[ \liminf_{n \to \infty} b_n = b \]
and
(3) Use (2) to show that if E|Xn| → E|X| and Xn → X almost surely, then Xn → X in L¹.

(4) Assume that E|Xn| → E|X| and Xn → X in probability. Show that Xn → X in L¹.
Now let (Yn, Fn) be a martingale. Assume that a random variable Y exists with E|Y| < ∞, such that Yn → Y in probability.

(5) Assume that E|Yn| → E|Y|. Show that Yn → Y almost surely.
(6) Show that Y closes the martingale if and only if E|Yn | → E|Y |.
Exercise 5.19. Consider the gambling strategy discussed in Section 5.1: Let Y1 , Y2 , . . . be
independent and identically distributed random variables with P(Yn = 1) = p and P(Yn = −1) = 1 − p, where 0 < p < 1/2. We think of Yn as the result of a game, where the probability of
winning is p, and where if you bet 1 dollar, you will receive 1 dollar if you win, and lose the 1
dollar, if you lose the game. We consider the sequence of strategies where the bet is doubled
for each lost game, and when a game finally is won, the bet is reset to 1. That is defining
the sequence of strategies (φn ) such that
φ1 = 1
Yn φn (Y1 , . . . , Yn−1 )
If e.g. we lose the first three games and win the fourth, then
X1 = −1, X2 = −1 − 2, X3 = −1 − 2 − 22 , X4 = −1 − 2 − 22 + 23 = 1
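The doubling strategy just described can be sketched in code (an illustrative simulation of our own, not from the text; the `FixedOutcomes` helper is hypothetical test scaffolding that forces a chosen win/loss sequence):

```python
import random

def play_doubling(p, n_games, rng):
    """Simulate the doubling strategy: bet 1, double the bet after each
    lost game, reset the bet to 1 after each won game.  Returns the
    accumulated winnings X_n after each game."""
    bet, x, path = 1, 0, []
    for _ in range(n_games):
        win = rng.random() < p
        if win:
            x += bet
            bet = 1
        else:
            x -= bet
            bet *= 2
        path.append(x)
    return path

class FixedOutcomes:
    """Stand-in for a random source that forces a given win/loss sequence."""
    def __init__(self, outcomes):
        self.outcomes = iter(outcomes)
    def random(self):
        # return a value below p exactly when the next outcome is a win
        return 0.0 if next(self.outcomes) else 1.0

# Reproduce the worked example: lose, lose, lose, win.
path = play_doubling(0.4, 4, FixedOutcomes([False, False, False, True]))
print(path)  # [-1, -3, -7, 1]
```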
(1) Show that (Xn , Fn ) is a true supermartingale (meaning a supermartingale that is not
a martingale).
τ1 = inf{n ∈ N | Yn = 1}
τk+1 = inf{n > τk | Yn = 1}
(3) Realise that Xτk = k for all k ∈ N and conclude that (Xτk , Fτk ) is a true submartingale.
Hence we have stopped a true supermartingale and obtained a true submartingale!! In the
next questions we shall compare that result to Theorem 5.4.9.
(4) See that on the set (τ1 > n) we must have Xn = −Σⁿ_{k=1} 2^{k−1} = 1 − 2ⁿ and show that
\[ \int_{(\tau_1 > n)} X_n \, dP = q^n - (2q)^n \to -\infty \quad \text{as } n \to \infty , \]
where q = 1 − p.
Now assume that we change the strategy sequence (φn ) in such a way, that we limit our
betting in order to avoid Xn < −7. Hence we always have Xn ≥ −7. Since all bettings are
non–negative we still have, that (Xn , Fn ) is a supermartingale.
(6) Let (σk ) be an increasing sequence of stopping times. Show that (Xσk , Fσk ) is a super-
martingale.
Exercise 5.20. The purpose of this exercise is to show the following theorem:
Let (Xn) be a martingale and assume that for some p > 1 it holds that sup_{n≥1} E|Xn|^p < ∞. Then Xn converges almost surely and in L^p.

(1) Assume that sup_{n≥1} E|Xn|^p < ∞. Show that there exists X such that Xn → X almost surely and E|X| < ∞.
(2) Assume that both sup_{n≥1} E|Xn|^p < ∞ and E sup_n |Xn|^p < ∞. Show that E|X|^p < ∞ (with X from (1)) and Xn → X in L^p.
In the rest of the exercise we shall show that E supn |Xn |p < ∞ under the assumption that
supn≥1 E|Xn |p < ∞.
Define
Sn = X1 + · · · + Xn
τa,b = inf{n ∈ N : Sn = a or Sn = b}
Defining S0 = 0.
(3) Show that P (τa,b < ∞) = 1 and conclude that P (Sτa,b ∈ {a, b}) = 1.
(4) Show that ESτa,b ∧n = 0 for all n ∈ N and conclude that ESτa,b = 0.
T∞
(7) Show that P (F ) = 1, where F = n=1 (τ−n,b < ∞).
\[ P\Bigl( \inf_{n \ge 1} S_n = -\infty \Bigr) = 1 \]
Exercise 5.22. Let (Ω, F, Fn , P ) be a filtered probability space, and let (Yn )n≥1 and (Zn )n≥1
be two adapted sequences of real random variables. Define furthermore Z0 ≡ 1. Assume that
Y1 , Y2 , . . . are independent and identically distributed with E|Y1 |3 < ∞ and EY1 = 0. Assume
furthermore that for all n ≥ 2 it holds that Yn is independent of Fn−1 . Finally, assume that
E|Zn |3 < ∞ for all n ∈ N. Define for all n ∈ N
\[ M_n = \sum_{m=1}^{n} Z_{m-1} Y_m . \]
(6) Show that N∞ = limn→∞ Nn exists almost surely and in L1 . Find EN∞ .
EYi−1 Yi Yj−1 Yj = 0
with the definition Y0 ≡ 1. In the following questions you can use Kronecker's Lemma, which is a purely mathematical result: If (xn) is a sequence of real numbers such that lim_{n→∞} Σⁿ_{k=1} x_k = s exists, and if 0 < b1 ≤ b2 ≤ · · · with bn → ∞, then
\[ \lim_{n \to \infty} \frac{1}{b_n} \sum_{k=1}^{n} b_k x_k = 0 . \]
and
\[ \frac{1}{n^{3/2}} \sum_{k=1}^{n} |Y_k|^3 \xrightarrow{a.s.} 0 . \]
◦
Chapter 6

The Brownian motion
The first attempt to define the stochastic process which is now known as the Brownian motion
was made by the Frenchman Bachelier, who at the end of the 19th century tried to give a
statistical description of the random price fluctuations on the stock exchange in Paris. Some
years later, a variation of the Brownian motion appears in Einstein's celebrated 1905 work on the motion of small particles suspended in a liquid,
but the first precise mathematical definition is due to Norbert Wiener (1923) (which explains
the name one occasionally sees: the Wiener process). The Frenchman Paul Lévy explored
and discovered some of the fundamental properties of Brownian motion and since that time
thousands of research papers have been written concerning what is unquestionably the most
important of all stochastic processes.
Brown himself contributed only his name to the theory of the process: he was a botanist
and in 1828 observed the seemingly random motion of flower pollen suspended in water,
where the pollen grains constantly changed direction, a phenomenon he explained as being
caused by the collision of the microscopic pollen grains with water molecules.
So far the largest collections of random variables under study have been sequences indexed by N. In this chapter we study stochastic processes indexed by [0, ∞). In Section 6.1 we discuss how to define such processes indexed by [0, ∞), and we furthermore define and show the existence of the important Brownian motion. In the following sections we study the behaviour
of the so–called sample paths of the Brownian motion. In Section 6.2 we prove that there
exists a continuous version, and in the remaining sections we study how well–behaved the
sample paths are – apart from being continuous.
192 The Brownian motion
We begin with a brief presentation of some definitions and results from the general theory of
stochastic processes.
Definition 6.1.1. A stochastic process in continuous time is a family X = (Xt )t≥0 of real
random variables, defined on a probability space (Ω, F, P ).
In Section 2.3 we regarded a sequence (Xn )n≥1 of real random variables as a random variable
with values in R∞ equipped with the σ–algebra B∞ . Similarly we will regard a stochastic
process X in continuous time as having values in the space R[0,∞) consisting of all functions
x : [0, ∞) → R. The next step is to equip R^{[0,∞)} with a σ-algebra. For this, define the coordinate projections X̂t by
\[ \hat{X}_t(x) = x_t \quad \text{for } x \in \mathbb{R}^{[0,\infty)},\ t \ge 0 . \]
Definition 6.1.2. Let B[0,∞) denote the smallest σ–algebra that makes X̂t (B[0,∞) − B)
measurable for all t ≥ 0.
Then we have

Lemma 6.1.3. A map X : Ω → R^{[0,∞)} is F − B_{[0,∞)} measurable if and only if X̂t ◦ X is F − B measurable for all t ≥ 0.
Proof. The proof will be identical to the proof of Lemma 2.3.3: If X is F − B[0,∞) measurable
we can use that X̂t by definition is B[0,∞) − B measurable, so the composition is F − B
measurable. Conversely, assume that X̂t ◦ X is F − B measurable for all t ≥ 0. To show that
X is F − B[0,∞) measurable, it suffices to show that X −1 (A) ∈ F for all A in the generating
system H = {X̂t−1 (B) | t ≥ 0, B ∈ B} for B[0,∞) . But for any t ≥ 0 and B ∈ B we have
X −1 (X̂t−1 (B)) = (X̂t ◦ X)−1 (B) ∈ F by our assumptions.
Lemma 6.1.4. Let X = (Xt )t≥0 be a stochastic process. Then X is F − B[0,∞) measurable.
Proof. Note that X̂t ◦ X = Xt and Xt is F − B measurable by assumption. The result follows
from Lemma 6.1.3.
6.1 Definition and existence 193
If X = (Xt)t≥0 is a stochastic process, we can consider the distribution X(P) on (R^{[0,∞)}, B_{[0,∞)}).
For determining such a distribution, the following lemma will be useful.

Lemma 6.1.5. Let H denote the collection of subsets of R^{[0,∞)} of the form
\[ \bigl( (\hat{X}_{t_1}, \dots, \hat{X}_{t_n}) \in B_n \bigr) , \quad n \in \mathbb{N},\ 0 \le t_1 < \dots < t_n,\ B_n \in \mathcal{B}_n . \]
Then H is stable under finite intersections and σ(H) = B_{[0,∞)}.

Proof. It is immediate (but notationally heavy) to see that H is stable under finite intersections. Let F = ((X̂t1, . . . , X̂tn) ∈ Bn) ∈ H and note that the vector (X̂t1, . . . , X̂tn) is B_{[0,∞)} − Bn measurable, so F ∈ B_{[0,∞)}. Therefore H ⊆ B_{[0,∞)}, so also σ(H) ⊆ B_{[0,∞)}. For the
converse inclusion, note that for all t ≥ 0 and B ∈ B it holds that X̂t−1 (B) = (X̂t ∈ B) ∈ H,
so each coordinate projection must be σ(H) − B measurable. As B[0,∞) is the smallest σ–
algebra with this property, we conclude that B[0,∞) ⊆ σ(H). All together we have the desired
result B[0,∞) = σ(H).
If X is a real stochastic process with distribution P̂, then P^{(n)}_{t1...tn} given by (6.1) is the distribution of (Xt1, . . . , Xtn), and the class (P^{(n)}_{t1...tn}) is called the class or family of finite-dimensional distributions for X.
From Lemma 6.1.5 and Theorem A.2.4, it follows that a probability P̂ on (R[0,∞) , B[0,∞) )
is uniquely determined by the finite-dimensional distributions. The main result concerning
the construction of stochastic processes, Kolmogorov’s consistency theorem, gives a simple
condition for when a given class of finite-dimensional distributions is the class of finite-
dimensional distributions for one (and necessarily only one) probability on (R[0,∞) , B [0,∞) ).
A family of finite-dimensional distributions must always fulfil the following consistency condition: for all n ∈ N, 0 ≤ t1 < . . . < tn+1 and all k, 1 ≤ k ≤ n + 1, we have
\[ P^{(n)}_{t_1 \dots t_{k-1} t_{k+1} \dots t_{n+1}} = \pi_k \bigl( P^{(n+1)}_{t_1 \dots t_{n+1}} \bigr) , \tag{6.2} \]
where πk denotes the projection of R^{n+1} onto R^n that deletes the k'th coordinate. If X = (Xt)t≥0 has distribution P̂ with finite-dimensional distributions (P^{(n)}_{t1...tn}), then (6.2)
merely states that the distribution of (Xt1 , . . . , Xtk−1 , Xtk+1 , . . . , Xtn+1 ), is the marginal dis-
tribution in the distribution of (Xt1 , . . . , Xtn+1 ) which is obtained by excluding Xtk .
We shall use the consistency theorem to prove the existence of a Brownian motion, which is defined as follows.
Definition 6.1.7. A real stochastic process X = (Xt )t≥0 defined on a probability space
(Ω, F, P ) is a Brownian motion with drift ξ ∈ R and variance σ 2 > 0, if the following three
conditions are satisfied
(1) P (X0 = 0) = 1.
(2) For all 0 ≤ s < t the increments Xt − Xs are normally distributed N((t − s)ξ, (t − s)σ²).
(3) The increments Xt1 = Xt1 − X0 , Xt2 − Xt1 , . . . , Xtn − Xtn−1 are for all n ∈ N and
0 ≤ t1 < · · · < tn mutually independent.
Theorem 6.1.9. For any ξ ∈ R and σ 2 > 0 there exists a Brownian motion with drift ξ and
variance σ 2 .
Proof. We shall use Kolmogorov’s consistency theorem. The finite dimensional distributions
for the Brownian motion are determined by (1)–(3):
Let 0 ≤ t1 < · · · < tn+1 . Then we know that
We have shown that the finite-dimensional distributions of a Brownian motion with drift ξ
and variance σ 2 are given by
\[
P^{(n+1)}_{t_1, \dots, t_{n+1}} = N\!\left(
\begin{pmatrix} t_1 \xi \\ t_2 \xi \\ t_3 \xi \\ \vdots \\ t_{n+1} \xi \end{pmatrix} ,
\begin{pmatrix}
t_1 \sigma^2 & t_1 \sigma^2 & t_1 \sigma^2 & \cdots & t_1 \sigma^2 \\
t_1 \sigma^2 & t_2 \sigma^2 & t_2 \sigma^2 & \cdots & t_2 \sigma^2 \\
t_1 \sigma^2 & t_2 \sigma^2 & t_3 \sigma^2 & \cdots & t_3 \sigma^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_1 \sigma^2 & t_2 \sigma^2 & t_3 \sigma^2 & \cdots & t_{n+1} \sigma^2
\end{pmatrix}
\right) .
\]
Finding πk(P^{(n+1)}_{t1...tn+1}) (cf. (6.2)) is now simple: the result is an n-dimensional normal distribution, where the mean vector is obtained by deleting the k'th entry in the mean vector for P^{(n+1)}_{t1...tn+1}, and the covariance matrix is obtained by deleting the k'th row and the k'th column of the covariance matrix for P^{(n+1)}_{t1...tn+1}. It is immediately seen that we thus obtain P^{(n)}_{t1...tk−1 tk+1...tn+1}, so by the consistency theorem there is exactly one probability P̂ on (R^{[0,∞)}, B_{[0,∞)}) with finite-dimensional distributions given by the normal distributions above. With this probability measure P̂, the process consisting of all the coordinate projections X̂ = (X̂t)t≥0 becomes a Brownian motion with drift ξ and variance σ².
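The finite-dimensional structure above suggests a direct way to simulate a Brownian motion at a fixed grid of times: cumulate independent normal increments, mirroring conditions (1)-(3) of Definition 6.1.7. The following sketch (our own illustration; function names and parameter values are our choices) also checks E X_t = ξt and V X_t = σ²t by Monte Carlo.

```python
import random

def brownian_path(times, xi, sigma2, seed=0):
    """Sample (X_{t_1}, ..., X_{t_n}) of a Brownian motion with drift xi
    and variance sigma2 by cumulating independent normal increments:
    X_t - X_s is N((t-s)*xi, (t-s)*sigma2), independent across steps."""
    rng = random.Random(seed)
    x, prev, path = 0.0, 0.0, []
    for t in times:
        dt = t - prev
        x += rng.gauss(xi * dt, (sigma2 * dt) ** 0.5)
        path.append(x)
        prev = t
    return path

# Monte Carlo check of E X_2 = 2*xi and V X_2 = 2*sigma2:
samples = [brownian_path([0.5, 1.0, 2.0], xi=1.0, sigma2=4.0, seed=s)[-1]
           for s in range(20000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var)  # near 2.0 and 8.0
```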
If X = (Xt)t≥0 is a Brownian motion with drift ξ and variance σ², and u ≥ 0, then the shifted process (Xt+u − Xu)t≥0 is a Brownian motion with drift ξ and variance σ² as well.

Proof. We will show that the two processes have the same finite-dimensional distributions. So let 0 ≤ t1 < · · · < tn. Then we show
\[ (X_{t_1}, \dots, X_{t_n}) \overset{D}{=} (X_{t_1+u} - X_u,\ X_{t_2+u} - X_u,\ \dots,\ X_{t_n+u} - X_u) . \tag{6.5} \]
The vector (Xt1, . . . , Xtn) is n-dimensional normally distributed, since it is a linear transformation of the increment vector
\[ (X_{t_1},\ X_{t_2} - X_{t_1},\ \dots,\ X_{t_n} - X_{t_{n-1}}) , \]
where the coordinates are independent and normally distributed. In the exact same way we can see that
\[ (X_{u+t_1} - X_u, \dots, X_{u+t_n} - X_u) \]
is n-dimensional normally distributed, since it is a linear transformation of
\[ (X_{u+t_1} - X_u,\ X_{u+t_2} - X_{u+t_1},\ \dots,\ X_{u+t_n} - X_{u+t_{n-1}}) \]
that has independent and normally distributed coordinates. So both of the vectors in
(6.5) are normally distributed. To see that the two vectors have the same mean vector and
covariance matrix, it suffices to show that for 0 ≤ s and 0 ≤ s1 ≤ s2,
\[ E X_s = E(X_{u+s} - X_u) \quad \text{and} \quad \mathrm{Cov}(X_{s_1}, X_{s_2}) = \mathrm{Cov}(X_{u+s_1} - X_u,\ X_{u+s_2} - X_u) . \]
We obtain
\[ E(X_{u+s} - X_u) = E X_{u+s} - E X_u = \xi(u+s) - \xi u = \xi s = E X_s \]
and, for 0 ≤ s1 ≤ s2,
\[
\begin{aligned}
\mathrm{Cov}(X_{u+s_1} - X_u,\ X_{u+s_2} - X_u)
&= \mathrm{Cov}(X_{u+s_1} - X_u,\ X_{u+s_1} - X_u + X_{u+s_2} - X_{u+s_1}) \\
&= \mathrm{Cov}(X_{u+s_1} - X_u,\ X_{u+s_1} - X_u) + \mathrm{Cov}(X_{u+s_1} - X_u,\ X_{u+s_2} - X_{u+s_1}) \\
&= V(X_{u+s_1} - X_u) = \sigma^2 s_1 = \mathrm{Cov}(X_{s_1}, X_{s_2}) .
\end{aligned}
\]
6.2 Continuity of the Brownian motion 197
In the previous section we saw how it is possible using Kolmogorov’s consistency theorem
to construct probabilities on the function space (R[0,∞) , B[0,∞) ). Thereby we also obtained a
construction of stochastic processes X = (Xt )t≥0 with given finite-dimensional distributions.
However, if one aims to construct processes (Xt ), which are well-behaved when viewed as
functions of t, the function space (R[0,∞) , B[0,∞) ) is much too large, as we shall presently see.
Let X = (Xt )t≥0 be a real process, defined on (Ω, F, P ). The sample paths of the process
are those elements
t → Xt (ω)
in R[0,∞) which are obtained by letting ω vary in Ω. One might then be interested in
determining whether (almost all) the sample paths are continuous, i.e., whether
P (X ∈ C[0,∞) ) = 1 ,
where C_{[0,∞)} ⊆ R^{[0,∞)} is the set of continuous x : [0, ∞) → R. The problem is that C_{[0,∞)} is not in B_{[0,∞)}!
We will show this by finding two B[0,∞) –measurable processes X and Y defined on the same
(Ω, F, P ) and with the same finite dimensional distributions, but such that all sample paths
for X are continuous, and all sample paths for Y are discontinuous in all t ≥ 0. The processes
X and Y are constructed in Example 6.2.1. The existence of such processes X and Y gives that
\[ (X \in C_{[0,\infty)}) = \Omega , \qquad (Y \in C_{[0,\infty)}) = \emptyset , \]
so if C_{[0,∞)} were in B_{[0,∞)}, then since X and Y have the same distribution on (R^{[0,∞)}, B_{[0,∞)}) we would have
\[ 1 = P(X \in C_{[0,\infty)}) = P(Y \in C_{[0,\infty)}) = 0 , \]
which is a contradiction!
Example 6.2.1. Let U be defined on (Ω, F, P ) and assume that U has the uniform distri-
bution on [0, 1].
Define
Xt (ω) = 0 for all ω ∈ Ω, t ≥ 0
and
\[
Y_t(\omega) =
\begin{cases}
0, & \text{if } U(\omega) - t \text{ is irrational} \\
1, & \text{if } U(\omega) - t \text{ is rational.}
\end{cases}
\]
Then all sample paths of X are continuous, and all sample paths of Y are discontinuous in every t ≥ 0. For all 0 ≤ t1 < · · · < tn we clearly have
\[ P(X_{t_1} = \dots = X_{t_n} = 0) = 1 . \]
Furthermore, for each fixed t ≥ 0 we have P(Yt = 1) = P(U − t ∈ Q) = 0, since Q is countable and U has a continuous distribution, so also
\[ P(Y_{t_1} = \dots = Y_{t_n} = 0) = 1 . \]
This shows that X and Y have the same finite dimensional distributions. ◦
Thus constructing a continuous process will take more than distributional arguments. In
the following we discuss a concrete approach that leads to the construction of a continuous
Brownian motion.
Definition 6.2.2. If the processes X = (Xt )t≥0 and Y = (Yt )t≥0 are both defined on
(Ω, F, P ), then we say that Y is a version of X if
P (Yt = Xt ) = 1
for all t ≥ 0.
If Y is a version of X, then X and Y have the same distribution on (R^{[0,∞)}, B_{[0,∞)}).

Proof. The idea is to show that Y and X have the same finite-dimensional distributions: With t1 < · · · < tn we have P(Ytk = Xtk) = 1 for k = 1, . . . , n. Then also
\[ P\bigl( (Y_{t_1}, \dots, Y_{t_n}) = (X_{t_1}, \dots, X_{t_n}) \bigr) = P\Bigl( \bigcap_{k=1}^{n} (Y_{t_k} = X_{t_k}) \Bigr) = 1 . \]
The aim is to show that there exists a continuous version of the Brownian motion. Define
for n ∈ N
Cn◦ = {x ∈ R[0,∞) : x is uniformly continuous on [0, n] ∩ Q}
and
C∞◦ = ⋂_{n=1}^{∞} Cn◦ .
Lemma 6.2.5. If x ∈ C∞◦ then there exists a uniquely determined continuous function y ∈ R[0,∞) such that yq = xq for all q ∈ [0, ∞) ∩ Q.
Proof. Let x ∈ C∞◦ and t ≥ 0. Then choose n such that n > t. We have that x ∈ Cn◦ , so x is
uniformly continuous on [0, n] ∩ Q. That means: for every ε > 0 there is a δ > 0 such that |xq̃ − xq | < ε whenever q, q̃ ∈ [0, n] ∩ Q with |q̃ − q| < δ.

Choose a sequence (qk ) ⊆ [0, n] ∩ Q with qk → t. Then in particular (qk ) is a Cauchy sequence.
The uniform continuity of x gives that (xqk ) is a Cauchy sequence as well: Let ε > 0 and find
the corresponding δ > 0. We can find K ∈ N such that for all m, n ≥ K it holds that
|qm − qn | < δ ,
and then |xqm − xqn | < ε. Hence (xqk ) converges, and we define yt = lim_{k→∞} xqk . The limit does not depend on the chosen sequence: if (q̃k ) is another rational sequence with q̃k → t, then
|q̃k − qk | → 0 as k → ∞ ,
so by the uniform continuity
|xq̃k − xqk | → 0 as k → ∞ .
For all t ∈ Q we see that yt = xt , since the continuity of x in t gives limk→∞ xqk = xt .
Finally we have that y is continuous at all t ≥ 0: Let t ≥ 0 and ε > 0 be given, and find
δ > 0 according to the uniform continuity. Now choose t′ with |t′ − t| < δ/2, and assume that
qk → t and qk′ → t′ with rational sequences. We can find K ∈ N such that |qk′ − qk | < δ for k ≥ K, so that |xqk′ − xqk | ≤ ε for k ≥ K. Letting k → ∞ gives
|yt′ − yt | ≤ ε .
The uniform continuity in Lemma 6.2.5 cannot be weakened to mere continuity: consider x given by
xt = 1[√2,∞) (t) .
Then x is continuous at every point of [0, n] ∩ Q for n ≥ 2, but no y exists with the required properties.
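The failure in this counterexample can be checked directly; the sketch below (plain Python, exact rational arithmetic, an illustrative construction) evaluates x along rational sequences approaching √2 from either side and obtains the two different limits 0 and 1, so no continuous function can agree with x on the rationals.

```python
from fractions import Fraction
from math import isqrt

def x(q):
    # x = 1_[sqrt(2), oo) evaluated exactly at a rational q:
    # a positive rational q satisfies q >= sqrt(2) iff q*q >= 2
    return 1 if (q > 0 and q * q >= 2) else 0

# rational sequences converging to the irrational point sqrt(2):
# below_k = floor(sqrt(2) * 10^k) / 10^k < sqrt(2) < below_k + 10^-k = above_k
below = [Fraction(isqrt(2 * 10 ** (2 * k)), 10 ** k) for k in range(1, 12)]
above = [q + Fraction(1, 10 ** k) for k, q in enumerate(below, start=1)]

print([x(q) for q in below])  # limit 0 along rationals from the left
print([x(q) for q in above])  # limit 1 along rationals from the right
```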
We obtain that C∞◦ ∈ B[0,∞) , since
Cn◦ = ⋂_{M=1}^{∞} ⋃_{N=1}^{∞} ⋂_{q1 ,q2 ∈[0,n]∩Q, |q1 −q2 |≤1/N} { x ∈ R[0,∞) : |xq2 − xq1 | < 1/M } ,
where the innermost intersection is countable, so Cn◦ , and hence C∞◦ , is a countable combination of sets of the form ((X̂q1 , X̂q2 ) ∈ B2 ).
Definition 6.2.6. A real process X = (Xt )t≥0 is continuous in probability if for all t ≥ 0 and all sequences (tk )k∈N with tk ≥ 0 and tk → t it holds that Xtk → Xt in probability.
Theorem 6.2.7. Let X = (Xt )t≥0 be a real process which is continuous in probability. If
P(X ∈ C∞◦ ) = 1 ,
then X has a continuous version.
Proof. Let F = (X ∈ C∞◦ ). Assume that ω ∈ F . According to Lemma 6.2.5 there exists a
uniquely determined continuous function t ↦ Yt (ω) such that
Yq (ω) = Xq (ω) for all q ∈ [0, ∞) ∩ Q . (6.6)
Furthermore, for each t ≥ 0 a rational sequence (qk ) can be chosen with qk → t. Then using the continuity of t ↦ Yt (ω) and the property in (6.6) yields that for all ω ∈ F
Yt (ω) = lim_{k→∞} Yqk (ω) = lim_{k→∞} Xqk (ω) .
Setting Yt (ω) = 0 for ω ∈ F c (a continuous function of t) we can therefore write, for every t ≥ 0,
Yt = lim_{k→∞} 1F Xqk ,
where (qk ) is a rational sequence with qk → t. Since all 1F Xqk are random variables (measurable), Yt is a random variable as well. And since t ≥ 0 was chosen arbitrarily, Y = (Yt )t≥0 is a continuous real process.
We still need to show that Y is a version of X. So let t ≥ 0 and find a rational sequence
(qk ) with qk → t. Since X is assumed to be continuous in probability we must have
Xqk → Xt in probability,
and since Yqk = Xqk a.s. for each k, it also holds that
Yqk → Xt in probability.
On the other hand, the continuity of Y gives Yqk → Yt pointwise, and hence also in probability. By the uniqueness of limits in probability we conclude
P(Yt = Xt ) = 1 ,
as desired.
Theorem 6.2.8. Let X = (Xt )t≥0 be a Brownian motion with drift ξ and variance σ 2 > 0.
Then X has a continuous version.
Proof. It is sufficient to consider the normalized case, where ξ = 0 and σ 2 = 1. For a general
choice of ξ and σ 2 we have that
X̃t = (Xt − ξt)/σ
is a normalized Brownian motion. And obviously, (Xt )t≥0 is continuous if and only if (X̃t )t≥0
is continuous.
So let X = (Xt )t≥0 be a normalized Brownian motion. Firstly, we show that X is continuous
in probability. For all 0 ≤ s < t we have
Xt − Xs ∼ N (0, t − s)
such that
(1/√(t − s)) (Xt − Xs ) ∼ N (0, 1) .
Then for ε > 0 we have
P( |Xt − Xs | > ε ) = P( (1/√(t − s)) |Xt − Xs | > ε/√(t − s) )
= ∫_{−∞}^{−ε/√(t−s)} (1/√(2π)) e^{−u²/2} du + ∫_{ε/√(t−s)}^{∞} (1/√(2π)) e^{−u²/2} du
= 2 ∫_{ε/√(t−s)}^{∞} (1/√(2π)) e^{−u²/2} du ,
which tends to 0 as s → t. Hence X is continuous in probability.
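The closed-form tail above can be compared with simulation; the seeded sketch below (the values of s, t and ε are illustrative) uses that the two-sided tail equals erfc(ε/√(2(t−s))).

```python
import math
import random

rng = random.Random(1)
s, t, eps = 0.3, 0.55, 0.4      # illustrative values
sd = math.sqrt(t - s)           # X_t - X_s ~ N(0, t - s) for a normalised BM

# Monte Carlo estimate of P(|X_t - X_s| > eps)
n = 200_000
p_mc = sum(abs(rng.gauss(0.0, sd)) > eps for _ in range(n)) / n

# the two-sided tail integral, written via the complementary error function
p_exact = math.erfc(eps / math.sqrt(2 * (t - s)))

print(p_mc, p_exact)   # the two agree up to Monte Carlo error
```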
The following fact, which is actually a stronger version of the continuity in probability, will be
useful. It holds for all ε > 0 that
lim_{h↓0} (1/h) P(|Xh | > ε) = 0 . (6.7)
To see this, note that
(1/h) P(|Xh | > ε) = (1/h) P(Xh⁴ > ε⁴) ≤ (1/(h ε⁴)) E Xh⁴ = (h/ε⁴) E( (1/√h) Xh )⁴ = 3h/ε⁴ ,
which has limit 0 as h → 0. In the inequality we used Markov's inequality, and in the last equality that (1/√h) Xh is N (0, 1)–distributed
and that the N (0, 1)–distribution has fourth moment equal to 3.
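The rate in (6.7) can also be seen numerically from the exact normal tail; in the sketch below (ε = 0.5 is an illustrative choice) the quantity (1/h)P(|Xh| > ε) is evaluated in closed form for decreasing h.

```python
import math

eps = 0.5
# For X_h ~ N(0, h): P(|X_h| > eps) = erfc(eps / sqrt(2h)), so (6.7) says
# that this tail vanishes faster than h itself as h decreases to 0.
vals = [math.erfc(eps / math.sqrt(2 * h)) / h for h in (0.1, 0.01, 0.001)]
print(vals)   # decreases very rapidly towards 0
```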
Secondly, by Theorem 6.2.7 it now suffices to show that
P(X ∈ Cn◦ ) = 1
for all n ∈ N, recalling that a countable intersection of sets with probability 1 has probability
1. We show this for n = 1 (higher values of n would not change the argument, only make
the notation more involved). Define
VN = sup{ |Xq′ − Xq | : q, q′ ∈ Q ∩ [0, 1], |q′ − q| ≤ 1/2^N } .
Then VN decreases as N → ∞, so
(X ∈ C1◦ ) = ⋂_{M=1}^{∞} ⋃_{N=1}^{∞} ⋂_{q1 ,q2 ∈[0,1]∩Q, |q2 −q1 |≤1/2^N} ( |Xq2 − Xq1 | ≤ 1/M )
= ⋂_{M=1}^{∞} ⋃_{N=1}^{∞} ( VN ≤ 1/M ) = ( lim_{N→∞} VN = 0 ) .
Since VN is decreasing, it therefore suffices to show that P(VN > ε) → 0 as N → ∞ for all ε > 0. Define for k = 1, . . . , 2^N
Yk,N = sup{ |Xs − X_{(k−1)/2^N} | : s ∈ Jk,N } ,
where
Jk,N = [ (k − 1)/2^N , k/2^N ] ∩ Q .
If we can show that
(1) VN ≤ 3 max{ Yk,N | 1 ≤ k ≤ 2^N }
and
(2) P(Yk,N > y) ≤ 2P( |X_{1/2^N} | > y ) for all y > 0 and k = 1, . . . , 2^N ,
then we obtain
P(VN > ε) ≤ P( max_{k=1,...,2^N} Yk,N > ε/3 ) = P( ⋃_{k=1}^{2^N} ( Yk,N > ε/3 ) )
≤ Σ_{k=1}^{2^N} P( Yk,N > ε/3 ) = 2^N P( Y1,N > ε/3 ) ≤ 2^{N+1} P( |X_{1/2^N} | > ε/3 ) ,
which has limit 0 as N → ∞ because of (6.7). The first inequality is according to (1), the
second inequality follows from Boole's inequality, and the last inequality is due to (2). Hence,
the proof is complete if we can show (1) and (2).
For (1): Consider for some fixed N ∈ N the q, q′ that are used in the definition of VN . Hence
q < q′ ∈ Q ∩ [0, 1] where |q′ − q| ≤ 1/2^N . We have two possibilities: either q and q′ belong to the same set Jk,N , in which case
|Xq′ − Xq | ≤ |Xq′ − X_{(k−1)/2^N} | + |Xq − X_{(k−1)/2^N} | ≤ 2Yk,N ≤ 2 max{ Yk,N | 1 ≤ k ≤ 2^N } ,
or q ∈ Jk−1,N and q′ ∈ Jk,N for some k, in which case
|Xq′ − Xq | ≤ |Xq′ − X_{(k−1)/2^N} | + |X_{(k−1)/2^N} − X_{(k−2)/2^N} | + |X_{(k−2)/2^N} − Xq |
≤ Yk,N + 2Yk−1,N ≤ 3 max{ Yk,N | 1 ≤ k ≤ 2^N } .
In both cases the right hand side does not depend on q, q′. Property (1) follows from taking the
supremum.
For (2): Note that for all k = 2, . . . , 2^N , the variable Yk,N is calculated from the process
(X_{(k−1)/2^N + s} − X_{(k−1)/2^N} )_{s≥0} in the exact same way as Y1,N is calculated from the process (Xs )s≥0 .
Also note that because of Lemma 6.1.10 the two processes
(X_{(k−1)/2^N + s} − X_{(k−1)/2^N} )_{s≥0} and (Xs )_{s≥0}
have the same distribution. Then also Yk,N =D Y1,N for all k = 2, . . . , 2^N , such that in
particular
particular
P (Yk,N > y) = P (Y1,N > y) ≤ 2P (|X 1N | > y)
2
for all y > 0. The inequality comes from Lemma 6.2.9 below, since J1,N is countable with
J1,N ⊆ [0, 21N ].
Lemma 6.2.9. Let X = (Xt )t≥0 be a normalised Brownian motion and let D ⊆ [0, t0 ] be an
at most countable set. Then it holds for x > 0 that
P( sup_{t∈D} Xt > x ) ≤ 2P(Xt0 > x) and P( sup_{t∈D} |Xt | > x ) ≤ 2P( |Xt0 | > x ) .
Proof. First assume that D is finite such that D = {t1 , . . . , tn }, where 0 ≤ t1 < · · · < tn ≤ t0 .
Define
τ = min{ k ∈ {1, . . . , n} | Xtk > x } ,
with the convention min ∅ = n. Then
P( sup_{t∈D} Xt > x ) = Σ_{k=1}^{n−1} P(τ = k) + P(τ = n, Xtn > x) .
Let k ≤ n − 1. Then
(τ = k) = ⋂_{j=1}^{k−1} (Xtj ≤ x) ∩ (Xtk > x) ,
so (τ = k) is determined by (Xt1 , . . . , Xtk ), while Xtn − Xtk is independent of (Xt1 , . . . , Xtk ) by the independence of the increments. Hence
(τ = k) ⊥⊥ (Xtn − Xtk ) .
Since Xtn − Xtk is N (0, tn − tk )–distributed, P(Xtn − Xtk ≥ 0) ≥ 1/2, and on (τ = k) we have Xtk > x, so
P(τ = k) ≤ 2P(τ = k, Xtn − Xtk ≥ 0) ≤ 2P(τ = k, Xtn > x) .
Thereby
P( sup_{t∈D} Xt > x ) = Σ_{k=1}^{n−1} P(τ = k) + P(τ = n, Xtn > x)
≤ 2 Σ_{k=1}^{n−1} P(τ = k, Xtn > x) + 2P(τ = n, Xtn > x)
≤ 2P(Xtn > x) ≤ 2P(Xt0 > x) .
In the last inequality it is used that tn ≤ t0 , such that Xt0 has a larger variance than Xtn
(both variables have mean 0), and therefore P(Xtn > x) ≤ P(Xt0 > x) for x > 0.
Thereby we have shown the first result in the case where D is finite. To obtain the second
result for a finite D, consider the process −X = (−Xt )t≥0 , which is again a normalised
Brownian motion. Hence for x > 0 we have
P( inf_{t∈D} Xt < −x ) = P( sup_{t∈D} (−Xt ) > x ) ≤ 2P(−Xt0 > x) = 2P(Xt0 < −x) ,
so we obtain
P( sup_{t∈D} |Xt | > x ) = P( ( sup_{t∈D} Xt > x ) ∪ ( inf_{t∈D} Xt < −x ) )
≤ 2P(Xt0 > x) + 2P(Xt0 < −x) = 2P( |Xt0 | > x ) .
For a general D find a sequence (Dn ) of finite subsets of D where Dn ↑ D. Then the two
inequalities hold for each Dn . Since furthermore
( sup_{t∈Dn} Xt > x ) ↑ ( sup_{t∈D} Xt > x ) and ( sup_{t∈Dn} |Xt | > x ) ↑ ( sup_{t∈D} |Xt | > x ) ,
the inequalities for D follow by continuity of the probability measure from below.
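The finite-grid inequality of the lemma can be sanity-checked by simulation; the seeded sketch below (grid size, level and sample count are illustrative choices) compares a Monte Carlo estimate of P(sup_{t∈D} Xt > x) with 2P(X_{t0} > x).

```python
import math
import random

rng = random.Random(3)
t0, x, m, n = 1.0, 1.2, 64, 40_000   # finite grid D of m points in [0, t0]
dt = t0 / m

count_sup = count_end = 0
for _ in range(n):
    level = running_max = 0.0
    for _ in range(m):
        level += math.sqrt(dt) * rng.gauss(0.0, 1.0)   # BM increments
        running_max = max(running_max, level)
    count_sup += running_max > x
    count_end += level > x

p_sup, p_end = count_sup / n, count_end / n
print(p_sup, 2 * p_end)   # the lemma: P(sup over D > x) <= 2 P(X_{t0} > x)
```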
6.3 Variation and quadratic variation

In this and the subsequent section we study the sample paths of a continuous Brownian
motion. In this framework it will be useful to consider the space C[0,∞) consisting of all
functions x ∈ R[0,∞) that are continuous. As with the projections X̂t on R[0,∞) , we let X̃t denote
the coordinate projections on C[0,∞) , that is, X̃t (x) = xt for all x ∈ C[0,∞) . Let C[0,∞) denote
the smallest σ–algebra that makes all X̃t C[0,∞) − B–measurable,
C[0,∞) = σ(X̃t | t ≥ 0) .
Similarly to what we have seen previously, C[0,∞) is generated by all the finite–dimensional
cylinder sets
C[0,∞) = σ (X̃t1 , . . . , X̃tn ) ∈ Bn | n ∈ N, 0 < t1 < · · · < tn , Bn ∈ Bn .
We demonstrated in Section 6.2 that there exists a process X defined on (Ω, F, P ) with
values in (R[0,∞) , B[0,∞) ) such that X is a Brownian motion X = (Xt ) and the sample paths
t 7→ Xt (ω) are continuous for all ω ∈ Ω. Equivalently, we have X(ω) ∈ C[0,∞) for all ω ∈ Ω,
so we can regard the process X as having values in C[0,∞) . That X is measurable with values
in (R[0,∞) , B[0,∞) ) means that X̂t (X) is F − B measurable for all t ≥ 0. But X̃t (X) = X̂t (X)
since X is continuous, so X̃t (X) is also F −B measurable for all t ≥ 0. Then X is measurable,
when regarded as a variable with values in (C[0,∞) , C[0,∞) ). The distribution X(P ) of X will
be a distribution on (C[0,∞) , C[0,∞) ), and this is uniquely determined by the behaviour on the
finite–dimensional cylinder sets on the form (X̃t1 , . . . , X̃tn ) ∈ Bn .
The space (C[0,∞) , C[0,∞) ) is significantly easier to deal with than (R[0,∞) , B[0,∞) ), and a number
of interesting functionals become measurable on C[0,∞) , while they are not measurable on
R[0,∞) . For instance, for t > 0 the functional
M̃ = sup_{s∈[0,t]} X̃s
is C[0,∞) − B measurable, since for all y ∈ R
(M̃ ≤ y) = ⋂_{s∈[0,t]} (X̃s ≤ y) = ⋂_{q∈[0,t]∩Q} (X̃q ≤ y) ,
where the last intersection is countable – hence measurable. For the last equality, the inclusion
’⊆’ is trivial. For the converse inclusion, assume that
x ∈ ⋂_{q∈[0,t]∩Q} (X̃q ≤ y) .
Then xq ≤ y for all q ∈ [0, t] ∩ Q. Let s ∈ [0, t] and find a rational sequence qn → s. Then
xs = limn→∞ xqn ≤ y and since s was arbitrarily chosen, it holds that
x ∈ ⋂_{s∈[0,t]} (X̃s ≤ y) .
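That the supremum of a continuous function is determined by its values at rational points is easy to see numerically; in the small sketch below the path sin is an illustrative stand-in for a continuous sample path, and the suprema over finer and finer rational grids approach the supremum over the whole interval.

```python
import math

t = 2.0
f = math.sin   # an illustrative continuous path; its sup on [0, 2] is 1 (at pi/2)

# suprema over only the rational points k/N in [0, t]
sups = {N: max(f(k / N) for k in range(int(t * N) + 1)) for N in (10, 100, 10_000)}
print(sups)    # increases towards the supremum over all of [0, t]
```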
We will define various concepts that can be used to describe the behaviour of the sample
paths of a process.
Definition 6.3.1. Let x ∈ C[0,∞) . We say that x is nowhere monotone if for all 0 ≤ s < t it
holds that x is neither increasing nor decreasing on [s, t]. Let S ⊆ C[0,∞) denote the set of
nowhere monotone functions.
Related to the set S we define Mst to be the set of functions which are either increasing or
decreasing on the interval [s, t]. With tkN = s + (k/2^N )(t − s) for 0 ≤ k ≤ 2^N , continuity gives
Mst = ⋂_{N=1}^{∞} { x ∈ C[0,∞) | x_{tkN} ≥ x_{tk−1,N} , 1 ≤ k ≤ 2^N }
∪ ⋂_{N=1}^{∞} { x ∈ C[0,∞) | x_{tkN} ≤ x_{tk−1,N} , 1 ≤ k ≤ 2^N } .
We note that Mst ∈ C[0,∞) , since e.g. each set in the first intersection can be written as
( (X̃_{t0N} , . . . , X̃_{t_{2^N} N} ) ∈ BN ) for a suitable Borel set BN .
Since x ∈ S c if and only if there exist intervals with rational endpoints where x is monotone,
we can write
S c = ⋃_{0≤q1<q2, q1,q2∈Q} M_{q1 q2} ,
which shows that S ∈ C[0,∞) . We shall see later that P (X ∈ S) = 1 for a continuous Brownian
motion X.
Definition 6.3.2. Let x ∈ C[0,∞) and 0 ≤ s < t. The variation of x on [s, t] is defined as
Vst (x) = sup Σ_{k=1}^{n} |xtk − xtk−1 | ,
where sup is taken over all finite partitions s ≤ t0 < · · · < tn ≤ t of [s, t].
Lemma 6.3.3. Let x, y ∈ C[0,∞) , c ∈ R, 0 ≤ s < t and [s, t] ⊆ [s′, t′]. Then it holds that

(1) Vst (x) ≤ Vs′t′ (x),
(2) Vst (cx) = |c| Vst (x),
(3) Vst (x + y) ≤ Vst (x) + Vst (y),
(4) Vst (x) = |xt − xs | if x is monotone on [s, t].

Proof. The first statement holds because the sup in Vs′t′ (x) is over more partitions than the sup
in Vst (x). For the second result we have
Vst (cx) = sup Σ_{k=1}^{n} |c xtk − c xtk−1 | = |c| sup Σ_{k=1}^{n} |xtk − xtk−1 | = |c| Vst (x) ,
and the third follows by applying the triangle inequality to each term of the partition sums. If x is monotone on [s, t], every partition sum telescopes,
Σ_{k=1}^{n} |xtk − xtk−1 | = |xtn − xt0 | ≤ |xt − xs | ,
with equality for the partition t0 = s, t1 = t, which gives
Vst (x) = |xt − xs | .
Let x ∈ C[0,∞) and assume that s ≤ tk−1 < tk ≤ t are given. If (qn ), (rn ) ⊆ [s, t] are rational
sequences with qn → tk−1 and rn → tk , it holds due to the continuity of x that
|xrn − xqn | → |xtk − xtk−1 | .
This shows that all partitions can be approximated arbitrarily well by rational partitions,
so the sup in the definition of Vst needs only to be over all rational partitions. Hence Vst is
C[0,∞) − B measurable.
Definition 6.3.5. (1) Let x ∈ C[0,∞) and 0 ≤ s < t. Then x is of bounded variation on
[s, t] if Vst (x) < ∞. The set of functions of bounded variation on [s, t] is denoted Fst .

(2) Let G denote the set of x ∈ C[0,∞) with Vst (x) = ∞ for all 0 ≤ s < t, that is, the set of functions that are everywhere of unbounded variation.
Since Vst is C[0,∞) − B measurable we observe that Fst ∈ C[0,∞) . Furthermore we can rewrite
G as
G = ⋂_{0≤q1<q2, q1,q2∈Q} F^c_{q1 q2} ,
which shows that G ∈ C[0,∞) . The equality above is a direct consequence of (1) in Lemma
6.3.3.
The following lemmas show which types of continuous functions have bounded variation.

Lemma 6.3.6. Let x ∈ C[0,∞) . Then x ∈ Fst if and only if x on [s, t] has the form
x = y − ỹ ,
where both y and ỹ are increasing on [s, t].
Proof. If x has the form x = y − ỹ on [s, t], where both y and ỹ are increasing, then using
Lemma 6.3.3 yields
Vst (x) = Vst (y − ỹ) ≤ Vst (y) + Vst (−ỹ) = Vst (y) + Vst (ỹ) = |yt − ys | + |ỹt − ỹs |
which is finite. Conversely, assume that Vst (x) < ∞ and define for u ∈ [s, t]
yu = (1/2)( xu + Vsu (x) ) and ỹu = (1/2)( −xu + Vsu (x) ) .
Then x = y − ỹ and furthermore we have that e.g. u ↦ yu is increasing: If xu+h ≥ xu , then
yu+h ≥ yu since always Vs,u+h (x) ≥ Vsu (x). If xu+h < xu , then
Vsu (x) + |xu+h − xu | = sup̃ ( Σ_{j=1}^{n′} |xsj − xsj−1 | ) + |xu+h − xu | ≤ sup Σ_{k=1}^{n} |xtk − xtk−1 | = Vs,u+h (x) ,
where sup̃ is over all partitions s ≤ s0 < · · · < sn′ ≤ u and sup is over all partitions
s ≤ t0 < · · · < tn ≤ u + h. Hence we have seen that
yu = (1/2)( xu+h + |xu+h − xu | + Vsu (x) ) ≤ (1/2)( xu+h + Vs,u+h (x) ) = yu+h .
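The construction in the proof can be traced numerically; the sketch below (an illustrative path on a fine grid) computes the cumulative variation u ↦ V_{0,u}(x) and verifies that y = (x + V)/2 and ỹ = (−x + V)/2 are increasing with x = y − ỹ.

```python
import math

# an illustrative path x on [0, 1], evaluated on a fine grid
n = 2000
xs = [math.sin(2 * math.pi * k / n) for k in range(n + 1)]

# cumulative variation u -> V_{0,u}(x) along the grid
V = [0.0]
for k in range(1, n + 1):
    V.append(V[-1] + abs(xs[k] - xs[k - 1]))

# the decomposition from the proof: y = (x + V)/2, yt = (-x + V)/2
y = [(xs[k] + V[k]) / 2 for k in range(n + 1)]
yt = [(-xs[k] + V[k]) / 2 for k in range(n + 1)]

assert all(y[k + 1] >= y[k] - 1e-12 for k in range(n))    # y is increasing
assert all(yt[k + 1] >= yt[k] - 1e-12 for k in range(n))  # yt is increasing
assert all(abs(y[k] - yt[k] - xs[k]) < 1e-12 for k in range(n + 1))
print(V[-1])   # total variation of sin(2*pi*u) on [0, 1], namely 4
```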
In particular we have G ⊆ S. To see this, assume that x ∈ S c . Then there exists s < t such that x is monotone on [s, t]. If x is increasing on [s, t], then x has the form x − 0 there, and if x is decreasing, the form 0 − (−x); in both cases x is a difference of increasing functions. Thus x ∈ Fst , so
x ∈ Gc .
Similarly, a continuously differentiable function has bounded variation on compact intervals: suppose x is C¹ on [s, t]. The derivative x′ of x is continuous on [s, t], so x′ must be bounded on [s, t]. Let
K = sup_{u∈[s,t]} |x′(u)| and consider an arbitrary partition s ≤ t0 < · · · < tn ≤ t. By the mean value theorem there are uk ∈ (tk−1 , tk ) with
Σ_{k=1}^{n} |xtk − xtk−1 | = Σ_{k=1}^{n} (tk − tk−1 ) |x′(uk )| ≤ (t − s)K < ∞ ,
so Vst (x) ≤ (t − s)K.
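The bound can be illustrated numerically; in the sketch below the path u ↦ u(1 − u) is an illustrative C¹ choice with K = 1, and refining partition sums approach its total variation 1/2.

```python
# x(u) = u*(1 - u) is C^1 on [0, 1] with K = sup |x'(u)| = sup |1 - 2u| = 1
x = lambda u: u * (1 - u)
n = 100_000
var = sum(abs(x((k + 1) / n) - x(k / n)) for k in range(n))
print(var)          # refining partition sums approach V_{0,1}(x) = 1/2
assert var <= 1.0   # the bound V_{0,1}(x) <= (t - s) * K = 1 from the lemma
```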
Definition 6.3.9. Let x ∈ C[0,∞) and 0 ≤ s < t. The quadratic variation of x on [s, t] is defined
as
Qst (x) = limsup_{N→∞} Σ_{k: s ≤ (k−1)/2^N < k/2^N ≤ t} ( x_{k/2^N} − x_{(k−1)/2^N} )² .
Note that Qst (x) ≤ Qs′t′ (x) if [s, t] ⊆ [s′, t′], since the sums for the larger interval contain at least the same terms. Note also that
Vst (x) < ∞ ⇒ Qst (x) = 0 ,
since each sum is at most max_k |x_{k/2^N} − x_{(k−1)/2^N} | · Vst (x), and the maximum tends to 0 by the uniform continuity of x on [s, t]. This can be stated
equivalently as
Qst (x) > 0 ⇒ Vst (x) = ∞ .
The main result of the section is the following theorem, which describes exactly how "wild"
the sample paths of the Brownian motion behave.
Theorem 6.3.11. If X = (Xt )t≥0 is a continuous Brownian motion with drift ξ and variance
σ², then
P( ⋂_{0≤s<t} ( Qst (X) = (t − s)σ² ) ) = 1 .
Corollary 6.3.12. If X = (Xt )t≥0 is a continuous Brownian motion with drift ξ and vari-
ance σ 2 , then X is everywhere of unbounded variation,
P (X ∈ G) = 1 .
Corollary 6.3.13. If X = (Xt )t≥0 is a continuous Brownian motion with drift ξ and vari-
ance σ 2 , then X is nowhere monotone,
P (X ∈ S) = 1 .
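Theorem 6.3.11 can be illustrated by simulating a single path at dyadic resolution; the seeded sketch below (ξ, σ and t = 1 are illustrative values) computes the dyadic sum of squared increments and compares it with tσ².

```python
import math
import random

# dyadic quadratic variation of one simulated Brownian path on [0, 1]
# (xi and sigma are illustrative values; resolution is 2^-n)
rng = random.Random(0)
xi, sigma, n = 0.7, 1.3, 14
steps, dt = 2 ** n, 1.0 / 2 ** n
qv = sum((xi * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)) ** 2
         for _ in range(steps))
print(qv, sigma ** 2)   # the dyadic sums concentrate around t * sigma^2
```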
Proof. We first show that
⋂_{0≤s<t} ( Qst (X) = (t − s)σ² ) = ⋂_{0≤q1<q2, q1,q2∈Q} ( Qq1q2 (X) = (q2 − q1 )σ² ) . (6.8)
The inclusion ⊆ is trivial. The converse inclusion ⊇ is argued as follows: Assume that x
is an element of the right hand side and let 0 ≤ s < t be given. Then we are supposed to
show that Qst (x) = (t − s)σ². Let (qn1 ), (qn2 ), (rn1 ), (rn2 ) be rational sequences such that qn1 ↑ s,
qn2 ↓ t, rn1 ↓ s, and rn2 ↑ t. Then for all n ∈ N we must have [rn1 , rn2 ] ⊆ [s, t] ⊆ [qn1 , qn2 ], so by the monotonicity of the quadratic variation
Q_{rn1 rn2} (x) ≤ Qst (x) ≤ Q_{qn1 qn2} (x) .
Because x belongs to the right hand side of (6.8) it holds that
Q_{rn1 rn2} (x) = (rn2 − rn1 )σ² and Q_{qn1 qn2} (x) = (qn2 − qn1 )σ² ,
which combined with the inequality above and letting n → ∞ gives the desired result that Qst (x) = (t − s)σ².
Since the intersection above is countable, we only need to conclude that each of the sets in
the intersection has probability 1 in order to conclude the result. Hence it will suffice to show
that
P( Qst (X) = (t − s)σ² ) = 1
for given 0 ≤ s < t (we only need to show it for s, t ∈ Q, but that makes no difference in the
rest of the proof). Furthermore, we show the result for s = 0 only. The general result can be seen
in the exact same way, using even more notation.
Define for n ∈ N and k = 1, 2, . . . the variables
Uk,n = ( X_{k/2^n} − X_{(k−1)/2^n} − ξ/2^n ) / √(σ²/2^n) ,
so each Uk,n ∼ N (0, 1). Furthermore, for fixed n and varying k the increments are indepen-
dent, such that also U1,n , U2,n , . . . are independent. We can write
Q0t (X) = limsup_{n→∞} Σ_{k=1}^{[2^n t]} ( X_{k/2^n} − X_{(k−1)/2^n} )²
= limsup_{n→∞} Σ_{k=1}^{[2^n t]} ( √(σ²/2^n) Uk,n + ξ/2^n )²
= limsup_{n→∞} Σ_{k=1}^{[2^n t]} ( (σ²/2^n) U²k,n + 2 √(σ²/2^n) (ξ/2^n) Uk,n + ξ²/4^n )
= limsup_{n→∞} Σ_{k=1}^{[2^n t]} (σ²/2^n) ( U²k,n + (2/(σ√(2^n))) ξ Uk,n ) + ([2^n t]/4^n) ξ² .
This gives
Q0t (X) − tσ² = limsup_{n→∞} [ Σ_{k=1}^{[2^n t]} (σ²/2^n) ( U²k,n + (2/(σ√(2^n))) ξ Uk,n ) − tσ² + ([2^n t]/4^n) ξ² ]
= limsup_{n→∞} [ σ² Σ_{k=1}^{[2^n t]} (1/2^n) ( U²k,n + (2/(σ√(2^n))) ξ Uk,n − 1 ) + σ² ( [2^n t]/2^n − t ) + ([2^n t]/4^n) ξ² ] .
We have that
t − 1/2^n = (2^n t − 1)/2^n < [2^n t]/2^n ≤ 2^n t/2^n = t ,
which shows that
[2^n t]/2^n → t and [2^n t]/4^n = (1/2^n) · ( [2^n t]/2^n ) → 0
as n → ∞. So for the deterministic part of Q0t (X) − tσ² it holds that
σ² ( [2^n t]/2^n − t ) + ξ² ( [2^n t]/4^n ) → 0
as n → ∞. Then the proof will be complete, if we can show that Sn → 0 a.s., where
Sn = Σ_{k=1}^{[2^n t]} (1/2^n) ( U²k,n + (2/(σ√(2^n))) ξ Uk,n − 1 ) .
Note that
ESn = Σ_{k=1}^{[2^n t]} (1/2^n) ( EU²k,n + (2/(σ√(2^n))) ξ EUk,n − 1 ) = 0 ,
since EUk,n = 0 and EU²k,n = 1. According to Lemma 1.2.12 the convergence of Sn is obtained, if we can
show
Σ_{n=1}^{∞} P(|Sn | > ε) < ∞ (6.9)
for all ε > 0. Each term in this sum satisfies (using Chebyshev's inequality)
P(|Sn | > ε) = P(|Sn − ESn | > ε) ≤ (1/ε²) V (Sn ) ,
such that the sum in (6.9) converges, if we can show
Σ_{n=1}^{∞} V (Sn ) < ∞ . (6.10)
Using that U1,n , U2,n , . . . are independent and identically distributed gives
V (Sn ) = Σ_{k=1}^{[2^n t]} (1/4^n) V( U²k,n + (2/(σ√(2^n))) ξ Uk,n − 1 )
= [2^n t] (1/4^n) V( U²1,n + (2/(σ√(2^n))) ξ U1,n − 1 )
= ([2^n t]/4^n) E( U²1,n + (2/(σ√(2^n))) ξ U1,n − 1 )²
= ([2^n t]/4^n) ( EU⁴1,n + (4ξ²/(σ² 2^n)) EU²1,n + 1 + (4ξ/(σ√(2^n))) EU³1,n − 2EU²1,n − (4ξ/(σ√(2^n))) EU1,n ) ,
where the third equality uses that the variable has mean 0. Since EU1,n = 0, EU²1,n = 1, EU³1,n = 0, and EU⁴1,n = 3 we have
V (Sn ) = ([2^n t]/4^n) ( 3 + 4ξ²/(σ² 2^n) + 1 − 2 ) = ([2^n t]/4^n) ( 2 + 4ξ²/(σ² 2^n) )
≤ (2^n t/4^n) ( 2 + 4ξ²/(2σ²) ) = (t/2^n) ( 2 + 2ξ²/σ² ) ,
from which it is seen that the sum in (6.10) is finite.
6.4 The law of the iterated logarithm

In this section we shall show a classical result concerning Brownian motion which in more
detail describes the behaviour of the sample paths immediately after the start of the process.
The idea is to find a function h with h(t) > 0 for t > 0 such that
limsup_{t→0} (1/h(t)) Xt and liminf_{t→0} (1/h(t)) Xt (6.11)
are both something interesting, i.e., finite and different from 0. A good guess for a sensible
h can be obtained by considering a Brownian motion without drift (ξ = 0), and using that
then
(1/√t) Xt
has the same distribution for all t > 0, which could be taken as an indication that (1/√t) Xt
behaves sensibly for t → 0. But alas, h(t) = √t is too small, although not much of an
adjustment is needed before (6.11) yields something interesting.
Define
φ(t) = √( 2t log log(1/t) ) , 0 < t < 1/e . (6.12)
Since
lim_{t→0} t log log(1/t) = lim_{t→∞} (log log t)/t = 0 ,
we have lim_{t→0} φ(t) = 0, so it makes sense to define φ(0) = 0. Then φ is defined, non-negative
and continuous on [0, 1/e). We shall also need the useful limit
lim_{x→∞} ( ∫_x^∞ e^{−u²/2} du ) / ( (1/x) e^{−x²/2} ) = 1 , (6.13)
which follows from the following inequalities, that all hold for x > 0: since u/x ≥ 1 for u ≥ x
we have
∫_x^∞ e^{−u²/2} du ≤ ∫_x^∞ (u/x) e^{−u²/2} du = (1/x) [ −e^{−u²/2} ]_x^∞ = (1/x) e^{−x²/2} ,
and since u/(x+1) ≤ 1 for x ≤ u ≤ x + 1 we have
∫_x^∞ e^{−u²/2} du ≥ ∫_x^{x+1} ( u/(x+1) ) e^{−u²/2} du
= ( 1/(x+1) ) ( e^{−x²/2} − e^{−(x+1)²/2} )
= (1/x) e^{−x²/2} · ( x/(x+1) ) ( 1 − e^{−x−1/2} ) ,
where the last factor tends to 1 as x → ∞, which yields (6.13).
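Both inequalities, and the limit (6.13), can be checked numerically; the sketch below evaluates the tail integral exactly via the complementary error function from the standard library (the sample points are illustrative).

```python
import math

def upper_tail(x):
    # integral_x^infty e^{-u^2/2} du = sqrt(pi/2) * erfc(x / sqrt(2))
    return math.sqrt(math.pi / 2) * math.erfc(x / math.sqrt(2))

ratios = []
for x in (2.0, 5.0, 10.0, 20.0):
    upper = math.exp(-x * x / 2) / x
    lower = (math.exp(-x * x / 2) - math.exp(-(x + 1) ** 2 / 2)) / (x + 1)
    assert lower <= upper_tail(x) <= upper     # the two inequalities above
    ratios.append(upper_tail(x) / upper)

print(ratios)   # increases towards 1, illustrating the limit (6.13)
```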
Theorem 6.4.1 (The law of the iterated logarithm). For a continuous Brownian motion
X = (Xt )t≥0 with drift ξ and variance σ 2 > 0, it holds that
P( limsup_{t→0} Xt/(σφ(t)) = 1 ) = P( liminf_{t→0} Xt/(σφ(t)) = −1 ) = 1 ,
where φ is given by (6.12).
Proof. We show the theorem for X a continuous, normalised Brownian motion. Since
lim_{t→0} ξt/(σφ(t)) = 0 ,
the normalised case then gives
limsup_{t→0} (σXt + ξt)/(σφ(t)) = limsup_{t→0} Xt/φ(t) = 1 a.s.,
and similarly for lim inf. Since (σXt + ξt)t≥0 is a continuous Brownian motion with drift ξ and variance σ², the theorem follows
for an arbitrary Brownian motion.

So let X be a continuous, normalised Brownian motion. It suffices to show that
P( limsup_{t→0} Xt/φ(t) ≤ 1 ) = 1 (6.14)
and
P( limsup_{t→0} Xt/φ(t) ≥ 1 ) = 1 , (6.15)
since the corresponding lim inf statement then follows by applying (6.14) and (6.15) to the normalised Brownian motion −X. To show (6.14), let 0 < u < 1, put tn = u^n and define
Cn,ε,u = ⋃_{t∈[tn+1 , tn ]} ( Xt > (1 + ε)φ(t) ) .
Since X is continuous, the union can be replaced by a countable union, so Cn,ε,u is measurable.
For a given ε > 0 and 0 < u < 1 it is seen that
(Cn,ε,u i.o.) = ( ∀ n0 ≥ 1 ∃ n ≥ n0 ∃ t ∈ [tn+1 , tn ] : Xt > (1 + ε)φ(t) )
= ( ∀ n ≥ 1 ∃ t ≤ tn : Xt > (1 + ε)φ(t) ) = ( limsup_{t→0} Xt/φ(t) > 1 + ε ) ,
so it is thus clear that (6.14) follows if there for all ε > 0 exists a u, 0 < u < 1, such that
P(Cn,ε,u i.o.) = 0 ,
and to deduce this, it is by the Borel–Cantelli lemma (Lemma 1.2.11) sufficient that
Σ_n P(Cn,ε,u ) < ∞ . (6.16)
(Note that Cn,ε,u is only defined for n so large that tn = u^n ∈ [0, 1/e), the interval where φ is
defined. In all computations we of course only consider such n.)
Since the function φ is continuous on [0, 1/e) with φ(0) = 0 and φ(t) > 0 for t > 0, there exists
0 < δ0 < 1/e such that φ is increasing on the interval [0, δ0 ]. Therefore it holds for n large (so
large that tn ≤ δ0 ) that
P(Cn,ε,u ) ≤ P( ⋃_{t: tn+1 ≤t≤tn} ( Xt > (1 + ε)φ(tn+1 ) ) ) = P( sup_{t: tn+1 ≤t≤tn} Xt > (1 + ε)φ(tn+1 ) )
≤ P( sup_{t≤tn} Xt > (1 + ε)φ(tn+1 ) ) ≤ 2P( Xtn > (1 + ε)φ(tn+1 ) ) ,
where we in the last inequality have used Lemma 6.2.9 and the continuity of X (which implies
that sup_{t≤tn} Xt = sup_{q∈Q∩[0,tn ]} Xq ). Since (1/√tn ) Xtn is N (0, 1)-distributed it follows that
P(Cn,ε,u ) ≤ √(2/π) ∫_{xn}^{∞} e^{−s²/2} ds ,
where
xn = (1 + ε) (1/√tn ) φ(tn+1 ) = (1 + ε) √( 2u log( (n + 1) log(1/u) ) ) .
We see that xn → ∞ for n → ∞ and hence it holds by (6.13) that
( ∫_{xn}^{∞} e^{−s²/2} ds ) / ( (1/xn ) e^{−x²n /2} ) → 1 .
In particular we have
( ∫_{xn}^{∞} e^{−s²/2} ds ) / e^{−x²n /2} → 0 .
Here
e^{−x²n /2} = e^{−(1+ε)² u log((n+1) log(1/u))} = ( (n + 1) log(1/u) )^{−(1+ε)² u} ,
so if u is chosen so close to 1 that (1 + ε)² u > 1, then Σ_n e^{−x²n /2} < ∞, and hence also Σ_n P(Cn,ε,u ) < ∞. This proves (6.16), and thereby (6.14).
To show (6.15), let tn = v^n , where 0 < v < 1, put Zn = Xtn − Xtn+1 and define
Dn,ε,v = ( Zn > (1 − ε/2) φ(tn ) ) .
Note that the events Dn,ε,v for fixed ε and v and varying n are mutually independent.
We shall show that, given ε > 0, there exists a v, 0 < v < 1, such that
P(Dn,ε,v i.o.) = 1 , (6.17)
and we claim that this implies (6.15): if we apply (6.14) to −X, we get
P( liminf_{t→0} Xt/φ(t) ≥ −1 ) = 1 .
Assuming (6.17), for almost all ω it therefore holds that
liminf_{t→0} Xt (ω)/φ(t) ≥ −1
and that there exists a subsequence (n′), n′ → ∞ of natural numbers (depending on ω) such
that for all n′
ω ∈ Dn′,ε,v .
But then, for all n′,
Xtn′ (ω)/φ(tn′ ) > 1 − ε/2 + Xtn′+1 (ω)/φ(tn′ ) .
Since φ(tn′+1 )/φ(tn′ ) → √v as n′ → ∞, and furthermore
liminf_{n′→∞} Xtn′+1 (ω)/φ(tn′+1 ) ≥ liminf_{t→0} Xt (ω)/φ(t) ≥ −1 ,
we see that
limsup_{t→0} Xt (ω)/φ(t) ≥ limsup_{n′→∞} Xtn′ (ω)/φ(tn′ ) ≥ 1 − ε/2 + √v liminf_{n′→∞} Xtn′+1 (ω)/φ(tn′+1 )
≥ 1 − ε/2 − √v ≥ 1 − ε ,
if v for given ε > 0 is chosen so small that √v < ε/2. Hence we have shown that (6.17) implies
(6.15).
We still need to show (6.17). Since the Dn,ε,v 's for varying n are independent, (6.17) follows
from the second version of the Borel–Cantelli lemma (Lemma 1.3.12) by showing that
Σ_n P(Dn,ε,v ) = ∞ . (6.18)
We conclude the proof by showing that this may be achieved by choosing v > 0 sufficiently
small for any given > 0 (this was already needed to conclude that (6.17) implies (6.15)).
But since Zn is N (0, tn − tn+1 )–distributed, we obtain
P(Dn,ε,v ) = (1/√(2π)) ∫_{yn}^{∞} e^{−s²/2} ds ,
where
yn = (1 − ε/2) φ(tn )/√(tn − tn+1 ) = (1 − ε/2) √( (2/(1 − v)) log( n log(1/v) ) ) .
Since yn → ∞ as n → ∞, it holds by (6.13) that
( ∫_{yn}^{∞} e^{−s²/2} ds ) / ( (1/yn ) e^{−y²n /2} ) → 1 ,
and since
yn /√(log n) = const. · √( log( n log(1/v) ) / log n ) = const. · √( (log n + log log(1/v)) / log n ) → const. > 0 ,
the proof is now finished by realising that for given ε > 0 we have
Σ_n (1/√(log n)) e^{−y²n /2} = ∞ (6.19)
when v is chosen such that √v < ε/2: indeed,
e^{−y²n /2} = ( n log(1/v) )^{−(1−ε/2)²/(1−v)} ,
and √v < ε/2 implies 1 − v > 1 − ε²/4 ≥ (1 − ε/2)², so the exponent (1 − ε/2)²/(1 − v) is strictly less than 1 and the series in (6.19) diverges by comparison with Σ_n n^{−a} for a < 1. Combined with the limit above this yields (6.18).
As an immediate consequence of Theorem 6.4.1 we obtain the following result concerning the
number of points where the Brownian motion is zero.
Corollary 6.4.2. If X is a continuous Brownian motion, it holds for almost all ω that for
all > 0, Xt (ω) = 0 for infinitely many values of t ∈ [0, ].
Note on the other hand that P(Xq = 0) = 0 for each fixed q > 0, and a countable union of null sets is null; combined with the continuity of the sample paths, for almost all ω there exists, for each rational q > 0, an open interval around q where
t ↦ Xt (ω) does not take the value 0. In some sense, therefore, Xt (ω) is only rarely 0, but it
still happens an infinite number of times close to 0.
Proof. Theorem 6.4.1 implies that for almost every ω there exist sequences 0 < sn ↓ 0 and
0 < tn ↓ 0 such that
Xsn (ω) > (1/2) σφ(sn ) > 0 and Xtn (ω) < −(1/2) σφ(tn ) < 0
for all n. Since X is continuous, the corollary follows.
Theorem 6.4.3. For a continuous, normalised Brownian motion X = (Xt )t≥0 it holds that
P( limsup_{t→∞} Xt/√(2t log log t) = 1 ) = P( liminf_{t→∞} Xt/√(2t log log t) = −1 ) = 1 .
Proof. Put Y0 = 0 and Yt = t X1/t for t > 0. Then t ↦ Yt is continuous on the open interval (0, ∞) and for arbitrary n and 0 < t1 <
· · · < tn it is clear that (Yt1 , . . . , Ytn ) follows an n-dimensional normal distribution. Since
Y0 = X0 = 0 and we for 0 < s < t have EYt = 0 while (recall the finite–dimensional
distributions of the Brownian motion)
Cov(Ys , Yt ) = st Cov(X1/s , X1/t ) = st · min(1/s, 1/t) = s = min(s, t) ,
the finite–dimensional distributions of Y agree with those of X,
and with Y continuous on (0, ∞) we see that Y becomes continuous on [0, ∞). But then the
continuous process Y has the same distribution as X, and thus, Y is a continuous, normalized
Brownian motion. Theorem 6.4.1 applied to Y then shows us that for instance
limsup_{s→0} ( s X1/s ) / √( 2s log log(1/s) ) = 1 a.s.
If the s here is replaced by 1/t we obtain
limsup_{t→∞} Xt / √( 2t log log t ) = 1 a.s.
as we wanted.
From Theorem 6.4.3 it easily follows, by an argument similar to the one we used in the proof
of Corollary 6.4.2, that for t → ∞ a standard Brownian motion will cross any given level
x ∈ R infinitely many times.
6.5 Exercises
Exercise 6.1. Let X = (Xt )t≥0 be a Brownian motion with drift ξ ∈ R and variance
σ 2 > 0. Define for each t ≥ 0
X̃t = (Xt − ξt)/σ .
Show that X̃ = (X̃t )t≥0 is a normalised Brownian motion. ◦
Exercise 6.2. Assume that (Ω, F, P ) is a probability space, and assume that D1 , D2 ⊆ F
both are ∩–stable collections of sets. Assume that D1 and D2 are independent, that is,
P(A ∩ B) = P(A)P(B) for all A ∈ D1 , B ∈ D2 .
Show that the σ–algebras σ(D1 ) and σ(D2 ) are independent. ◦
Exercise 6.3. Let X = (Xt )t≥0 be a Brownian Motion with drift ξ and variance σ 2 > 0.
Define for each t > 0 the σ–algebra
Ft = F(Xs : 0 ≤ s ≤ t)
You can use without argument that (similarly to the arguments in the beginning of Sec-
tion 6.1) (Xs )0≤s≤t has values in (R[0,t] , B[0,t] ) where the σ–algebra B[0,t] is generated by
G = { (X̂t1 , . . . , X̂tn ) ∈ Bn : n ∈ N, 0 ≤ t1 < · · · < tn ≤ t, Bn ∈ Bn }
Exercise 6.4. Assume that X = (Xt )t≥0 is a Brownian Motion with drift ξ ∈ R and variance
σ 2 > 0. Define Ft as in Exercise 6.3. Show that X has the following Markov property for
t≥s
E(Xt |Fs ) = E(Xt |Xs ) a.s.
Exercise 6.5. Assume that X = (Xt )t≥0 is a normalised Brownian motion and define Ft as
in Exercise 6.3 for each t ≥ 0. Show that
Exercise 6.6. Let X = (Xt )t≥0 be a normalised Brownian motion. Let T > 0 be fixed and
define the process B T = (BtT )0≤t≤T by
BtT = Xt − (t/T) XT .
The process B T is called a Brownian bridge on [0, T ].
Show that for 0 ≤ t1 < · · · < tn ≤ T the vector (BtT1 , . . . , BtTn ) follows an n-dimensional normal distribution. ◦
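The bridge can also be explored by simulation; the following seeded sketch (with illustrative s, t, T) estimates E[B_s^T B_t^T] and compares it with min(s, t) − st/T, the covariance one computes for the Brownian bridge.

```python
import math
import random

rng = random.Random(7)
T, s, t, n = 1.0, 0.3, 0.6, 200_000   # illustrative values

acc = 0.0
for _ in range(n):
    # sample (X_s, X_t, X_T) of a normalised Brownian motion via increments
    xs = math.sqrt(s) * rng.gauss(0.0, 1.0)
    xt = xs + math.sqrt(t - s) * rng.gauss(0.0, 1.0)
    xT = xt + math.sqrt(T - t) * rng.gauss(0.0, 1.0)
    acc += (xs - (s / T) * xT) * (xt - (t / T) * xT)   # B_s^T * B_t^T

cov_mc = acc / n                    # both bridge values have mean 0
cov_exact = min(s, t) - s * t / T   # the Brownian-bridge covariance
print(cov_mc, cov_exact)
```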
Exercise 6.7. A stochastic process X = (Xt )t≥0 is self–similar if there exists H > 0 such
that
(Xγt )t≥0 =D (γ^H Xt )t≥0 for all γ > 0 .
Intuitively, this means that if we "zoom in" on the process, then it looks like a scaled version
of the original process.
(1) Assume that X = (Xt )t≥0 is a Brownian motion with drift 0 and variance σ 2 . Show
that X is self–similar and find the parameter H.
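For part (1) the finite-dimensional distributions are centred Gaussian, so it suffices to compare covariance functions; the small deterministic sketch below (values are illustrative) checks the scaling with H = 1/2.

```python
import math

# For a Brownian motion with drift 0 and variance sigma^2 the finite-
# dimensional distributions are centred Gaussian with covariance
# Cov(X_s, X_t) = sigma^2 * min(s, t); comparing the covariances of
# (X_{gamma t}) and (gamma^H X_t) identifies H (values are illustrative).
sigma2, gamma, H = 2.5, 3.7, 0.5
grid = [0.1, 0.4, 0.9, 1.6]

for s in grid:
    for t in grid:
        lhs = sigma2 * min(gamma * s, gamma * t)     # Cov(X_{gamma s}, X_{gamma t})
        rhs = gamma ** (2 * H) * sigma2 * min(s, t)  # Cov(gamma^H X_s, gamma^H X_t)
        assert math.isclose(lhs, rhs)

print("finite-dimensional covariances match for H = 1/2")
```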
Now assume that X = (Xt )t≥0 is a stochastic process (defined on (Ω, F, P )) that is self–
similar with parameter 0 < H < 1. Assume that X has stationary increments:
D
Xt − Xs = Xt−s for all 0 ≤ s ≤ t .
(2) Show that for 0 ≤ s < t
(Xt − Xs )/(t − s) =D (t − s)^{H−1} X1 .
Exercise 6.8. Let Y be a normalised Brownian motion with continuous sample paths.
Hence Y can be considered as a random variable with values in (C[0,∞) , C[0,∞) ). Define for
all n, M ∈ N the set
Cn,M = { x ∈ C[0,∞) : sup_{t∈[n,n+1]} |xt |/t > 1/M }
◦
Chapter 7
Further reading
In this final chapter, we reflect on the theory presented in the previous chapters and give
recommendations for further reading.
The material covered in Chapter 1 can be found scattered in many textbooks on probability
theory, such as Breiman (1968), Loève (1977a), Kallenberg (2002) and Rogers & Williams
(2000a).
Abstract results in ergodic theory as presented in Chapter 2 can be found in Breiman (1968)
and Loève (1977b). A major application of ergodic theory is to the theory of the class
of stochastic processes known as Markov processes. In Markov process theory, stationary
processes are frequently encountered, and thus the theory presents an opportunity for utilizing
the ergodic theorem for stationary processes. Basic introductions to Markov processes in both
discrete and continuous time can be found in Norris (1999) and Brémaud (1999). In Meyn
& Tweedie (2009), a more general theory is presented, which includes a series of results on
ergodic Markov processes.
In Chapter 3, we dealt with weak convergence. In its most general form, weak convergence
of probability measures can be cast in the context of probability measures on complete,
separable metric spaces, where the metric space considered is endowed with the Borel-σ-
algebra generated by the open sets. A classical exposition of this theory is found in Billingsley
(1999), with Parthasarathy (1967) also being a useful resource.
Being one of the cornerstones of modern probability, the discrete-time martingale theory of
Chapter 5 can be found in many textbooks. A classical source is Rogers & Williams (2000a).
Supplementary material
In this chapter, we outline results which are either assumed to be well-known, or which are
of such auxiliary nature as to merit separation from the main text.
In this section, we recall some basic results on the supremum and infimum of a set in the
extended real numbers, as well as the limes superior and limes inferior of a sequence in R.
By R∗ , we denote the set R ∪ {−∞, ∞}, and endow R∗ with its natural ordering, in the sense
that −∞ < x < ∞ for all x ∈ R. We refer to R∗ as the extended real numbers. In general,
working with R∗ instead of merely R is useful, although somewhat technically inconvenient
from a formal point of view.
Definition A.1.1. Let A ⊆ R∗ . We say that y ∈ R∗ is an upper bound for A if it holds for
all x ∈ A that x ≤ y. Likewise, we say that y ∈ R∗ is a lower bound for A if it holds for all
x ∈ A that y ≤ x.
Theorem A.1.2. Let A ⊆ R∗ . There exists a unique upper bound sup A for A satisfying sup A ≤ y for every upper bound y for A, and a unique lower bound inf A for A satisfying y ≤ inf A for every lower bound y for A.

The elements sup A and inf A whose existence and uniqueness are stated in Theorem A.1.2
are known as the supremum and infimum of A, respectively, or as the least upper bound and
greatest lower bound of A, respectively.
In general, the formalities regarding the distinction between R and R∗ are necessary to keep
in mind when concerned with formal proofs; in practice, however, the supremum and infimum
of a set in R∗ are what one expects them to be: For example, the supremum of A ⊆ R∗ is infinity
precisely if A contains "arbitrarily large elements", and otherwise it is the "upper endpoint" of
the set, and similarly for the infimum.

The following yields useful characterisations of the supremum and infimum of a set when they are finite.
Lemma A.1.3. Let A ⊆ R∗ and let y ∈ R. Then y is the supremum of A if and only if the
following two properties hold:

(1) y is an upper bound for A.
(2) For each ε > 0 there exists x ∈ A such that y − ε < x.

Likewise, y is the infimum of A if and only if the following two properties hold:

(1) y is a lower bound for A.
(2) For each ε > 0 there exists x ∈ A such that x < y + ε.
Proof. We just prove the result on the supremum. Assume that y is the supremum of A. By
definition, y is then an upper bound for A. Let ε > 0. If y − ε were an upper bound for A,
we would have y ≤ y − ε, a contradiction. Therefore, y − ε is not an upper bound for A, and
so there exists x ∈ A such that y − ε < x. This proves that the two properties are necessary
for y to be the supremum of A.
To prove the converse, assume that the two properties hold; we wish to show that y is the
supremum of A. By our assumptions, y is an upper bound for A, so it suffices to show that
for any upper bound z ∈ R∗ , we have y ≤ z. To obtain this, note that by the second of our
two properties, for each ε > 0 there exists x ∈ A with y − ε < x. As z is an upper bound for A, x ≤ z, so y − ε < z for all ε > 0, and hence y ≤ z.

A.1 Limes superior and limes inferior 231
Lemma A.1.5. Let A ⊆ R∗ and assume that A is nonempty. Then inf A ≤ sup A.
Lemma A.1.6. Let A ⊆ R∗ . Put −A = {−x | x ∈ A}. Then − sup A = inf(−A) and
− inf A = sup(−A).
Lemma A.1.7. Let A ⊆ R∗ , and let y ∈ R. Then sup A > y if and only if there exists x ∈ A
with x > y. Analogously, inf A < y if and only if there exists x ∈ A with x < y.
Proof. We prove the result on the supremum. Assume that sup A > y. If sup A is infinite, A
is not bounded from above, and so there exist arbitrarily large elements in A, in particular
there exists x ∈ A with x > y. If sup A is finite, Lemma A.1.3 shows that with ε = sup A − y,
there exists x ∈ A such that y = sup A − ε < x. This proves that if sup A > y, there exists
x ∈ A with x > y. Conversely, if there is x ∈ A with x > y, we also obtain y < x ≤ sup A,
since sup A is an upper bound for A. This proves the other implication.
Note that the result of Lemma A.1.7 is false if the strict inequalities are exchanged with weak
inequalities. For example, sup[0, 1) ≥ 1, but there is no x ∈ [0, 1) with x ≥ 1. Next, we turn
our attention to sequences.
Definition A.1.8. Let (xn ) be a sequence in R. We define
limsup_{n→∞} xn = lim_{n→∞} sup_{k≥n} xk and liminf_{n→∞} xn = lim_{n→∞} inf_{k≥n} xk
(both limits exist in R∗ , since the sequences (sup_{k≥n} xk )n≥1 and (inf_{k≥n} xk )n≥1 are monotone), and refer to limsup_{n→∞} xn and liminf_{n→∞} xn as the limes superior and limes inferior of
(xn ), respectively.
The limes superior and limes inferior are useful tools for working with sequences and in
particular for proving convergence.
Lemma A.1.9. Let (xn ) be a sequence in R. Then lim inf n→∞ xn ≤ lim supn→∞ xn .
Theorem A.1.10. Let (xn ) be a sequence in R, and let c ∈ R∗ . xn converges to c if and only
if lim inf n→∞ xn = lim supn→∞ xn = c. In particular, (xn ) is convergent to a finite limit if
and only if the limes inferior and limes superior are finite and equal, and in the affirmative,
the limit is equal to the common value of the limes inferior and the limes superior.
Corollary A.1.11. Let (xn ) be a sequence of non-negative numbers. Then xn converges to zero if and only if limsup_{n→∞} xn = 0.

Proof. By Theorem A.1.10, it holds that limsup_{n→∞} xn = 0 if xn converges to zero. Conversely, assume that limsup_{n→∞} xn = 0. As zero is a lower bound for (xn ), we find
0 ≤ lim inf n→∞ xn ≤ lim supn→∞ xn = 0, so Theorem A.1.10 shows that xn converges
to zero.
We will often use Corollary A.1.11 to show various kinds of convergence results. We also
have the following useful results for the practical manipulation of expressions involving the
limes superior and limes inferior.
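To make these notions concrete: the limes superior is the limit of the tail suprema sup_{k≥n} xk , and the limes inferior the limit of the tail infima inf_{k≥n} xk . The following Python sketch illustrates this numerically; the example sequence, the finite horizon N, and all names are our own illustrative choices, and the tails are only approximated over that horizon.

```python
# Numerical illustration of the limes superior and limes inferior via
# tail suprema and infima. The sequence x_n = (-1)^n (1 + 1/n) is a
# hypothetical example; its limes superior is 1 and its limes inferior
# is -1.

def x(n):
    return (-1) ** n * (1 + 1 / n)

N = 10_000
xs = [x(n) for n in range(1, N + 1)]

def tail_sup(n):
    # sup_{k >= n} x_k, approximated over k = n, ..., N
    return max(xs[n - 1:])

def tail_inf(n):
    # inf_{k >= n} x_k, approximated over k = n, ..., N
    return min(xs[n - 1:])

# The tail suprema decrease to 1 and the tail infima increase to -1,
# so lim sup x_n = 1 and lim inf x_n = -1; by Theorem A.1.10 the
# sequence itself is therefore not convergent.
for n in (1, 10, 100, 1000):
    print(n, tail_sup(n), tail_inf(n))
```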
Lemma A.1.12. Let (xn ) and (yn ) be sequences in R∗ . Given that all the sums are well-
defined, the following holds:
lim supn→∞ (xn + yn ) ≤ lim supn→∞ xn + lim supn→∞ yn , (A.1)
lim inf n→∞ (xn + yn ) ≥ lim inf n→∞ xn + lim inf n→∞ yn . (A.2)
If (yn ) is convergent with finite limit y, it holds that
lim inf n→∞ (xn + yn ) = lim inf n→∞ xn + y, (A.3)
lim supn→∞ (xn + yn ) = lim supn→∞ xn + y. (A.4)
Furthermore,
lim supn→∞ (−xn ) = − lim inf n→∞ xn , (A.5)
lim inf n→∞ (−xn ) = − lim supn→∞ xn . (A.6)
If xn ≤ yn , it holds that
lim inf n→∞ xn ≤ lim inf n→∞ yn , (A.7)
lim supn→∞ xn ≤ lim supn→∞ yn . (A.8)

Proof. The relationships in (A.1) and (A.2) are proved in Lemma C.14 of Hansen (2009).
Considering (A.3), let ε > 0, let y be the limit of (yn ) and let m ≥ 1 be so large that
y − ε ≤ yn ≤ y + ε for n ≥ m. For such n, we then have xn + y − ε ≤ xn + yn ≤ xn + y + ε,
yielding lim inf n→∞ xn + y − ε ≤ lim inf n→∞ (xn + yn ) ≤ lim inf n→∞ xn + y + ε. As ε > 0
was arbitrary, this yields (A.3). By a similar argument, we obtain (A.4). Furthermore, (A.5)
and (A.6) follow from Lemma A.1.6. The relationships (A.7) and (A.8) are proved in Lemma
C.12 of Hansen (2009).
A.2 Measure theory and real analysis

In this section, we recall some of the main results from measure theory and real analysis
which will be needed in the following. We first recall some results from basic measure theory,
see Hansen (2009) for a general exposition.
We say that a pair (E, E), where E is some set and E is a σ-algebra on E, is a measurable
space. Also, if H is some set of subsets of E, we define σ(H) to be the smallest σ-algebra
containing H, meaning that σ(H) is the intersection of all σ-algebras on E containing H. For a
σ-algebra E on E and a family H of subsets of E, we say that H is a generating family for
E if E = σ(H). One particular example of this is the Borel σ-algebra BA on A ⊆ Rn , which
is the smallest σ-algebra on A containing all open sets in A. In particular, we denote by Bn
the Borel σ-algebra on Rn .
If it holds for all A, B ∈ H that A ∩ B ∈ H, we say that H is stable under finite intersections.
Also, if D is a family of subsets of E, we say that D is a Dynkin class if it satisfies the
following requirements: E ∈ D, if A, B ∈ D with A ⊆ B then B \ A ∈ D, and if (An ) ⊆ D
with An ⊆ An+1 for all n ≥ 1, then ∪_{n=1}^∞ An ∈ D. We have the following useful result.
Lemma A.2.2 (Dynkin’s lemma). Let D be a Dynkin class on E, and let H be a set of
subsets of E which is stable under finite intersections. If H ⊆ D, then σ(H) ⊆ D.
Proof. See Theorem 3.6 of Hansen (2009), or Theorem 4.1.2 of Ash (1972).
Definition A.2.3. Let (E, E) be a measurable space. We say that a function µ : E → [0, ∞]
is a measure if it holds that µ(∅) = 0 and that whenever (An ) ⊆ E is a sequence of pairwise
disjoint sets, µ(∪_{n=1}^∞ An ) = Σ_{n=1}^∞ µ(An ).
We say that a triple (E, E, µ) is a measure space. Also, if there exists an increasing sequence
of sets (En ) ⊆ E with E = ∪_{n=1}^∞ En and such that µ(En ) is finite, we say that µ is σ-finite
and refer to (E, E, µ) as a σ-finite measure space. If µ(E) is finite, we say that µ is finite, and
if µ(E) = 1, we say that µ is a probability measure. In the latter case, we refer to (E, E, µ)
as a probability space. An important application of Lemma A.2.2 is the following.
Theorem A.2.4 (Uniqueness theorem for probability measures). Let P and Q be two prob-
ability measures on (E, E). Let H be a generating family for E which is stable under finite
intersections. If P (A) = Q(A) for all A ∈ H, then P (A) = Q(A) for all A ∈ E.
Definition A.2.5. Let (E, E) and (F, F) be two measurable spaces. Let f : E → F be some
mapping. We say that f is E-F measurable if f −1 (A) ∈ E whenever A ∈ F.
For a family of mappings (fi )i∈I from E to Fi , where (Fi , Fi ) is some measurable space,
we may introduce σ((fi )i∈I ) as the smallest σ-algebra E on E such that all the fi are E-Fi
measurable. Formally, E is the σ-algebra generated by {(fi ∈ A) | i ∈ I, A ∈ Fi }. For
measurability with respect to such σ-algebras, we have the following very useful lemma.
Lemma A.2.6. Let E be a set, let (fi )i∈I be a family of mappings from E to Fi , where
(Fi , Fi ) is some measurable space, and let E = σ((fi )i∈I ). Let (H, H) be some other mea-
surable space, and let g : H → E. Then g is H-E measurable if and only if fi ◦ g is H-Fi
measurable for all i ∈ I.
Proof. See Lemma 4.14 of Hansen (2009) for a proof in the case of a single variable.
Theorem A.2.7 (The monotone convergence theorem). Let (E, E, µ) be a measure space,
and let (fn ) be a sequence of measurable mappings fn : E → [0, ∞]. Assume that the sequence
(fn ) is increasing µ-almost everywhere. Then
lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.
Lemma A.2.8 (Fatou's lemma). Let (E, E, µ) be a measure space, and let (fn ) be a sequence
of measurable mappings fn : E → [0, ∞]. It holds that
∫ lim inf_{n→∞} fn dµ ≤ lim inf_{n→∞} ∫ fn dµ.
Theorem A.2.9 (The dominated convergence theorem). Let (E, E, µ) be a measure space,
and let (fn ) be a sequence of measurable mappings from E to R. Assume that the sequence
(fn ) converges µ-almost everywhere to some mapping f . Assume that there exists a measur-
able, integrable mapping g : E → [0, ∞) such that |fn | ≤ g µ-almost everywhere for all n.
Then fn is integrable for all n ≥ 1, f is measurable and integrable, and
lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.
For the next result, recall that for two σ-finite measure spaces (E, E, µ) and (F, F, ν), E ⊗ F
denotes the σ-algebra on E × F generated by {A × B | A ∈ E, B ∈ F}, and µ ⊗ ν denotes
the unique σ-finite measure such that (µ ⊗ ν)(A × B) = µ(A)ν(B) for A ∈ E and B ∈ F, see
Chapter 9 of Hansen (2009).
Theorem A.2.10 (Tonelli’s theorem). Let (E, E, µ) and (F, F, ν) be two σ-finite measure
spaces, and assume that f is nonnegative and E ⊗ F measurable. Then
∫ f (x, y) d(µ ⊗ ν)(x, y) = ∫∫ f (x, y) dν(y) dµ(x).
Theorem A.2.11 (Fubini’s theorem). Let (E, E, µ) and (F, F, ν) be two σ-finite measure
spaces, and assume that f is E ⊗ F measurable and µ ⊗ ν integrable. Then y 7→ f (x, y) is
integrable with respect to ν for µ-almost all x, the set where this is the case is measurable,
and it holds that
∫ f (x, y) d(µ ⊗ ν)(x, y) = ∫∫ f (x, y) dν(y) dµ(x).
Theorem A.2.7, Lemma A.2.8 and Theorem A.2.9 are the three main tools for working with
integrals. Jensen's inequality, Theorem A.2.12, is frequently useful as well; it can in purely
probabilistic terms be stated as the result that f (EX) ≤ Ef (X) when f is convex.
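As a standard special case of this inequality, the convex function f (x) = x^2 yields the familiar bound relating the first and second moments:

```latex
% Jensen's inequality with the convex function f(x) = x^2:
f(EX) \leq Ef(X)
\quad \Longrightarrow \quad
(EX)^2 \leq EX^2,
% equivalently, the variance is nonnegative:
VX = EX^2 - (EX)^2 \geq 0.
```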
Also, for measurable spaces (E, E) and (F, F), µ a measure on (E, E) and t : E → F an E-F
measurable mapping, we define the image measure t(µ) as the measure on (F, F) given by
putting t(µ)(A) = µ(t−1 (A)) for A ∈ F. We then have the following theorem on successive
transformations.
Theorem A.2.13. Let (E, E), (F, F) and (G, G) be measurable spaces. Let µ be a measure
on (E, E). Let t : E → F and s : F → G be measurable. Then s(t(µ)) = (s ◦ t)(µ).
Theorem A.2.14. Let (E, E, µ) be a measure space and let (F, F) be some measurable
space. Let t : E → F be measurable, and let f : F → R be Borel measurable. Then f
is t(µ)-integrable if and only if f ◦ t is µ-integrable, and in the affirmative, it holds that
∫ f dt(µ) = ∫ f ◦ t dµ.
Definition A.2.15. Let (E, E, µ) be a measure space, and let p ≥ 1. By Lp (E, E, µ), we
denote the set of measurable mappings f : E → R such that ∫ |f |^p dµ is finite, and for such
f , we put ‖f ‖p = ( ∫ |f |^p dµ)^{1/p}. That Lp (E, E, µ)
is a vector space and that ‖·‖p is a seminorm on this space is a consequence of the Minkowski
inequality, see Theorem 2.4.7 of Ash (1972). We refer to Lp (E, E, µ) as an Lp -space. For
Lp -spaces, the following two main results hold.
Theorem A.2.16 (Hölder's inequality). Let p > 1 and let q be the dual exponent to p,
meaning that q > 1 is uniquely determined as the solution to the equation 1/p + 1/q = 1. If
f ∈ Lp (E, E, µ) and g ∈ Lq (E, E, µ), it holds that f g ∈ L1 (E, E, µ), and ‖f g‖1 ≤ ‖f ‖p ‖g‖q .
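The special case p = q = 2 of Hölder's inequality is the Cauchy–Schwarz inequality:

```latex
% Hoelder's inequality with the dual exponents p = q = 2
% (the Cauchy-Schwarz inequality):
\|fg\|_1 \leq \|f\|_2 \, \|g\|_2,
\qquad \text{i.e.} \qquad
\int |fg| \, d\mu
\leq \Big( \int |f|^2 \, d\mu \Big)^{1/2} \Big( \int |g|^2 \, d\mu \Big)^{1/2}.
```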
Following these results, we recall a simple lemma which we make use of in the proof of the
law of large numbers.
Lemma A.2.18. Let (xn ) be some sequence in R, and let x be some element of R. If
lim_{n→∞} xn = x, then lim_{n→∞} (1/n) Σ_{k=1}^n xk = x as well.
Also, we recall some properties of the integer part function. For any x ∈ R, we define
[x] = sup{n ∈ Z | n ≤ x}.
Lemma A.2.19. It holds that [x] is the unique integer such that [x] ≤ x < [x] + 1, or
equivalently, the unique integer such that x − 1 < [x] ≤ x.
Proof. We first show that [x] satisfies the bounds [x] ≤ x < [x] + 1. As x is an upper bound
for the set {n ∈ Z | n ≤ x}, and [x] is the least upper bound, we obtain [x] ≤ x. On the
other hand, as [x] is an upper bound for {n ∈ Z | n ≤ x}, [x] + 1 cannot be an element of
this set, yielding x < [x] + 1.
This shows that [x] satisfies the bounds given. Now assume that m is an integer satisfying
m ≤ x < m + 1, we claim that m = [x]. As m ≤ x, we obtain m ≤ [x]. And as x < m + 1,
m + 1 is not in {n ∈ Z | n ≤ x}. In particular, for all n ≤ x, n < m + 1. As [x] ≤ x, this
yields [x] < m + 1, and so [x] ≤ m. We conclude that m = [x], as desired.
Lemma A.2.20. For x ∈ R and n ∈ Z, it holds that [x + n] = [x] + n.

Proof. As [x] ≤ x < [x] + 1 is equivalent to [x] + n ≤ x + n < [x] + n + 1, the characterization
of Lemma A.2.19 yields the result.
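In computational terms, the integer part [x] defined above is the floor function rather than truncation towards zero; the two differ for negative non-integers. A small Python check of the characterizations above (the sample values are arbitrary choices of our own):

```python
import math

# [x] as defined above is math.floor: the unique integer with
# [x] <= x < [x] + 1. Truncation towards zero (int()) differs for
# negative non-integers: [-2.7] = -3, while int(-2.7) == -2.
for x_val in (2.7, -2.7, 3.0, -0.5):
    fl = math.floor(x_val)
    assert fl <= x_val < fl + 1      # the characterization of Lemma A.2.19
    assert x_val - 1 < fl <= x_val   # the equivalent characterization
    # translation invariance, as in Lemma A.2.20: [x + n] = [x] + n
    assert math.floor(x_val + 5) == fl + 5
    print(x_val, fl, int(x_val))
```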
Theorem A.2.21 (Taylor's formula). Let n ≥ 1, assume that f : R → R is n times
differentiable, and let x, y ∈ R. It then holds that
f (y) = Σ_{k=0}^{n−1} (f^{(k)} (x)/k!) (y − x)^k + (f^{(n)} (ξ(x, y))/n!) (y − x)^n ,
where f^{(k)} denotes the k'th derivative of f , with the convention that f^{(0)} = f , and ξ(x, y) is
some element on the line segment between x and y.
A.3 Existence of sequences of random variables

In this section, we state a result which yields the existence of particular types of sequences of
random variables.
Theorem A.3.1 (Kolmogorov’s consistency theorem). Let (Qn )n≥1 be a sequence of prob-
ability measures such that Qn is a probability measure on (Rn , Bn ). For each n ≥ 2, let
πn : Rn → Rn−1 denote the projection onto the first n − 1 coordinates. Assume that
πn (Qn ) = Qn−1 for all n ≥ 2. Then there exists a probability space (Ω, F, P ) and a sequence
of random variables (Xn )n≥1 on (Ω, F, P ) such that for all n ≥ 1, (X1 , . . . , Xn ) has distri-
bution Qn .
Proof. This follows from Theorem II.30.1 of Rogers & Williams (2000a).
Corollary A.3.2. Let (Qn )n≥1 be a sequence of probability measures on (R, B). There exists
a probability space (Ω, F, P ) and a sequence of independent random variables (Xn )n≥1 on
(Ω, F, P ) such that Xn has distribution Qn for all n ≥ 1.
Proof. This follows from applying Theorem A.3.1 with the sequence of probability measures
(Q1 ⊗ · · · ⊗ Qn )n≥1 .
From Corollary A.3.2, it follows for example that there exists a probability space (Ω, F, P )
and a sequence of independent random variables (Xn )n≥1 on (Ω, F, P ) such that Xn is
distributed on {0, 1} with P (Xn = 1) = pn , where (pn ) is some sequence in [0, 1]. Such sequences
are occasionally used as examples or counterexamples regarding certain propositions.
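Such a sequence is straightforward to realize concretely. The following Python sketch is our own illustration; the choice pn = 1/n, the seeded generator and all names are assumptions for the example, not part of the text.

```python
import random

# A concrete realization of a sequence of independent {0, 1}-valued
# variables with P(X_n = 1) = p_n, as guaranteed by Corollary A.3.2.
# Here p_n = 1/n is an illustrative choice; each call to rng.random()
# is an independent uniform draw on [0, 1), playing the role of the
# underlying probability space.

def sample_sequence(n_max, seed=0):
    rng = random.Random(seed)
    return [1 if rng.random() < 1 / n else 0 for n in range(1, n_max + 1)]

xs = sample_sequence(10)
print(xs)  # the first entry is always 1, since p_1 = 1
```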
A.4 Exercises
Exercise A.2. Let y ∈ R. Define A = {x ∈ Q | x < y}. Find sup A and inf A. ◦
Exercise A.3. Let (E, E, µ) be a measure space, and let (fn ) be a sequence of measurable
mappings fn : E → [0, ∞). Assume that there is g : E → [0, ∞) such that fn ≤ g for all n ≥ 1,
where g is integrable with respect to µ. Show that lim supn→∞ ∫ fn dµ ≤ ∫ lim supn→∞ fn dµ. ◦
Appendix B

Hints for exercises
Hints for exercise 1.2. Consider the probability space (Ω, F, P ) = ([0, 1], B[0,1] , λ), where λ
denotes the Lebesgue measure on [0, 1]. As a counterexample, consider variables defined as
Xn (ω) = nλ(An )−1 1An for an appropriate sequence of intervals (An ). ◦
Hints for exercise 1.4. Show that the sequence (EXn )n≥1 diverges and use this to obtain the
result. ◦
Hints for exercise 1.5. Show that for any ω with P ({ω}) > 0, Xn (ω) converges to X(ω).
Obtain the desired result by noting that {ω | P ({ω}) > 0} is an almost sure set. ◦
Hints for exercise 1.6. Show that P (|Xn − X| ≥ ε) ≤ P (|Xn 1Fk − X1Fk | ≥ ε) + P (Fkc ), and
use this to obtain the result. ◦
Hints for exercise 1.7. To prove that limn→∞ P (|Xn − X| ≥ εk ) = 0 for all k ≥ 1 implies
that Xn converges to X in probability, take ε > 0 and pick k such that 0 ≤ εk ≤ ε. ◦
Hints for exercise 1.8. Using that the sequence (supk≥n |Xk − X| > ε)n≥1 is decreasing, show
that limn→∞ P (supk≥n |Xk − X| > ε) = P (∩_{n=1}^∞ ∪_{k=n}^∞ (|Xk − X| > ε)). Use this to prove the
result. ◦
Hints for exercise 1.9. To obtain that d is a pseudometric, use that x 7→ x(1 + x)−1 is
increasing on [0, ∞). To show that convergence of Xn to X in probability implies convergence
in d, prove that for all ε > 0, it holds that d(Xn , X) ≤ P (|Xn − X| > ε) + ε/(1 + ε). In order
to obtain the converse, apply Lemma 1.2.7. ◦
Hints for exercise 1.10. Choose cn as a positive number such that P (|Xn | ≥ ncn ) ≤ 1/2^n . Use
Lemma 1.2.11 to show that this choice yields the desired result. ◦
Hints for exercise 1.11. To prove the first claim, apply Lemma 1.2.7 with p = 4. To prove
the second claim, apply Lemma 1.2.12. ◦
Hints for exercise 1.12. Use Lemma 1.2.13 and Fatou’s lemma to show that E|X|p is finite.
Apply Hölder’s inequality to obtain convergence in Lq for 1 ≤ q < p. ◦
Hints for exercise 1.14. Apply Lemma 1.2.12 and Lemma 1.2.7. ◦
Hints for exercise 1.15. Use Lemma 1.2.13, also recalling that all sequences in R which are
monotone and bounded are convergent. ◦
Hints for exercise 1.16. First argue that (|Xn+1 − Xn | ≤ εn evt.) is an almost sure set. Use
this to show that almost surely, for n > m large enough, |Xn − Xm | ≤ Σ_{k=m}^∞ εk . Using that
Σ_{k=m}^∞ εk tends to zero as m tends to infinity, conclude that (Xn ) is almost surely Cauchy. ◦
Hints for exercise 1.17. To prove almost sure convergence, calculate an explicit expression for
P (|Xn − 1| ≥ ε) and apply Lemma 1.2.12. To prove convergence in Lp , apply the dominated
convergence theorem. ◦
Hints for exercise 1.19. Use Lemma 1.3.12 to prove the contrapositive of the desired impli-
cation. ◦
Hints for exercise 1.20. To calculate P (Xn / log n > c i.o.), use the properties of the
exponential distribution to obtain an explicit expression for P (Xn / log n > c), then apply Lemma
1.3.12. To prove lim supn→∞ Xn / log n = 1 almost surely, note that for all c > 0, it holds
that lim supn→∞ Xn / log n ≤ c when Xn / log n ≤ c eventually and lim supn→∞ Xn / log n ≥ c
when Xn / log n > c infinitely often. ◦
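As a numerical aside to this hint: if the Xn are standard exponentials (rate 1, an assumption about the exercise's setup), the explicit expression is P (Xn / log n > c) = e^{−c log n} = n^{−c}, so the Borel–Cantelli series behaves like a p-series, divergent for c ≤ 1 and convergent for c > 1. A quick check:

```python
import math

# For a rate-one exponential variable X, P(X > c log n) = e^(-c log n)
# = n^(-c). The series sum_n P(X_n / log n > c) relevant for the
# Borel-Cantelli argument therefore behaves like the p-series
# sum_n n^(-c): divergent for c <= 1, convergent for c > 1.

def tail_prob(n, c):
    return math.exp(-c * math.log(n))

# Check the identity against n^(-c) directly.
for n in (2, 10, 100):
    for c in (0.5, 1.0, 2.0):
        assert abs(tail_prob(n, c) - n ** (-c)) < 1e-12

# Partial sums: growing without bound for c = 1, levelling off for c = 2.
for c in (1.0, 2.0):
    s = sum(tail_prob(n, c) for n in range(2, 100_000))
    print(c, round(s, 4))
```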
Hints for exercise 1.21. Use that the sequence (∪_{k=n}^∞ (Xk ∈ B))n≥1 is decreasing to obtain that
(Xn ∈ B i.o.) is in J . Take complements to obtain the result on (Xn ∈ B evt.). ◦
(lim_{n→∞} Σ_{k=1}^n an−k+1 Xk ∈ B) = (lim_{n→∞} Σ_{k=m}^n an−k+1 Xk ∈ B),
Hints for exercise 1.23. For the result on convergence in probability, work directly from the
definition of convergence in probability and consider 0 < ε < 1 in this definition. For the
result on almost sure convergence, note that Xn converges to zero if and only if Xn is zero
eventually, and apply Lemma 1.3.12. ◦
Hints for exercise 1.25. Use Theorem 1.3.10 to show that Σ_{k=1}^n ak Xk either is almost surely
divergent or almost surely convergent. To obtain the sufficient criterion for convergence,
apply Theorem 1.4.2. ◦
Hints for exercise 1.26. Let (Xn ) be a sequence of independent random variables concen-
trated on {0, n} with P (Xn = n) = pn . Use Lemma 1.3.12 to choose (pn ) so as to obtain the
result. ◦
Hints for exercise 1.28. Use Lemma 1.3.12 to conclude that P (|Xn | > n i.o.) = 0 if and
only if Σ_{n=1}^∞ P (|X1 | > n) is finite. Apply the monotone convergence theorem and Tonelli's
theorem to conclude that the latter is the case if and only if E|X1 | is finite. ◦
Hints for exercise 1.29. Apply Exercise 1.28 to show that E|X1 | is finite. Apply Theorem
1.5.3 to show that EX1 = c. ◦
Hints for exercise 2.1. To show that T is measure preserving, find simple explicit expressions
for T (x) for 0 ≤ x < 1/2 and 1/2 ≤ x < 1, respectively, and use this to show the relationship
P (T −1 ([0, α))) = P ([0, α)) for 0 ≤ α ≤ 1. Apply Lemma 2.2.1 to obtain that T is P -measure
preserving. To show that S is measure preserving, first show that it suffices to consider the
case where 0 ≤ λ < 1. Fix 0 ≤ α ≤ 1. Prove that for α ≥ µ, S −1 ([0, α)) = [0, α−µ)∪[1−µ, 1),
and for α < µ, S −1 ([0, α)) = [1 − µ, 1 − µ + α). Use this and Lemma 2.2.1 to obtain the
result. ◦
Hints for exercise 2.2. Apply Lemma 2.2.1. To do so, find a simple explicit expression
for T (x) when 1/(n + 1) < x ≤ 1/n, and use this to calculate, for 0 ≤ α < 1, T −1 ([0, α)) and
subsequently P (T −1 ([0, α))). ◦
Hints for exercise 2.3. Assume, expecting a contradiction, that P is a probability measure
such that T is measure preserving for P . Show that this implies P ({0}) = 0 and that
P ((1/2^n , 2/2^n ]) = 0 for all n ≥ 1. Use this to obtain the desired contradiction. ◦
Hints for exercise 2.4. Let λ = n/m for n ∈ Z and m ∈ N. Show that T^m (x) = x in this
case. Fix 0 ≤ α ≤ 1 and put Fα = ∪_{k=0}^{m−1} T^{−k} ([0, α]) and show that for α small and positive,
Fα is a set in the T -invariant σ-algebra which has a measure not equal to zero or one. ◦
Hints for exercise 2.5. Prove that ∫ X − X ◦ T dP = 0 and use this to obtain the result. ◦
Hints for exercise 2.6. Show that IT ⊆ IT 2 and use this to prove the result. ◦
Hints for exercise 2.7. Consider a space Ω containing only two points. ◦
Hints for exercise 2.8. For part two, note that ∪_{k=n}^∞ T^{−k} (F ) ⊆ ∪_{k=0}^∞ T^{−k} (F ) and use that T
is measure preserving. For part three, use that F ⊆ ∪_{k=0}^∞ T^{−k} (F ). For part four, use that
F = (F ∩ (T^k ∈ F^c evt.)) ∪ (F ∩ (T^k ∈ F^c evt.)^c ). ◦
Hints for exercise 2.9. To show that the criterion is sufficient for T to be ergodic, use
Theorem 2.2.3. For the converse implication, assume that T is ergodic and use Theorem
2.2.3 to argue that the result holds when X and Y are indicators for sets in F. Consider
X = 1G and Y nonnegative and bounded and use linearity and approximation with simple
functions to obtain that the criterion also holds in this case. Use a similar argument to
obtain the criterion for general X and Y such that X is nonnegative and integrable and Y
is nonnegative and bounded. Use linearity to obtain the final extension to X integrable and
Y bounded. ◦
Hints for exercise 2.10. First use Lemma 2.2.6 to argue that it suffices to show that for
α, β ∈ [0, 1), limn→∞ P ([0, β) ∩ T^{−n} ([0, α))) = P ([0, β))P ([0, α)). To do so, first show that
T^n (x) = 2^n x − [2^n x] and use this to obtain a simple explicit expression for T^{−n} ([0, α)). Use
this to prove the desired result. ◦
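The identity T^n (x) = 2^n x − [2^n x] says that iterating the doubling map T (x) = 2x mod 1 (this form of T is inferred from the hint) keeps the fractional part of 2^n x. A short empirical sketch of our own, checking the identity and the measure-preservation property on simulated uniform points:

```python
import random

# The doubling map T(x) = 2x mod 1; iterating gives
# T^n(x) = 2^n x - [2^n x], the fractional part of 2^n x.

def T(x):
    return (2 * x) % 1

# Check the iteration identity for a few starting points.
for x0 in (0.1, 0.3, 0.7):
    y = x0
    for n in range(1, 6):
        y = T(y)
        assert abs(y - (2 ** n * x0) % 1) < 1e-9

# Empirical measure preservation: for uniform x, the frequency of
# T(x) landing in [0, alpha) should be close to alpha.
rng = random.Random(0)
alpha = 0.3
samples = [rng.random() for _ in range(100_000)]
freq = sum(1 for x in samples if T(x) < alpha) / len(samples)
print(freq)  # close to alpha = 0.3
```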
Hints for exercise 2.11. For part one, use that the family {F1 × F2 | F1 ∈ F1 , F2 ∈ F2 } is
a generating family for F1 ⊗ F2 which is stable under finite intersections and apply Lemma
2.2.1. For part two, show that whenever F1 is T1 -invariant and F2 is T2 -invariant, F1 × F2 is
T -invariant, and use this to obtain the desired result. For part three, use that for F1 ∈ F1
and F2 ∈ F2 , it holds that P1 (F1 ) = P (F1 × Ω2 ) and P2 (F2 ) = P (Ω1 × F2 ). For part four,
use Lemma 2.2.6. ◦
Hints for exercise 2.12. Let A = (X̂n ∈ B i.o.) and note that (Xn ∈ B i.o.) = X −1 (A).
Show that A is θ-invariant to obtain the result. ◦
Hints for exercise 2.13. For B ∈ B∞ , express Z(P )(B) in terms of X(P )(B), Y (P )(B) and
p. Use this to obtain that θ is measure preserving for Z(P ). ◦
Hints for exercise 2.14. Assume that (Xn ) is stationary. Using that θ is X(P )-measure
preserving, argue that all Xn have the same distribution and conclude that EXn = EXk for
all n, k ≥ 1. Using a similar argument, argue that for all 1 ≤ n ≤ k, (Xn , Xk ) has the same
distribution as (X1 , Xk−(n−1) ) and conclude that Cov(Xn , Xk ) = Cov(X1 , Xk−(n−1) ). Use
this to conclude that (Xn ) is weakly stationary. ◦
Hints for exercise 2.15. Use Exercise 2.14 to argue that if (Xn ) is stationary, it is also weakly
stationary. To obtain the converse implication, assume that (Xn ) is weakly stationary and argue
that for all n ≥ 1, (X2 , . . . , Xn+1 ) has the same distribution as (X1 , . . . , Xn ). Combine this with
the assumption that (Xn ) has Gaussian finite-dimensional distributions in order to obtain
stationarity. ◦
Hints for exercise 3.1. First assume that (θn ) converges with limit θ. In the case where
θ > 0, apply Lemma 3.1.9 to obtain weak convergence. In the case where θ = 0, prove weak
convergence directly by proving convergence of ∫ f dµn for f ∈ Cb (R).
Next, assume that (µn ) is weakly convergent. Use Lemma 3.1.6 to argue that (θn ) is bounded.
Assume that (θn ) is not convergent, and argue that there must exist two subsequences (θnk )
and (θmk ) with different limits θ and θ∗ . Use what was already shown and Lemma 3.1.5 to
obtain a contradiction. ◦
Hints for exercise 3.2. To obtain weak convergence when the probabilities converge, apply
Lemma 3.1.9. To obtain the converse implication, use Lemma 3.1.3 to construct for each k a
mapping in Cb (R) which takes the value 1 at k and takes the value zero on, say, (k − 1, k + 1)^c .
Use this mapping to obtain convergence of the probabilities. ◦
Hints for exercise 3.3. Apply Stirling's formula and the fact that limn→∞ (1 + x/n)^n = e^x
for all x ∈ R to prove that the densities fn converge pointwise to the density of the normal
distribution. Invoke Lemma 3.1.9 to obtain the desired result. ◦
Hints for exercise 3.4. Using Stirling's formula as well as the result that if (xn ) is a sequence
converging to x, then limn→∞ (1 + xn /n)^n = e^x , prove that the probability functions converge
pointwise. Apply Lemma 3.1.9 to obtain the result. ◦
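The limit used in this and the preceding hint is easy to check numerically. In the sketch below, the value x = −1.5 and the sequence xn = x + 1/n are our own illustrative choices:

```python
import math

# Numerical check of the limit (1 + x_n/n)^n -> e^x when x_n -> x.
x = -1.5

def term(n):
    x_n = x + 1 / n               # a sequence converging to x
    return (1 + x_n / n) ** n

# The terms approach e^x as n grows.
for n in (10, 1000, 100_000):
    print(n, term(n), math.exp(x))
```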
Hints for exercise 3.5. Apply Stirling’s formula and Lemma 3.1.9. ◦
Hints for exercise 3.6. Let Fn be the cumulative distribution function for µn . Using the
properties of cumulative distribution functions, show that for x ∈ R satisfying the inequalities
q(k/(n + 1)) < x < q((k + 1)/(n + 1)), with k ≤ n, it holds that |Fn (x) − F (x)| ≤ 2/(n + 1).
Also show that for x < q(1/(n + 1)) and x > q(n/(n + 1)), |Fn (x) − F (x)| ≤ 1/(n + 1). Then
apply Theorem 3.2.3 to obtain the result. ◦
Hints for exercise 3.7. First assume that (ξn ) and (σn ) converge to limits ξ and σ. In the
case where σ > 0, apply Lemma 3.1.9 to obtain weak convergence of µn . In the case where
σ = 0, use Theorem 3.2.3 to obtain weak convergence.
Next, assume that µn converges weakly. Use Lemma 3.1.6 to show that both (ξn ) and (σn )
are bounded. Then apply a proof by contradiction to show that ξn and σn both must be
convergent. ◦
Hints for exercise 3.8. Let ε > 0, and take n so large that |xn − x| ≤ ε. Use Lemma 3.2.1
and the monotonicity properties of cumulative distribution functions to obtain the set of
inequalities F (x − ε) ≤ lim inf n→∞ Fn (xn ) ≤ lim supn→∞ Fn (xn ) ≤ F (x + ε). Use this to
prove the desired result. ◦
Hints for exercise 3.9. Argue that with Fn denoting the cumulative distribution function for
µn , it holds that Fn (x) = 1 − (1 − 1/n)^{[nx]} , where [nx] denotes the integer part of nx. Use
l’Hôpital’s rule to prove pointwise convergence of Fn (x) as n tends to infinity, and invoke
Theorem 3.2.3 to conclude that the desired result holds. ◦
Hints for exercise 3.11. Use the Taylor expansion of the exponential function. ◦
Hints for exercise 3.12. Use independence of X and Y to express the characteristic function
of XY as an integral with respect to µ ⊗ ν. Apply Fubini’s theorem to obtain the result. ◦
Hints for exercise 3.13. Argue that XY and −ZW are independent and follow the same
distribution. Use Lemma 3.4.15 to express the characteristic function of XY − ZW
in terms of the characteristic function of XY . Apply Exercise 3.12 and Example 3.4.10 to
obtain a closed expression for this characteristic function. Recognizing this expression as the
characteristic function for the Laplace distribution, use Theorem 3.4.19 to obtain the desired
distributional result. ◦
Hints for exercise 3.14. Define a triangular array by putting
Xnk = (Xk − EXk )/√(V X1 + · · · + V Xn ).
Apply Theorem 3.5.6 to show that Σ_{k=1}^n Xnk converges to the standard normal distribution,
and conclude the desired result from this. ◦
Hints for exercise 3.15. Fix ε > 0. Use independence and Lemma 1.3.12 to conclude that
Σ_{n=1}^∞ P (|Xn | > ε) converges. To argue that Σ_{n=1}^∞ V Xn 1(|Xn |≤ε) converges, assume that the
series is divergent. Put Yn = Xn 1(|Xn |≤ε) and Sn = Σ_{k=1}^n Yk . Use Exercise 3.15 to argue that Sn
converges almost surely, while (Sn − ESn )/√(V Sn ) converges in distribution to the standard
normal distribution. Use Lemma 3.3.2 to conclude that ESn /√(V Sn ) converges in distribution
to the standard normal distribution. Obtain a contradiction from this. For the convergence
of the final series, apply Theorem 1.4.2. ◦
Hints for exercise 3.16. Use Theorem 3.5.3 to obtain that using the probability space
(Ω, F, Pλ ), X̄n is asymptotically normal. Fix a differentiable mapping f : R → R, and
use Theorem 3.6.3 to show that f (X̄n ) is asymptotically normal as well. Use the form of the
asymptotic parameters to obtain a requirement on f ′ for the result of the exercise to hold.
Identify a function f satisfying the requirements from this. ◦
Hints for exercise 3.17. In the case α > 1/2, use Lemma 1.2.7 to obtain the desired
convergence in probability. In the case α ≤ 1/2, note by Theorem 3.5.3 that (Sn − nξ)/n^{1/2}
converges in distribution to the standard normal distribution. Fix ε > 0 and use Lemma 3.1.3
to obtain a mapping g ∈ Cb (R) such that 1(ξ−2ε,ξ+2ε)c (x) ≤ g(x) ≤ 1[ξ−ε,ξ+ε]c (x). Use this
to prove that lim inf n→∞ P (|Sn − nξ|/nα ≥ ε) is positive, and conclude that (Sn − nξ)/nα
does not converge in probability to zero. ◦
Hints for exercise 3.18. First calculate EXn^2 and EXn^4 . Use Theorem 3.5.3 to obtain that X̄n
is asymptotically normal. Apply Theorem 3.6.3 to obtain that θ̂n is asymptotically normal
as well. ◦
Hints for exercise 3.19. Use Theorem 3.5.3 and Theorem 3.6.3 to obtain the results on X̄n
and X̄n^{−1} . In order to show that Yn converges in probability to 1/µ, calculate EYn and V Yn
and use Lemma 1.2.7 to prove the convergence. ◦
Hints for exercise 3.20. Use Theorem 3.5.3 to argue that X̄n is asymptotically normal. To
obtain the result on (Yn − θ)/√(4θ^2 /9n), define a triangular array (Xnk )n≥k≥1 by putting
Xnk = (6/(θn^{3/2} ))(kUk − kθ/2) and apply Theorem 3.5.7. ◦
Hints for exercise 4.1. Use that ν1 and ν2 are signed measures to show that αν1 +βν2 satisfies
(i) and (ii) in the definition. ◦
Hints for exercise 4.2. For µ ≪ τ : Use that τ (A) = 0 if and only if A = ∅. ◦
Hints for exercise 4.3. Define ν = ν1 + ν2 and argue that ν ≪ µ. Argue that there exist
measurable and µ-integrable functions f, f1 , f2 with ν(F ) = ∫_F f dµ, ν1 (F ) = ∫_F f1 dµ, and
ν2 (F ) = ∫_F f2 dµ for all F ∈ F. Show that
∫_F f dµ = ∫_F f1 dµ + ∫_F f2 dµ for all F ∈ F
and conclude the desired result. ◦
Hints for exercise 4.4. Argue that ν ≪ µ and let f = dν/dµ, h = dν/dπ, and g = dπ/dµ. Note
that π and g are a non-negative measure and density as known from Sand1. Show that
ν(F ) = ∫_F hg dµ and ν(F ) = ∫_F f dµ. ◦
Hints for exercise 4.5. Let f = dν/dµ and g = dµ/dν. Show that ν(F ) = ∫_F f g dν and
ν(F ) = ∫_F 1 dν. Conclude the desired result ν-a.e. Obtain the result µ-a.e. by symmetry. ◦
Hints for exercise 4.6. Find a disjoint sequence of sets (Fn ) such that µ(Fn ) < ∞ and
∪_n Fn = Ω. Define the measures µn (F ) = µ(F ∩ Fn ) and νn (F ) = ν(F ∩ Fn ) and show that
νn ≪ µn for all n. Let fn = dνn /dµn and fn = 0 on Fn^c (why is that OK?). Define f = Σ_{n=1}^∞ fn .
Show that
∫ |f | dµ = · · · = Σ_{n=1}^∞ ( ∫_{(fn >0)} fn dµ − ∫_{(fn <0)} fn dµ ) ≤ · · · ≤ 2 sup_{F ∈F} |ν(F )| < ∞
and that ν(F ) = ∫_F f dµ. ◦
Hints for exercise 4.7. Show that X is a conditional expectation of Y given D, and that Y
is a conditional expectation of X given D. ◦
Hints for exercise 4.8. Straightforward application of Theorem 4.2.6 (2), (5) and (7). ◦
Hints for exercise 4.9. "⇐" is trivial. For "⇒" show that EX^2 = EY^2 and use that X = Y
a.s. if and only if E[(X − Y )^2 ] = 0. Apply Theorem 4.2.6. ◦
Hints for exercise 4.11. Recall that x^+ = max{x, 0}. Show and use P (0 ≤ E(X^+ |D)) = 1
and P (E(X|D) ≤ E(X^+ |D)) = 1. ◦
Hints for exercise 4.12. Use that |x| = x^+ + x^− and (−x)^+ = x^− and Exercise 4.11. ◦
Hints for exercise 4.13. Show that E(X|D) = 1/2. Show and use that if D is countable, then
1D = 0 a.s. and 1Dc = 1 a.s. ◦
and use Dynkin’s lemma (Lemma A.2.2) to show that σ(G) ⊆ H (it is assumed that G ⊆ H).
◦
(1) Write E(Y |Z) = φ(Z) (!) so e.g. the left hand side equals E(φ(Z)1(Z∈B) 1(X∈C) ). Use
that Z ⊥⊥ X and (Y, Z) ⊥⊥ X.
(2) Use Exercise 4.15 and (1) to show that E(Y |Z) is a conditional expectation of Y given
σ(Z, X).
(1) Show that σ(Sn , Sn+1 , Sn+2 , . . .) = σ(Sn , Xn+1 , Xn+2 , . . .). Use Exercise 4.16 and that
the Xn –variables are independent.
(2) First show that (1/n)Sn = E(X1 |Sn ) by checking the conditions for being a conditional
expectation of X1 given Sn . For the proof of
∫_{(Sn ∈B)} (1/n)Sn dP = ∫_{(Sn ∈B)} X1 dP for all B ∈ B ,
use (and show) that 1(Sn ∈B) Xk and 1(Sn ∈B) X1 have the same distribution for all k =
1, . . . , n.
◦
Hints for exercise 5.5. For ⇒, let Fn = (τ = n). For ⇐, show and use that
(τ = m) = ∩_{n=1}^{m−1} Fn^c ∩ Fm . ◦
Hints for exercise 5.7. (1): For ⇐, let τ = n and k = n + 1. For ⇒, use Corollary 5.2.13. ◦
Hints for exercise 5.8. For (2): Use Exercise 5.4. For (3): Show E(S1 ) = 0 and use (2) in
Exercise 5.7. For (4): Use the Strong Law of Large Numbers to show that Sn → +∞ a.s. For (5):
Use monotone convergence of both sides in (3). ◦
Hints for exercise 5.9. First show that (Sn , Fn ) is a martingale with a suitable choice of (Fn ).
Then use the independence to show that ESn^2 ≤ Σ_{n=1}^∞ EXn^2 < ∞ for all n ∈ N. Finally
use the martingale convergence theorem (for the argument that supn ESn^+ < ∞, recall that
x^+ ≤ |x| ≤ x^2 + 1). ◦
Hints for exercise 5.10. For (2): See that Mn ≥ 0 and use Theorem 5.3.2. For (3): Use
Fatou's lemma. For (5): Use Exercise 5.4. For (7): Use Corollary 5.2.13 and the fact that
τ ∧ n and n are bounded stopping times. For (9): Use (7) + (8) + dominated convergence. For
(10): Let q = P (Sτ = b) and write EMτ = qr^b + (1 − q)r^a . ◦
Hints for exercise 5.11. For (1): Exploit the inequality (x − y)^2 ≥ 0. For the integrability, use
that 1Dn E(X|D) is bounded by n. For (2): use that both 1Dn and E(X|D) are D-measurable.
For (3): Use (1) and (2) to obtain that
E(1Dn (X^2 − E(X|D)^2 )|D) ≥ 0 a.s.
(1) Use Exercise 5.11 and that E(Xn+1 |Fn )^2 = Xn^2 a.s. by the martingale property.
(3) Use Corollary 5.2.13, since τ ∧ n and n are bounded stopping times.
(4) Write
EX^2_{τ ∧n} = ∫_{(max_{k=1,...,n} |Xk |≥ε)} X^2_{τ ∧n} dP + ∫_{(max_{k=1,...,n} |Xk |<ε)} X^2_{τ ∧n} dP
and use (and prove) that |Xτ ∧n | ≥ ε on the set (max_{k=1,...,n} |Xk | ≥ ε).
Hints for exercise 5.13. For (3): Show that An ∈ Fτ ∧n (recall the definition of Fτ ∧n ) and
use this to show
∫_{An} Yτ ∧n dP ≤ ∫_{An} E(Yn |Fτ ∧n ) dP = ∫_{An} Yn dP .
◦
Hints for exercise 5.14. (1): Use Theorem 5.4.5 and the fact that E|Xn − 0| = EXn .
(2): According to (1), the variables should have both positive and negative values. Use linear
combinations of indicator functions like 1[0,1/n) and 1[1/n,2/n) . ◦
Hints for exercise 5.15. For (1): Use (10) in Theorem 4.2.6 and the definition of conditional
expectations. For (2): First divide into the two situations |X| ≤ K and |X| > K, and
secondly use Markov’s inequality. For (3): Obtain that for all K ∈ N
lim sup_{x→∞} sup_{D∈G} ∫_{(|E(X|D)|>x)} |E(X|D)| dP ≤ ∫_{(|X|>K)} |X| dP .
(3) Show that EXτ ∧n → EXτ (use e.g. the remark before Definition 5.4.1) and that
EXτ ∧n = EX1 .
for all x > 0 and n ∈ N. Use dominated convergence to see that the right hand side
→ 0 as x → ∞.
(5) Write
Σ_{n=0}^∞ P (τ > n) = Σ_{n=0}^∞ Σ_{k=0}^∞ 1(k>n) P (τ = k)
(6) Define Y as the right hand side of (5.32) and expand E|Y |:
E|Y | = . . . = E|X1 | + Σ_{m=1}^∞ ∫_{(τ >m)} E(|Xm+1 − Xm | | Fm ) dP
Use E(|Xn+1 − Xn | | Fn ) ≤ B a.s. and (5) to obtain that E|Y | < ∞. Use (1)-(4).
(9) Use (6)-(8) to show that EZσ = 0. Furthermore realise that Zσ = Sσ − σξ.
Hints for exercise 5.17. For (2): Use that all Yn ≥ 0. For (3): Write Y = lim inf n→∞ Yn
and use Fatou's lemma. For (4): Write Yn = exp(Σ_{k=1}^n log(Xk )) and use the Strong Law
of Large Numbers to show that (1/n) Σ_{k=1}^n log(Xk ) → ξ < 0 a.s. For (5): Use that if Yn
converges to Z in L1 , then Yn converges to Z in probability, and conclude that Z = 0 a.s.
Realise that Yn does not converge to 0 in L1 (You might need that if Yn → Y and Yn → Z
in probability, then Y = Z a.s.). ◦
(1) Use the triangle inequality to obtain |E|Xn | − E|X|| ≤ E|Xn − X|. For the second
claim use Theorem 1.2.8.
(3) Define U_n = X_n − X and V_n = |X_n| + |X|. Showing X_n → X in L¹ will be equivalent (?) to showing lim sup_{n→∞} E|X_n − X| = 0. Argue and use that lim sup_{n→∞} |X_n − X| = 0 a.s. (such a subsequence will always exist). Find a subsequence (n_{ℓ_k}) of (n_ℓ) such that X_{n_{ℓ_k}} → X a.s. Conclude from (3) that E|X_{n_{ℓ_k}} − X| → 0 and use the uniqueness of this limit to derive lim sup_{n→∞} E|X_n − X| = 0.
(5) Use (4) and Theorem 5.4.5 to conclude that (Y_n) is uniformly integrable. Use (2) in Theorem 5.4.7 (you might need that if Y_n → Y in probability and Y_n → Z in probability, then Y = Z a.s.).
(6) For ⇐, do as in (5) and use furthermore Theorem 5.4.7 to conclude that Y closes. For ⇒, use Theorems 5.4.7 and 5.4.5 and (1).
(3) Use that if τ_1 = n then we have lost the first n − 1 games and won the n'th game, so that (like the example in the exercise)

X_1 = −1, X_2 = −1 − 2, . . . , X_{n−1} = −∑_{k=1}^{n−1} 2^{k−1}, X_n = −∑_{k=1}^{n−1} 2^{k−1} + 2^{n−1} = 1
(5) It may be useful to recall that (−X_n, F_n) is a submartingale, and (−X_{τ_k}, F_{τ_k}) is a supermartingale.
Use that all these differences are independent and identically distributed.
(3) Use (2). Choose a fixed m such that P (|Sm | < b − a) < 1 and let n → ∞. The second
statement is trivial.
(4) For the first result, use optional sampling for the martingale (Sn , Fn ) and e.g. the
two bounded stopping times 1 and τa,b ∧ n. For the second result, let n → ∞ using
dominated (since Sτa,b ∧n is bounded) and monotone convergence.
(6) Apply the arguments from (4) to the martingale (Sn2 −n, Fn ). For the second statement,
use that the distribution of Sτa,b is well–known from (5).
(11) See that ES_{τ_b} ≠ ES_1 and compare with Theorem 5.4.9.
(12) Write

(sup_{n∈N} S_n = ∞) = ⋂_{n=1}^∞ (τ_n < ∞)
(9) Use the almost sure convergence from (6) and Kronecker’s lemma.
(11) Use the result from (2) with Zm = Ym for all m ≥ 0. Use (10) to see that the
assumptions are fulfilled.
and then show that E_{D_1} = σ(D_2) using Lemma A.2.2. Note that you already have D_2 ⊆ E_{D_1} ⊆ σ(D_2).

E_A = {F ∈ F(D_1) : P(F ∩ A) = P(F)P(A)} .
Hints for exercise 6.6. Show that the finite–dimensional distributions are the same. ◦
(1) Find H > 0 such that for all γ > 0 and all 0 ≤ t_1 < · · · < t_n

(X_{γt_1}, . . . , X_{γt_n}) =^D (γ^H X_{t_1}, . . . , γ^H X_{t_n}) .

(4) You need to show that for some t ≥ 0 and t_n → t, then for all ε > 0
Use (2).
(3) For the first result use Markov's inequality and that EU⁴ = 3 if U ∼ N(0, 1).
Hints for exercise A.1. To obtain the supremum, use that weak inequalities are preserved by limits. ◦
Hints for exercise A.2. To obtain the supremum, use Lemma A.1.3 and the fact that Q is
dense in R. ◦
Hints for exercise A.3. Apply Fatou’s lemma to the sequence (g − fn )n≥1 . ◦
SEMESTER - II
PROBABILITY & STATISTICS
mca-5 230
1 PROBABILITY
INTRODUCTION TO PROBABILITY
Probability
Sets and Subsets
The lesson introduces the important topic of sets, a simple
idea that recurs throughout the study of probability and statistics.
Set Definitions
A set is a well-defined collection of objects.
Each object in a set is called an element of the set.
Two sets are equal if they have exactly the same elements
in them.
A set that contains no elements is called a null set or an
empty set.
If every element in Set A is also in Set B, then Set A is a
subset of Set B.
Set Notation
A set is usually denoted by a capital letter, such as A, B, or
C.
An element of a set is usually denoted by a small letter, such
as x, y, or z.
A set may be described by listing all of its elements enclosed
in braces. For example, if Set A consists of the numbers 2,
4, 6, and 8, we may say: A = {2, 4, 6, 8}.
The null set is denoted by ∅.
Sample Problems
1. Describe the set of vowels.
If A is the set of vowels, then A = {a, e, i, o, u}.
3. Set A = {1, 2, 3} and Set B = {3, 2, 1}. Is Set A equal to Set B?
Yes. Two sets are equal if they have the same elements.
The order in which the elements are listed does not matter.
4. What is the set of men with four arms?
Since all men have two arms at most, the set of men with
four arms contains no elements. It is the null set (or empty
set).
5. Set A = {1, 2, 3} and Set B = {1, 2, 4, 5, 6}. Is Set A a subset
of Set B?
No. The element 3 is in Set A but not in Set B, so Set A is not a
subset of Set B.
Statistical Experiments
All statistical experiments have three things in common:
The experiment can have more than one possible outcome.
Each possible outcome can be specified in advance.
The outcome of the experiment depends on chance.
Sample Problems
1. Suppose I roll a die. Is that a statistical experiment?
Yes. The roll has more than one possible outcome, each possible
outcome (1 through 6) can be specified in advance, and the
outcome depends on chance.
A. {1}
B. {2, 4}
C. {2, 4, 6}
D. All of the above
Two events are mutually exclusive if they have no sample points in
common. Events A and B are mutually exclusive, and Events A and
C are mutually exclusive, since they have no points in common.
Events B and C have common sample points, so they are not
mutually exclusive.
6. Suppose you roll a die two times. Is each roll of the die an
independent event?
Yes. The outcome of the first roll gives no information about the
outcome of the second roll, so the two rolls are independent.
Basic Probability
The probability of a sample point is a measure of the likelihood
that the sample point will occur.
Probability of a Sample Point
By convention, statisticians have agreed on the following rules.
The probability of any sample point can range from 0 to 1.
The sum of probabilities of all sample points in a sample
space is equal to 1.
Example 1
Suppose we conduct a simple statistical experiment. We flip a coin
one time. The coin flip can have one of two outcomes - heads or
tails. Together, these outcomes represent the sample space of our
experiment. Individually, each outcome represents a sample point
in the sample space. What is the probability of each sample point?
Solution: The sum of probabilities of all the sample points must
equal 1. And the probability of getting a head is equal to the
probability of getting a tail. Therefore, the probability of each
sample point (heads or tails) must be equal to 1/2.
Example 2
Let's repeat the experiment of Example 1, with a die instead of a
coin. If we toss a fair die, what is the probability of each sample
point?
Solution: The sample space consists of six sample points:
{1, 2, 3, 4, 5, 6}. Each sample point is equally likely, and the
probabilities must sum to 1, so the probability of each sample
point is 1/6.
Probability of an Event
The probability of an event is a measure of the likelihood that the
event will occur. By convention, statisticians have agreed on the
following rules.
The probability of any event can range from 0 to 1.
The probability of event A is the sum of the probabilities of all
the sample points in event A.
The probability of event A is denoted by P(A).
Thus, if event A were very unlikely to occur, then P(A) would be
close to 0. And if event A were very likely to occur, then P(A) would
be close to 1.
Example 1
Suppose we draw a card from a deck of playing cards. What is the
probability that we draw a spade?
Solution: The sample space consists of 52 equally likely cards, 13
of which are spades. Therefore, the probability of drawing a spade
is 13/52 = 1/4.
Example 2
Suppose a coin is flipped 3 times. What is the probability of getting
two tails and one head?
Solution: For this experiment, the sample space consists of 8
sample points.
S = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}
Each sample point is equally likely to occur, so the probability of
getting any particular sample point is 1/8. The event "getting two
tails and one head" consists of the following subset of the sample
space.
A = {TTH, THT, HTT}
The probability of Event A is the sum of the probabilities of the
sample points in A. Therefore,
P(A) = 1/8 + 1/8 + 1/8 = 3/8
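The count above can be verified by brute-force enumeration. The following Python sketch (illustrative, not part of the original lesson) lists all eight equally likely outcomes and counts those in Event A:

```python
from itertools import product

# Enumerate all 2^3 equally likely outcomes of three coin flips.
sample_space = list(product("HT", repeat=3))
assert len(sample_space) == 8

# Event A: exactly two tails and one head.
event_a = [s for s in sample_space if s.count("T") == 2]

p_a = len(event_a) / len(sample_space)
print(p_a)  # 0.375, i.e. 3/8
```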
she might find that the probability that a visitor makes a purchase
gets closer and closer to 0.20.
Solution
The correct answer is (D). If you toss a coin three times, there are a
total of eight possible outcomes. They are: HHH, HHT, HTH, THH,
HTT, THT, TTH, and TTT. Of the eight possible outcomes, three
have exactly one head. They are: HTT, THT, and TTH. Therefore,
the probability that three flips of a coin will produce exactly one
head is 3/8 or 0.375.
Rules of Probability
Often, we want to compute the probability of an event from the
known probabilities of other events. This lesson covers some
important rules that simplify those computations.
Rule of Subtraction
In a previous lesson, we learned two important properties of
probability:
The probability of an event ranges from 0 to 1.
The sum of probabilities of all possible events equals 1.
The rule of subtraction follows directly from these properties: the
probability that event A will occur is equal to 1 minus the
probability that event A will not occur, P(A) = 1 − P(A').
Rule of Multiplication
The rule of multiplication applies to the situation when we want to
know the probability of the intersection of two events; that is, we
want to know the probability that two events (Event A and Event B)
both occur.
Example
An urn contains 6 red marbles and 4 black marbles. Two marbles
are drawn without replacement from the urn. What is the probability
that both of the marbles are black?
Solution: Let A = the event that the first marble is black; and let B =
the event that the second marble is black. We know the following:
In the beginning, there are 10 marbles in the urn, 4 of which
are black. Therefore, P(A) = 4/10.
After the first selection, there are 9 marbles in the urn, 3 of
which are black. Therefore, P(B|A) = 3/9.
Therefore, based on the rule of multiplication:
P(A ∩ B) = P(A) P(B|A) = (4/10)(3/9) = 12/90 = 2/15.
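The same probability can be obtained by enumerating all equally likely ordered draws. A short Python sketch (illustrative only; the marble labels are our own):

```python
from itertools import permutations

# An urn with 6 red and 4 black marbles, indexed 0..9.
urn = ["R"] * 6 + ["B"] * 4

# Every ordered draw of two distinct marbles is equally likely.
draws = list(permutations(range(10), 2))
both_black = sum(1 for i, j in draws if urn[i] == "B" and urn[j] == "B")

p = both_black / len(draws)
print(p)  # 12/90 = 2/15 ≈ 0.1333
```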
Rule of Addition
The rule of addition applies to the following situation. We have two
events, and we want to know the probability that either event
occurs.
Example
A student goes to the library. The probability that she checks out (a)
a work of fiction is 0.40, (b) a work of non-fiction is 0.30, and (c)
both fiction and non-fiction is 0.20. What is the probability that the
student checks out a work of fiction, non-fiction, or both?
Solution: Let F = the event that the student checks out fiction; and
let N = the event that the student checks out non-fiction. Then,
based on the rule of addition:
P(F U N) = P(F) + P(N) - P(F ∩ N)
P(F U N) = 0.40 + 0.30 - 0.20 = 0.50
Solution
The correct answer is A. Let A = the event that the first marble is
black; and let B = the event that the second marble is black. We
know the following:
In the beginning, there are 10 marbles in the urn, 4 of which
are black. Therefore, P(A) = 4/10.
After the first selection, we replace the selected marble; so
there are still 10 marbles in the urn, 4 of which are black.
Therefore, P(B|A) = 4/10.
Problem 2
A card is drawn randomly from a deck of ordinary playing cards.
You win $10 if the card is a spade or an ace. What is the probability
that you will win the game?
(A) 1/13
(B) 13/52
(C) 4/13
(D) 17/52
(E) None of the above.
Solution
The correct answer is C. Let S = the event that the card is a spade;
and let A = the event that the card is an ace. We know the
following:
There are 52 cards in the deck.
There are 13 spades, so P(S) = 13/52.
There are 4 aces, so P(A) = 4/52.
There is 1 ace that is also a spade, so P(S ∩ A) = 1/52.
Therefore, based on the rule of addition:
P(S U A) = P(S) + P(A) - P(S ∩ A)
P(S U A) = 13/52 + 4/52 - 1/52 = 16/52 = 4/13
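The count of 16 favourable cards can be confirmed by enumerating the deck in code. A Python sketch (illustrative; the rank and suit labels are our own):

```python
# Build a 52-card deck and count the cards that are spades or aces.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(r, s) for r in ranks for s in suits]

favourable = [(r, s) for r, s in deck if s == "spades" or r == "A"]
p = len(favourable) / len(deck)
print(len(favourable), round(p, 4))  # 16 0.3077
```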
P( Ak | B ) = P( Ak ∩ B ) / [ P( A1 ∩ B ) + P( A2 ∩ B ) + . . . + P( An ∩ B ) ]

P( Ak | B ) = P( Ak ) P( B | Ak ) / [ P( A1 ) P( B | A1 ) + P( A2 ) P( B | A2 ) + . . . + P( An ) P( B | An ) ]
Sample Problem
Bayes' theorem can be best understood through an example. This
section presents an example that demonstrates how Bayes'
theorem can be applied effectively to solve statistical problems.
Example 1
Marie is getting married tomorrow, at an outdoor ceremony in the
desert. In recent years, it has rained only 5 days each year.
Unfortunately, the weatherman has predicted rain for tomorrow.
When it actually rains, the weatherman correctly forecasts rain 90%
of the time. When it doesn't rain, he incorrectly forecasts rain 10%
of the time. What is the probability that it will rain on the day of
Marie's wedding?
P( A1 | B ) = P( A1 ) P( B | A1 ) / [ P( A1 ) P( B | A1 ) + P( A2 ) P( B | A2 ) ]
P( A1 | B ) = (0.014)(0.9) / [ (0.014)(0.9) + (0.986)(0.1) ]
P( A1 | B ) = 0.111
Note the somewhat unintuitive result. When the weatherman
predicts rain, it actually rains only about 11% of the time. Despite
the weatherman's gloomy prediction, there is a good chance that
Marie will not get rained on at her wedding.
This is an example of something called the false positive paradox. It
illustrates the value of using Bayes' theorem to calculate conditional
probabilities.
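The arithmetic in Marie's example is easy to reproduce. The following Python sketch (illustrative, not part of the original text) carries the prior of 5 rainy days out of 365 through Bayes' theorem:

```python
# Prior: it rains on about 5 of 365 days.
p_rain = 5 / 365              # P(A1) ≈ 0.0137
p_dry = 1 - p_rain            # P(A2)
p_forecast_given_rain = 0.9   # P(B | A1): rain forecast when it rains
p_forecast_given_dry = 0.1    # P(B | A2): rain forecast when it is dry

# Bayes' theorem:
# P(A1 | B) = P(A1) P(B|A1) / [P(A1) P(B|A1) + P(A2) P(B|A2)]
num = p_rain * p_forecast_given_rain
p_rain_given_forecast = num / (num + p_dry * p_forecast_given_dry)
print(round(p_rain_given_forecast, 3))  # 0.111
```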
Probability
For an experiment we define an event to be any collection of
possible outcomes.
A simple event is an event that consists of exactly one outcome.
or: means the union i.e. either can occur
and: means intersection i.e. both must occur
Example
Below is an example of two sets, A and B, graphed in a Venn
diagram.
The green area represents A and B while all areas with color
represent A or B
Example
Our Women's Volleyball team is recruiting for new members.
Suppose that a person inquires about the team.
P(E|F) = P(E and F) / P(F)
Example
Consider rolling two dice. Let
E be the event that the first die is a 3.
F be the event that the sum of the dice is an 8.
Then E and F means that we rolled a three and then we rolled a 5.
This probability is 1/36 since there are 36 possible pairs and only
one of them is (3,5).
We have
P(E) = 1/6
And note that (2,6),(3,5),(4,4),(5,3), and (6,2) give F
Hence
P(F) = 5/36
We have
P(E) P(F) = (1/6) (5/36)
which is not 1/36.
We can conclude that E and F are not independent.
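The independence check can be carried out mechanically over all 36 outcomes. A Python sketch (illustrative only, not part of the original text):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

E = [o for o in outcomes if o[0] == 3]    # first die shows 3
F = [o for o in outcomes if sum(o) == 8]  # the sum is 8
EF = [o for o in E if o in F]             # both events occur

p_e, p_f, p_ef = len(E) / 36, len(F) / 36, len(EF) / 36
print(p_ef == p_e * p_f)  # False: E and F are not independent
```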
Exercise
Test the following two events for independence:
E the event that the first die is a 1.
F the event that the sum is a 7.
A Counting Rule
For two events, E and F, we always have
P(E or F) = P(E) + P(F) - P(E and F)
Example
Find the probability of selecting either a heart or a face card from a
52 card deck.
Solution
We let
E = the event that a heart is selected
F = the event that a face card is selected
then
P(E) = 1/4 and P(F) = 3/13 (Jack, Queen, or King
out of 13 choices)
P(E and F) = 3/52
The formula gives
P(E or F) = 1/4 + 3/13 - 3/52 = 22/52 ≈ 42%
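Enumerating the deck confirms the count of 22 favourable cards. A Python sketch (illustrative; the labels are our own):

```python
# Count cards that are hearts, face cards (J, Q, K), or both.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "spades", "diamonds", "clubs"]
deck = [(r, s) for r in ranks for s in suits]

hit = [(r, s) for r, s in deck if s == "hearts" or r in ("J", "Q", "K")]
print(len(hit), round(len(hit) / 52, 3))  # 22 0.423
```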
Example
A native flowering plant has several varieties. The color of the
flower can be red, yellow, or white. The stems can be long or short
and the leaves can be thorny, smooth, or velvety. Show all
varieties.
Solution
We use a tree diagram. A tree diagram is a diagram that branches
out and ends in leaves that correspond to the final variety. The
picture below shows this.
Outcome
An outcome is the result of an experiment or other situation
involving uncertainty.
The set of all possible outcomes of a probability experiment is
called a sample space.
Sample Space
The sample space is an exhaustive list of all the possible outcomes
of an experiment. Each possible result of such a study is
represented by one and only one point in the sample space, which
is usually denoted by S.
Examples
Experiment Rolling a die once:
Sample space S = {1,2,3,4,5,6}
Experiment Tossing a coin:
Sample space S = {Heads,Tails}
Experiment Measuring the height (cms) of a girl on her first day at
school:
Sample space S = the set of all possible real numbers
Event
An event is any collection of outcomes of an experiment.
Formally, any subset of the sample space is an event.
Any event which consists of a single outcome in the sample space
is called an elementary or simple event. Events which consist of
more than one outcome are called compound events.
Set theory is used to represent relationships among events. In
general, if A and B are two events in the sample space S, then
A ∪ B (A union B) = 'either A or B occurs or both occur'
A ∩ B (A intersection B) = 'both A and B occur'
A ⊆ B (A is a subset of B) = 'if A occurs, so does B'
A' or Ā = 'event A does not occur'
∅ (the empty set) = an impossible event
S (the sample space) = an event that is certain to occur
Example
Experiment: rolling a die once -
Sample space S = {1,2,3,4,5,6}
Events A = 'score < 4' = {1,2,3}
B = 'score is even' = {2,4,6}
C = 'score is 7' = ∅
A ∪ B = 'the score is < 4 or even or both' = {1,2,3,4,6}
A ∩ B = 'the score is < 4 and even' = {2}
A' = 'event A does not occur' = {4,5,6}
Relative Frequency
Relative frequency is another term for proportion; it is the value
calculated by dividing the number of times an event occurs by the
total number of times an experiment is carried out. The probability
of an event can be thought of as its long-run relative frequency
when the experiment is carried out many times.
If an experiment is repeated n times, and event E occurs r times,
then the relative frequency of the event E is defined to be
rfn(E) = r/n
Example
Experiment: Tossing a fair coin 50 times (n = 50)
Event E = 'heads'
Result: 30 heads, 20 tails, so r = 30 and
rf50(E) = 30/50 = 0.6
The probability of the event 'heads' is approximated by the relative
frequency: P(E) ≈ rfn(E) for large n.
For example, in the above experiment, the relative frequency of the
event 'heads' will settle down to a value of approximately 0.5 if the
experiment is repeated many more times.
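This settling-down of relative frequencies can be watched in a quick simulation. The Python sketch below (illustrative only) tosses a simulated fair coin for increasing n:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# Relative frequency of heads after n tosses of a fair coin.
for n in (50, 1000, 100000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # drifts toward 0.5 as n grows
```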
Probability
Subjective Probability
A subjective probability describes an individual's personal
judgement about how likely a particular event is to occur. It is not
based on any precise computation but is often a reasonable
assessment by a knowledgeable person.
Example
A Rangers supporter might say, "I believe that Rangers have
probability of 0.9 of winning the Scottish Premier Division this year
since they have been playing really well."
Independent Events
Two events are independent if the occurrence of one of the events
gives us no information about whether or not the other event will
occur; that is, the events have no influence on each other.
Example
Suppose that a man and a woman each have a pack of 52 playing
cards. Each draws a card from his/her pack. Find the probability
that they each draw the ace of clubs.
We define the events:
A = the event that the man draws the ace of clubs; P(A) = 1/52
B = the event that the woman draws the ace of clubs; P(B) = 1/52
Clearly events A and B are independent, so:
P(A ∩ B) = P(A) P(B) = 1/52 × 1/52 = 0.00037
That is, there is a very small chance that the man and the woman
will both draw the ace of clubs.
Conditional Probability
The usual notation for "event A occurs given that event B has
occurred" is "A | B" (A given B). The symbol | is a vertical line and
does not imply division. P(A | B) denotes the probability that event
A will occur given that event B has occurred already.
where:
P(A | B) = the (conditional) probability that event A will occur
given that event B has occured already
= the (unconditional) probability that event A and
event B both occur
P(B) = the (unconditional) probability that event B occurs
Example:
When a fair die is tossed, the conditional probability of getting '1',
given that an odd number has been obtained, is equal to 1/3, as
explained below:
S = {1,2,3,4,5,6}; A = {1,3,5}; B = {1}; A ∩ B = {1}
P(B|A) = P(A ∩ B) / P(A) = (1/6) / (1/2) = 1/3
Example:
From a pack of cards, 2 cards are drawn in succession one after the
other. After every draw, the selected card is not replaced. What is
the probability that in both the draws you will get spades?
Solution:
Let A = getting a spade in the first draw
Let B = getting a spade in the second draw.
The cards are not replaced.
This situation requires the use of conditional probability.
P(A) = 13/52
P(B|A) = 12/51
Therefore, P(A and B) = P(A) P(B|A) = (13/52)(12/51) = 156/2652
= 1/17 ≈ 0.059.
The addition rule gives the probability that event A or event B (or
both) occurs:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∪ B) = probability that event A or event B occurs
P(A ∩ B) = probability that event A and event B both occur
For mutually exclusive events, that is events which cannot occur
together:
P(A ∩ B) = 0
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B)
For independent events, that is events which have no influence on
each other:
P(A ∩ B) = P(A) P(B)
Example
Suppose we wish to find the probability of drawing either a king or a
spade in a single draw from a pack of 52 playing cards.
We define the events A = 'draw a king' and B = 'draw a spade'
Since there are 4 kings in the pack and 13 spades, but 1 card is
both a king and a spade, we have:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 4/52 + 13/52 - 1/52 = 16/52
So, the probability of drawing either a king or a spade is 16/52 (=
4/13).
See also multiplication rule.
Multiplication Rule
The multiplication rule is a result used to determine the probability
that two events, A and B, both occur:
P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∩ B) = probability that event A and event B occur
P(A | B) = the conditional probability that event A occurs
given that event B has occurred already
P(B | A) = the conditional probability that event B occurs
given that event A has occurred already
Multiplication rule for independent events:
P(A ∩ B) = P(A) P(B)
That is, the probability of the joint events A and B is equal to the
product of the individual probabilities for the two events.
Example:
The probability that you will get an A grade in Quantitative Methods
is 0.7. The probability that you will get an A grade in Marketing is
0.5. Assuming these two courses are independent, compute the
probability that you will get an A grade in both these subjects.
Solution:
Let A = getting an A grade in Quantitative Methods
Let B = getting an A grade in Marketing
It is given that A and B are independent.
Applying the formula, we get P(A and B) = P(A)·P(B) = 0.7 × 0.5
= 0.35.
Conditional Probability
In many situations, once more information becomes available, we
are able to revise our estimates for the probability of further
outcomes or events happening. For example, suppose you go out
for lunch at the same place and time every Friday and you are
served lunch within 15 minutes with probability 0.9. However, given
that you notice that the restaurant is exceptionally busy, the
probability of being served lunch within 15 minutes may reduce to
0.7. This is the conditional probability of being served lunch within
15 minutes given that the restaurant is exceptionally busy.
The law of total probability expresses the probability of event A in
terms of event B and its complement B':
P(A) = P(A ∩ B) + P(A ∩ B')
where:
P(A) = probability that event A occurs
P(A ∩ B) = probability that event A and event B both occur
P(A ∩ B') = probability that event A and event B' both occur,
i.e. A occurs and B does not.
Using the multiplication rule, this can be expressed as
P(A) = P(A | B).P(B) + P(A | B').P(B')
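Applied to the earlier restaurant example, this looks as follows in Python. The probability that the restaurant is busy is a made-up number of our own; only the 0.7 and 0.9 service probabilities echo the text:

```python
# Hypothetical assumption: the restaurant is exceptionally busy
# on 30% of Fridays (this figure is not from the text).
p_busy = 0.30
p_served_given_busy = 0.70      # from the text: busy Friday
p_served_given_not_busy = 0.90  # from the text: ordinary Friday

# Law of total probability: P(A) = P(A|B) P(B) + P(A|B') P(B')
p_served = (p_served_given_busy * p_busy
            + p_served_given_not_busy * (1 - p_busy))
print(round(p_served, 2))  # 0.84
```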
Bayes' Theorem
Bayes' Theorem is a result that allows new information to be used
to update the conditional probability of an event.
Using the multiplication rule, P(A ∩ B) = P(A | B) P(B) =
P(B | A) P(A), gives Bayes' Theorem in its simplest form:
P(B | A) = P(A | B) P(B) / P(A)
Example 1:
The Monty Hall problem
We are presented with three doors - red, green, and blue - one of
which has a prize. We choose the red door, which is not opened
until the presenter performs an action. The presenter who knows
what door the prize is behind, and who must open a door, but is not
permitted to open the door we have picked or the door with the
prize, opens the green door and reveals that there is no prize
behind it and subsequently asks if we wish to change our mind
about our initial selection of red. What are the probabilities that the
prize is behind the red door and behind the blue door?
Let us call the situation that the prize is behind a given door Ar, Ag,
and Ab.
To start with, P(Ar) = P(Ag) = P(Ab) = 1/3, and to make things
simpler we shall assume that we have already picked the red door.
Let us call B "the presenter opens the green door". Without any
prior knowledge, we would assign this a probability of 50%.
In the situation where the prize is behind the red door, the
host is free to pick between the green or the blue door at
random. Thus, P(B | Ar) = 1 / 2
In the situation where the prize is behind the green door, the
host must pick the blue door. Thus, P(B | Ag) = 0
In the situation where the prize is behind the blue door, the
host must pick the green door. Thus, P(B | Ab) = 1
Thus,
P(Ar | B) = P(B | Ar) P(Ar) / P(B) = (1/2)(1/3) / (1/2) = 1/3
P(Ab | B) = P(B | Ab) P(Ab) / P(B) = (1)(1/3) / (1/2) = 2/3
Note how this depends on the value of P(B).
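The 1/3 versus 2/3 split can also be checked by simulation. The following Python sketch (illustrative, not part of the original text) plays the game many times, always starting with the red door and then switching:

```python
import random

random.seed(0)

# Simulate the Monty Hall game: we always pick the red door first.
doors = ["red", "green", "blue"]
trials, win_by_switching = 100000, 0

for _ in range(trials):
    prize = random.choice(doors)
    # Host opens a door that is neither our pick ("red") nor the prize.
    host_opens = random.choice([d for d in doors
                                if d != "red" and d != prize])
    # Switching means taking the remaining unopened door.
    switched_to = next(d for d in doors if d not in ("red", host_opens))
    if switched_to == prize:
        win_by_switching += 1

print(win_by_switching / trials)  # ≈ 2/3
```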
SOLVE:
1. In a software test environment holding software developed on the
J2EE specification, a downtime analysis was done. Based on the
100 earlier records it was found that there is about 5% downtime
per day. A study of the components involved in the environment
shows that 25% of the problems in WebSphere, 40% of the issues
in the operating system, and 20% of the problems in the network
led to a downtime. Given that there is a downtime, find the
probability that each of the above reasons could have contributed
to the downtime between themselves (considering just these 3
reasons).
Solutions:
Let the occurrence of a downtime be D
P(D) = 5% = 0.05
Let the occurrence of a WebSphere error be W
Summary:
This unit provides a conceptual framework on probability concepts
with examples. Specifically this unit is focused on:
The meaning and definition of the term probability, interwoven
with other associated terms - event, experiment and sample
space.
The three types of probability - classical probability, statistical
probability and subjective probability.
The concept of mutually exclusive events and independent
events.
The rules for calculating probability, which include the addition
rule for mutually exclusive and non-mutually exclusive events,
the multiplication rule for independent and dependent events,
and conditional probability.
The application of Bayes' theorem in management.
2. Two persons X and Y appear in an interview for two
vacancies in the same post. The probability of X's selection is
1/5 and that of Y's selection is 1/3. What is the probability
that: 1) both X and Y will be selected? 2) only one of them will
be selected? 3) none of them will be selected?
a) if the wife is watching television, the husbands are also watching
television.
b) the wives are watching television in prime time.
7)In the past several years, credit card companies have made an
aggressive effort to solicit new accounts from college students.
Suppose that a sample of 200 students at your college indicated
the following information as to whether the student possessed a
bank credit card and/or a travel and entertainment credit card.
2 PROBABILITY DISTRIBUTION
INTRODUCTION
LEARNING OBJECTIVES
RANDOM VARIABLE
Problem 1
Solution 1
Yes. Specifically it should be considered as a continuous random
variable as the power of any signal attenuates through a
transmission line. The attenuation factors associated with each
transmission line are only approximate. Thus the power received
from the antenna can take any value.
Measures of location
Location
A fundamental task in many statistical analyses is to estimate a
location parameter for the distribution; i.e., to find a typical or
central value that best describes the data.
Definition of location
1. mean - the mean is the sum of the data points divided by the
number of data points. That is, Ȳ = (Y1 + Y2 + ··· + YN) / N.
2. median - the median is the value of the point which has half
the data smaller than that point and half the data larger than
that point. That is, if X1, X2, ... ,XN is a random sample sorted
from smallest value to largest value, then the median is X(N+1)/2
if N is odd, and the average of XN/2 and XN/2+1 if N is even.
Normal distribution
Exponential distribution
The second histogram is a sample from an exponential distribution.
The mean is 1.001, the median is 0.684, and the mode is 0.254
(the mode is computed as the midpoint of the histogram interval
with the highest peak).
Cauchy distribution
The third histogram is a sample from a Cauchy distribution. The
mean is 3.70, the median is -0.016, and the mode is -0.362 (the
mode is computed as the midpoint of the histogram interval with the
highest peak).
For better visual comparison with the other data sets, we restricted
the histogram of the Cauchy distribution to values between -10 and
10. The full Cauchy data set in fact has a minimum of
approximately -29,000 and a maximum of approximately 89,000.
distribution of the original data. This means that for the Cauchy
distribution the mean is useless as a measure of the typical value.
For this histogram, the mean of 3.7 is well above the vast majority
of the data. This is caused by a few very extreme values in the tail.
However, the median does provide a useful measure for the typical
value.
Robustness
There are various alternatives to the mean and median for
measuring location. These alternatives were developed to address
non-normal data since the mean is an optimal estimator if in fact
your data are normal.
Definition of skewness
For univariate data Y1, Y2, ..., YN, the formula for skewness is:
skewness = [ ∑ (Yi − Ȳ)³ / N ] / s³
where Ȳ is the mean and s is the standard deviation (computed
with N in the denominator).
Definition of kurtosis
For univariate data Y1, Y2, ..., YN, the formula for kurtosis is:
kurtosis = [ ∑ (Yi − Ȳ)⁴ / N ] / s⁴
Examples
The following example shows histograms for 10,000 random
numbers generated from a normal, a double exponential, a Cauchy,
and a Weibull distribution.
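As a rough illustration of the moment formulas for skewness and kurtosis, the sketch below (not part of the original text; the data values are invented) computes both, with the mean and standard deviation taken over N:

```python
# Moment-based skewness and kurtosis (mean and standard deviation
# both computed with N in the denominator).
def moments(y):
    n = len(y)
    mean = sum(y) / n
    s = (sum((v - mean) ** 2 for v in y) / n) ** 0.5
    skew = sum((v - mean) ** 3 for v in y) / n / s ** 3
    kurt = sum((v - mean) ** 4 for v in y) / n / s ** 4
    return skew, kurt

skew, kurt = moments([2.0, 3.0, 3.0, 4.0, 10.0])  # right-skewed sample
print(skew > 0)  # True: a long right tail gives positive skewness
print(moments([1.0, -1.0, 1.0, -1.0])[1])  # 1.0 for a two-point sample
```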
Probability distribution:
Solution:
a) p(x>5) = p(6) + p(7) + p(8) = .08 + .04 + .03 = .15
b) p(x=0) = .20
c) p(x<4) = p(0) + p(1) + p(2) + p(3) = .20 + .18 + .16 + .12 = .66
1. Discrete Distributions
Discrete Densities
a. f(x) ≥ 0 for x in S.
b. ∑_{x in S} f(x) = 1
c. ∑_{x in A} f(x) = P(X ∈ A) for A ⊆ S.
Interpretation
Examples
2. Suppose that two fair dice are tossed and the sequence of
scores (X1, X2) recorded. Find the density function of
a. (X1, X2)
b. Y = X1 + X2, the sum of the scores
c. U = min{X1, X2}, the minimum score
d. V = max{X1, X2}, the maximum score
e. (U, V)
f(i1, i2, ..., in) = p^k (1 − p)^(n−k) for ij in {0, 1} for each j, where k = i1 +
i2 + ··· + in.
17. In the die-coin experiment, a fair die is rolled and then a fair
coin is tossed the number of times shown on the die. Let I denote
the sequence of coin results (0 for tails, 1 for heads). Find the
density of I (note that I takes values in a set of sequences of
varying lengths).
Constructing Densities
c = ∑_{x in S} g(x).
22. Let g(x, y) = xy for (x, y) ∈ {(1, 1), (1, 2), (1, 3), (2, 2), (2, 3),
(3, 3)}.
Conditional Densities
27. A pair of fair dice are rolled. Let Y denote the sum of the
scores and U the minimum score. Find the conditional density of U
given Y = 8.
28. Run the dice experiment 200 times, updating after every run.
Compute the empirical conditional density of U given Y = 8 and
compare with the conditional density in the last exercise.
31. In the die-coin experiment, a fair die is rolled and then a fair
coin is tossed the number of times showing on the die.
32. Run the die-coin experiment 200 times, updating after each
run.
33. Suppose that a bag contains 12 coins: 5 are fair, 4 are biased
with probability of heads 1/3; and 3 are two-headed. A coin is
chosen at random from the bag and tossed twice.
37. In the M&M data, let R denote the number of red candies and
N the total number of candies. Compute and graph the empirical
density of
a. R
b. N
c. R given N > 57.
a. G
b. S
c. (G, S)
d. G given W > 0.20 grams.
Introduction:
Probability Distribution:
First toss.......T T T T H H H H
Second toss......T T H H T T H H
Third toss.......T H T H T H T H

X      P(X)     X·P(X)
0      1/8      0.000
1      3/8      0.375
2      3/8      0.750
3      1/8      0.375
Total           1.5 = E(X)
Example:
X:      $1     $2     $5     $10     $15     $20
P(X):   0.1    0.2    0.3    0.2     0.15    0.05

(1)    (2)     (3)       (4)        (5)             (6)
X      P(X)    X·P(X)    X - mean   (X - mean)²     (5) × (2)
1      0.1     0.1       -6.25      39.06           3.906
2      0.2     0.4       -5.25      27.56           5.512
5      0.3     1.5       -2.25      5.06            1.518
10     0.2     2.0       2.75       7.56            1.512
15     0.15    2.25      7.75       60.06           9.009
20     0.05    1.0       12.75      162.56          8.128
Total          7.25 = E(X)                          29.585 = Var(X)

Thus, the expected value is $7.25, and the standard deviation is the
square root of 29.585, which is approximately $5.44. In other words, an
average donor is expected to donate $7.25, with a standard
deviation of about $5.44.
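The expected value and standard deviation of a discrete distribution like this one can be verified in a few lines. A minimal sketch (the tiny difference from the tabled variance comes from rounding the squared deviations in the table):

```python
values = [1, 2, 5, 10, 15, 20]
probs  = [0.1, 0.2, 0.3, 0.2, 0.15, 0.05]

# E(X) = sum of x * P(x)
mean = sum(x * p for x, p in zip(values, probs))

# Var(X) = sum of P(x) * (x - mean)^2; sd is its square root
var = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))
sd = var ** 0.5

print(mean)          # 7.25
print(round(sd, 2))  # 5.44
```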
Binomial Distribution:
the acceptable sample size for using the binomial distribution with
samples taken without replacement is n < 5% of N, where n is the
sample size and N is the size of the population. The birth of
children (male or female) and true-false or multiple-choice
questions (correct or incorrect answers) are some examples of the
binomial distribution.
Binomial Equation:
Example:
Example:
Example:
Solution:
a) p(x = 0)
= (0.9)^20 = .1216
b) p(x <= 3) = p(x = 0) + p(x = 1) + p(x = 2) + p(x = 3)
= .8671
c) p(x = 3) = .1901
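The printed answers are consistent with n = 20 trials and success probability p = 0.1; treating those as the assumed parameters (the problem statement itself is not reproduced here), a sketch that reproduces all three values:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.1   # assumed parameters, matching the printed answers
p0 = binom_pmf(0, n, p)                             # (0.9)**20
p_le3 = sum(binom_pmf(k, n, p) for k in range(4))   # P(X <= 3)
p3 = binom_pmf(3, n, p)                             # P(X = 3)

print(round(p0, 4), round(p_le3, 4), round(p3, 4))  # 0.1216 0.867 0.1901
```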
Binomial Distribution
Binomial Experiment
A binomial experiment (each trial of which is known as a Bernoulli
trial) is a statistical experiment that has the following properties:
Notation
The following notation is helpful, when we talk about binomial
probability.
Binomial Distribution
A binomial random variable is the number of successes x in n
repeated trials of a binomial experiment. The probability distribution
of a binomial random variable is called a binomial distribution (its
single-trial case, n = 1, is known as the Bernoulli distribution).
Suppose we flip a coin two times and count the number of heads
(successes). The binomial random variable is the number of heads,
which can take on values of 0, 1, or 2. The binomial distribution is
presented below.
Binomial Probability
The binomial probability refers to the probability that a binomial
experiment results in exactly x successes. For example, in the
above table, we see that the binomial probability of getting exactly
one head in two coin flips is 0.50.
Example 1
b(x < 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + ... +
b(x = 44; 100, 0.5) + b(x = 45; 100, 0.5)
Binomial Calculator
As you may have noticed, the binomial formula requires many time-
consuming computations. The Binomial Calculator can do this work
for you - quickly, easily, and error-free. Use the Binomial Calculator
to compute binomial probabilities and cumulative binomial
probabilities. The calculator is free. It can be found under the Stat
Tables menu item, which appears in the header of every Stat Trek
web page.
Example 1
b(x < 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + . . . +
b(x = 45; 100, 0.5)
b(x < 45; 100, 0.5) = 0.184
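The cumulative probability above can be reproduced by summing the binomial formula term by term:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Cumulative probability of 45 or fewer heads in 100 fair-coin flips
cum = sum(binom_pmf(k, 100, 0.5) for k in range(46))
print(round(cum, 3))  # 0.184
```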
Example 2
Example 3
What is the probability that the world series will last 4 games? 5
games? 6 games? 7 games? Assume that the teams are evenly
matched.
In the world series, there are two baseball teams. The series ends
when the winning team wins 4 games. Therefore, we define a
success as a win by the team that ultimately becomes the world
series champion.
For the purpose of this analysis, we assume that the teams are
evenly matched. Therefore, the probability that a particular team
wins a particular game is 0.5.
Let's look first at the simplest case. What is the probability that the
series lasts only 4 games. This can occur if one team wins the first
4 games. The probability of the National League team winning 4
games in a row is:
Now let's tackle the question of finding probability that the world
series ends in 5 games. The trick in finding this solution is to
recognize that the series can only end in 5 games, if one team has
won 3 out of the first 4 games. So let's first find the probability that
the American League team wins exactly 3 of the first 4 games.
Okay, here comes some more tricky stuff, so listen up. Given that
the American League team has won 3 of the first 4 games, the
American League team has a 50/50 chance of winning the fifth
game to end the series. Therefore, the probability of the American
League team winning the series in 5 games is 0.25 * 0.50 = 0.125.
Since the National League team could also win the series in 5
games, the probability that the series ends in 5 games would be
0.125 + 0.125 = 0.25.
The rest of the problem would be solved in the same way. You
should find that the probability of the series ending in 6 games is
0.3125; and the probability of the series ending in 7 games is also
0.3125.
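The series-length reasoning above generalizes: the series ends at game g when the eventual champion wins game g after winning exactly 3 of the previous g - 1 games. A sketch of the whole calculation:

```python
from math import comb

def series_length_prob(g, p=0.5):
    # One given team wins 3 of the first g-1 games, then wins game g;
    # doubling accounts for either team being the champion.
    one_team = comb(g - 1, 3) * p**3 * (1 - p)**(g - 1 - 3) * p
    return 2 * one_team

probs = {g: series_length_prob(g) for g in (4, 5, 6, 7)}
print(probs)  # {4: 0.125, 5: 0.25, 6: 0.3125, 7: 0.3125}
```

Note that the four probabilities sum to 1, as they must.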
While this is statistically correct in theory, over the years the actual
world series has turned out differently, with more series than
expected lasting 7 games. For an interesting discussion of why
world series reality differs from theory, see Ben Stein's explanation
of why 7-game world series are more common than expected.
Notation
The following notation is helpful, when we talk about negative
binomial probability.
x            P(x)
2            0.25
3            0.25
4            0.1875
5            0.125
6            0.078125
7 or more    0.109375
Geometric Distribution
The geometric distribution is a special case of the negative
binomial distribution. It deals with the number of trials required for a
single success. Thus, the geometric distribution is negative
binomial distribution where the number of successes (r) is equal to
1.
g(x; P) = P * Q^(x - 1)
Sample Problems
The problems below show how to apply your new-found knowledge
of the negative binomial distribution (see Example 1) and the
geometric distribution (see Example 2).
Example 1
b*(x; r, P) = (x-1)C(r-1) * P^r * Q^(x - r)
Thus, the probability that Bob will make his third successful free
throw on his fifth shot is 0.18522.
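The printed answer 0.18522 is consistent with r = 3 required successes in x = 5 trials with a free-throw probability of P = 0.7; treat P as an assumption, since the problem statement is not reproduced here. A sketch:

```python
from math import comb

def neg_binom_pmf(x, r, P):
    # b*(x; r, P) = (x-1)C(r-1) * P**r * Q**(x-r), with Q = 1 - P
    return comb(x - 1, r - 1) * P**r * (1 - P)**(x - r)

print(round(neg_binom_pmf(5, 3, 0.7), 5))  # 0.18522
```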
Example 2
Let's reconsider the above problem from Example 1. This time, we'll
ask a slightly different question: What is the probability that Bob
makes his first free throw on his fifth shot?
b*(x; r, P) = (x-1)C(r-1) * P^r * Q^(x - r)
g(x; P) = P * Q^(x - 1)
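With the same assumed free-throw probability P = 0.7, the geometric formula gives the chance that the first success comes on the fifth shot:

```python
def geom_pmf(x, P):
    # g(x; P) = P * Q**(x-1): first success occurs on trial x
    return P * (1 - P)**(x - 1)

print(round(geom_pmf(5, 0.7), 5))  # 0.00567
```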
Hypergeometric Distribution
Hypergeometric Experiments
A hypergeometric experiment is a statistical experiment that has
the following properties:
Note further that if you selected the marbles with replacement, the
probability of success would not change. It would be 5/10 on every
trial. Then, this would be a binomial experiment.
Notation
Hypergeometric Distribution
A hypergeometric random variable is the number of successes
that result from a hypergeometric experiment. The probability
distribution of a hypergeometric random variable is called a
hypergeometric distribution.
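Using the marble setting mentioned above (10 marbles, 5 of which count as successes), the hypergeometric probability of, say, exactly 2 successes in 3 draws without replacement can be sketched as follows; the 3-draw, 2-success numbers are illustrative assumptions:

```python
from math import comb

def hyper_pmf(x, N, K, n):
    # x successes in n draws, without replacement, from a
    # population of N items of which K are successes
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

p = hyper_pmf(2, N=10, K=5, n=3)
print(round(p, 4))  # 0.4167
```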
Example 1
Example 1
Multinomial Distribution
Multinomial Experiment
A multinomial experiment is a statistical experiment that has the
following properties:
Multinomial Distribution
A multinomial distribution is the probability distribution of the
outcomes from a multinomial experiment. The multinomial formula
defines the probability of any outcome from a multinomial
experiment.
where n = n1 + n2 + . . . + nk.
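The multinomial formula can be sketched directly from its standard form, n! / (n1! n2! ... nk!) times p1^n1 ... pk^nk; the coin example below is illustrative:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    # P = n! / (n1! * ... * nk!) * p1**n1 * ... * pk**nk
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, pr in zip(counts, probs):
        p *= pr**c
    return p

# e.g. 2 heads and 1 tail in three fair-coin tosses
print(multinomial_pmf([2, 1], [0.5, 0.5]))  # 0.375
```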
Example 2
Poisson Distribution
Note that the specified region could take many forms. For instance,
it could be a length, an area, a volume, a period of time, etc.
Notation
Poisson Distribution
A Poisson random variable is the number of successes that result
from a Poisson experiment. The probability distribution of a Poisson
random variable is called a Poisson distribution.
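The Poisson probability formula, P(x; μ) = e^(-μ) μ^x / x!, can be sketched in a few lines; the μ = 2, x = 3 example is illustrative:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    # P(x; mu) = e**(-mu) * mu**x / x!
    return exp(-mu) * mu**x / factorial(x)

print(round(poisson_pmf(3, 2), 4))  # 0.1804
```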
Example 1
Example 1
z = (X - μ) / σ
z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
-3.0  0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
...   ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
-1.4  0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0722 0.0708 0.0694 0.0681
-1.3  0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2  0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
...   ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
3.0   0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
Find P(Z > a). The probability that a standard normal random
variable (z) is greater than a given value (a) is easy to find.
The table shows the P(Z < a). The P(Z > a) = 1 - P(Z < a).
Find P(a < Z < b). The probability that a standard normal
random variable lies between two values is also easy to
find. The P(a < Z < b) = P(Z < b) - P(Z < a).
Transform raw data. Usually, the raw data are not in the form
of z-scores. They need to be transformed into z-scores,
using the transformation equation presented earlier: z = (X -
μ) / σ.
The problem in the next section demonstrates the use of the normal
distribution as a model for measurement.
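Instead of a printed table, the standard normal CDF can be computed from the error function in Python's math module. A minimal sketch:

```python
from math import erf, sqrt

def phi(z):
    # P(Z < z) for a standard normal random variable
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(-1.3), 4))             # 0.0968, matching the table entry
print(round(phi(1.0) - phi(-1.0), 4))  # P(-1 < Z < 1) = 0.6827
```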
(A) 0.10
(B) 0.18
(C) 0.50
(D) 0.82
(E) 0.90
Solution
Introduction:
Figure 1
Example:
Figure 2
The probability of the variable falling between any two points, such
as c and d in figure 2, is calculated as follows:
P(c <= x <= d) = (d - c)/(b - a)
In this example a = 150, b = 200, c = 150, and d = 160, therefore:
Mean = (a + b)/2 = (150 + 200)/2 = 175 millimeters, the standard
deviation is the square root of (b - a)²/12 = 208.3, which is equal to
14.43 millimeters, and P(150 <= x <= 160) = (160 - 150)/(200 - 150)
= 1/5. Thus, of all the sheets made by this machine, 20% of the
production must be scrapped.
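The uniform-distribution arithmetic above can be sketched directly:

```python
a, b = 150, 200   # sheet length uniform on [a, b] millimeters
c, d = 150, 160   # lengths that must be scrapped

mean = (a + b) / 2            # midpoint of the interval
var = (b - a) ** 2 / 12       # variance of a uniform distribution
sd = var ** 0.5
p_scrap = (d - c) / (b - a)   # probability of falling in [c, d]

print(mean, round(sd, 2), p_scrap)  # 175.0 14.43 0.2
```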
Figure 3
Note that integral calculus is used to find the area under the
normal distribution curve. However, this can be avoided by
transforming any normal distribution to fit the standard normal
distribution. This conversion is done by rescaling the normal
distribution axis from its true units (time, weight, dollars, etc.) to a
standard measure called a Z score or Z value. A Z score is the
number of standard deviations that a value, X, is away from the
mean. If the value of X is greater than the mean, the Z score is
positive; if the value of X is less than the mean, the Z score is
negative. The Z equation is as follows:
Example One:
Figure 4
Figure 5
Figure 6
Figure 7
In this problem, the two values fall on the same side of the mean.
The Z scores are: Z1 = (330 - 476)/107 = -1.36, and Z2 = (440 -
476)/107 = -0.34. The probability associated with Z = -1.36 is
0.4131, and the probability associated with Z = -0.34 is 0.1331. The
rule is that when we want to find the probability between two values
of X on one side of the mean, we just subtract the smaller area
from the larger area to get the probability between the two values.
Thus, the answer to this problem is: 0.4131 - 0.1331 = 0.28 or 28%.
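This subtraction of areas can be checked against the exact normal CDF, computed here from Python's error function rather than a table:

```python
from math import erf, sqrt

def phi(z):
    # P(Z < z) for a standard normal random variable
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 476, 107
z1 = (330 - mu) / sigma   # about -1.36
z2 = (440 - mu) / sigma   # about -0.34
p = phi(z2) - phi(z1)     # both values on the same side of the mean

print(round(p, 2))  # 0.28
```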
Example Two:
Figure 8
Refer to the standard normal distribution table and search the body
of the table for 0.45. Since the exact number is not found in the
table, search for the closest number to 0.45. There are two values
equidistant from 0.45: 0.4505 and 0.4495. Move to the left from
these values, and read the Z scores in the margin, which are: 1.65
and 1.64. Take the average of these two Z scores, i.e., (1.65 +
1.64)/2 = 1.645. Plug this number and the values of the mean and
the standard deviation into the Z equation, you get:
Z = (X - mean)/standard deviation, or -1.645 = (X - 47,900)/2,050,
which gives X = 44,528 miles.
Thus, the factory should set the guaranteed mileage at 44,528
miles if the objective is not to replace more than 5% of the tires.
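Solving the Z equation for X can be sketched as:

```python
mu, sigma = 47_900, 2_050
z = -1.645           # z cutting off the lowest 5% of a normal curve

x = mu + z * sigma   # guaranteed mileage
print(round(x))      # 44528
```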
Example:
Figure 9
Correction Factor:
Expectation value
Uniform distribution
a) He has to wait for less than 6 minutes if he arrives between 9:09
and 9:15 or between 9:24 and 9:30. The probability is therefore
(6 + 6)/30 = 2/5.
Solution:
Here n = 50 Bernoulli trials with p = .02.
Let X be the total number of nonconforming chips.
P(x > 2) = 1 - p(x <= 2)
= 1 - {p(x = 0) + p(x = 1) + p(x = 2)}
= 1 - [(.98)^50 + 50(.02)(.98)^49 + 1225(.02)^2(.98)^48]
= 1 - .922 = .078
Thus the probability that the process is stopped on any day, based
on this sampling plan, is approximately 0.078.
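The same number falls out of the binomial formula directly:

```python
from math import comb

n, p = 50, 0.02
# P(X <= 2), then take the complement for P(X > 2)
p_le2 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
p_stop = 1 - p_le2

print(round(p_stop, 3))  # 0.078
```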
Summary
This unit is extremely important from the point of view of many
fascinating aspects of statistical inference that would follow in the
subsequent units. Certainly, it is expected of you that you master
the nitty-gritty of this unit. This unit specifically focuses on
The definition, meaning and concepts of a probability
distribution.
The related terms: discrete random variable and continuous
random variable.
Discrete probability distribution and continuous probability
distribution.
The binomial distribution and its role in business problems.
The Poisson distribution and its uses
The normal distribution and its role in statistical inference.
The concept of the standard normal distribution and its role.
3
JOINT PROBABILITY
A joint probability table is a table in which all possible events
(or outcomes) for one variable are listed as row headings, all
possible events for a second variable are listed as column
headings, and the value entered in each cell of the table is the
probability of each joint occurrence. Often the probabilities in such
a table are based on observed frequencies of occurrence for the
various joint events, rather than being a priori in nature. The table
of joint occurrence frequencies which can serve as the basis for
constructing a joint probability table is called a contingency table.
Table 1 is a contingency table which describes 200 people who
entered a clothing store according to sex and age, while table 1b is
the associated joint probability table. The frequency reported in
each cell of the contingency table is converted into a probability
value by dividing by the total number of observations, in this case,
200.
Table 1. Contingency table

Age           Male   Female   Total
Under 30        60       50     110
30 and over     80       10      90
Total          140       60     200

Table 1b. Joint probability table

Age                    Male   Female   Marginal probability
Under 30               0.30     0.25   0.55
30 and over            0.40     0.05   0.45
Marginal probability   0.70     0.30   1.00
Table 2a. Contingency table for voter reactions to a new property
tax plan

Party affiliation   In favour   Neutral   Opposed   Total
Democratic (D)          120        20        20      160
Republican (R)           50        30        60      140
Independent (I)          50        10        40      100
Total                   220        60       120      400

See table 2b.
Table 2b. Joint probability table for voter reactions to a new
property tax plan

Party affiliation   In favour   Neutral   Opposed   Marginal probability
Democratic (D)         .30        .05       .05     .40
Republican (R)         .125       .075      .15     .35
Independent (I)        .125       .025      .10     .25
Total                  .55        .15       .30     1.00
d) p(I and F)
e) p(O/R), f) p(R/O)
g) p(R or D)
h) p(D or F)
Solution
a) = .30 (the marginal probability)
b) = .15 (the joint probability)
c) = .25 (the marginal probability)
Correlation, Regression
Introduction
At this point, you know the basics: how to look at data, compute
and interpret probabilities, draw a random sample, and do
statistical inference. Now it's a question of applying these concepts
to see the relationships hidden within the more complex situations
of real life. This unit shows you how statistics can summarize the
relationships between two factors based on a bivariate data set with
two columns of numbers. The correlation will tell you how strong
the relationship is, and regression will help you predict one factor
from the other.
Learning objectives
After reading this unit, you will be able to:
Define correlation coefficient with its properties
Calculate correlation coefficient and interpret
Appreciate the role of regression
Formulate the regression equation and use it for estimation and
prediction.
Correlation analysis
Mathematical properties
where x̄ and ȳ are the sample means of X and Y, sx and sy are the
sample standard deviations of X and Y, and the sum runs from i = 1
to n. As with the population correlation, we may rewrite this as
Caution: This method only works with centered data, i.e., data
which have been shifted by the sample mean so as to have an
average of zero. Some practitioners prefer an uncentered (non-
Pearson-compliant) correlation coefficient. See the example below
for a comparison.
By the usual procedure for finding the angle between two vectors
(see dot product), the uncentered correlation coefficient is:
as expected.
Correlation matrices
Removing correlation
Here is a simple example: hot weather may cause both crime and
ice-cream purchases. Therefore crime is correlated with ice-cream
purchases. But crime does not cause ice-cream purchases and ice-
cream purchases do not cause crime.
from math import sqrt

# One-pass (Welford-style) computation of the Pearson correlation
# for equal-length sequences x and y of N observations.
sum_sq_x = 0.0
sum_sq_y = 0.0
sum_coproduct = 0.0
mean_x = x[0]
mean_y = y[0]
for i in range(2, N + 1):
    sweep = (i - 1.0) / i
    delta_x = x[i - 1] - mean_x
    delta_y = y[i - 1] - mean_y
    sum_sq_x += delta_x * delta_x * sweep
    sum_sq_y += delta_y * delta_y * sweep
    sum_coproduct += delta_x * delta_y * sweep
    mean_x += delta_x / i
    mean_y += delta_y / i
pop_sd_x = sqrt(sum_sq_x / N)
pop_sd_y = sqrt(sum_sq_y / N)
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
Autocorrelation
Uses of correlation
Regression analysis
Applications
Correlation Coefficient
Coefficient of Determination, r² or R²:
Correlation Coefficient
r = 1 - (6 Σd²) / [n(n² - 1)]
Car   Age   Stopping distance
A       9   28.4
B      15   29.3
C      24   37.6
D      30   36.2
E      38   36.5
F      46   35.3
G      53   36.2
H      60   44.1
I      64   44.8
J      76   47.2
These figures form the basis for the scatter diagram, below, which
shows a reasonably strong positive correlation - the older the car,
the longer the stopping distance.
Car   Age   Distance   Age rank   Distance rank   d      d²
A       9   28.4          1          1             0      0
B      15   29.3          2          2             0      0
C      24   37.6          3          7            -4     16
D      30   36.2          4          4.5          -0.5    0.25
E      38   36.5          5          6            -1      1
F      46   35.3          6          3             3      9
G      53   36.2          7          4.5           2.5    6.25
H      60   44.1          8          8             0      0
I      64   44.8          9          9             0      0
J      76   47.2         10         10             0      0
Notice that the ranking is done here in such a way that the youngest
car and the best stopping performance are rated top and vice versa.
There is no strict rule here other than the need to be consistent in
your rankings. Notice also that there were two values the same in
terms of the stopping performance of the cars tested. They occupy
'tied ranks' and must share, in this case, ranks 4 and 5. This means
they are each ranked as 4.5, which is the mean average of the two
ranking places. It is important to remember that this works
regardless of the number of items sharing tied ranks. For instance, if five items
shared ranks 5, 6, 7, 8 and 9, then they would each be ranked 7 -
the mean of the tied ranks.
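Putting the ranking rules together, the Spearman coefficient for the car data can be sketched end to end (ages and stopping distances as in the table above):

```python
def avg_ranks(values):
    # Rank from 1; tied values share the mean of their rank places
    s = sorted(values)
    return [sum(i + 1 for i, v in enumerate(s) if v == x) / s.count(x)
            for x in values]

age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
dist = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

rx, ry = avg_ranks(age), avg_ranks(dist)
n = len(age)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rs = 1 - 6 * d2 / (n * (n**2 - 1))

print(d2, round(rs, 3))  # 32.5 0.803
```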
Note that the two extra columns introduced into the new table are
Column 6, 'd', the difference between the stopping distance rank
and the age rank, and Column 7, 'd²', the square of that difference.
What does this tell us? When interpreting the Spearman Rank
Correlation Coefficient, it is usually enough to say that:
x y x2 y2 xy
9 28.4 81 806.56 255.6
15 29.3 225 858.49 439.5
24 37.6 576 1413.76 902.4
Illustration:
When you have entered the information, select the three number
columns (do not include any cells with words in them). Go to the
Data Analysis option on the Tools menu, select from that Data
Analysis menu the item Correlation (note that if the Data Analysis
option is not on the Tools menu you have to add it in).
When you get the Correlation menu, enter in the first Input Range
box the column of cells containing the dependent variables you wish
to analyze (D3 to D12 if your spreadsheet looks like TimeWeb's).
Next, enter into the second input box the column of cells that contain
the independent variables (B3 to B12, again if your sheet resembles
TimeWeb's).
Then click the mouse pointer in the circle to the left of the Output
Range label (unless there is a black dot in it already), and click the
left mouse button in the Output Range box. Then enter the name of
cell where you want the top left corner of the correlation table to
appear (e.g., $A$14). Then click OK.
Expected Answer:
The correlation between the Entry Mark and the Final Mark is 0.23;
the correlation between the four week test and the Final Mark is
0.28. Thus, both of the tests have a positive correlation to the Final
(6 month) Test; the entry test has a slightly weaker positive
correlation with the Final Mark, than the Four Week Test. However,
both figures are so low, that the correlation is minimal. The skills
measured by the Entry test account for about 5 per cent of the skills
measured by the Six Month Mark. This figure is obtained by using
the R-Squared result and expressing it as a percentage.
Beware!
It's vital to remember that a correlation, even a very strong one, does
not mean we can make a conclusion about causation. If, for
example, we find a very high correlation between the weight of a
baby at birth and educational achievement at age 25, we may make
some predictions about the numbers of people staying on at
university to study for post-graduate qualifications. Or we may urge
mothers-to-be to take steps to boost the weight of the unborn baby,
because the heavier their baby the higher their baby's educational
potential, but we should be aware that the correlation, in itself, is no
proof of these assertions.
AIDS. This has caused many researchers to believe that HIV is the
principal cause of AIDS. This belief has led to most of the money for
AIDS research going into investigating HIV.
But the cause of AIDS is still not clear. Some people (especially, not
surprisingly, those suffering from AIDS) have argued vehemently
that investigating HIV instead of AIDS is a mistake. They say that
something else is the real cause. This is the area, they argue, that
requires greater research funding. More money should be going into
AIDS research rather than studies into HIV.
r = [ 1 / (n - 1) ] * Σ { [ (xi - x̄) / sx ] * [ (yi - ȳ) / sy ] }
Problem 1
(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III
Solution
The correct answer is (E). The correlation between car weight and
reliability is negative. This means that reliability tends to decrease
as car weight increases. The correlation between car weight and
maintenance cost is positive. This means that maintenance costs
tend to increase as car weight increases.
Least Squares
(see the vertical line in the figure) Since residuals can be positive or
negative, we will square them to remove the sign. By adding up all
of the squared residuals, we get a measure of how far away from
the data our line is. Thus, the "best" line will be one which has the
minimum sum of squared residuals, i.e., min Σ(yi - ŷi)². This method
of finding a line is called least squares.
The formulas for the slope and intercept of the least squares line
are
Prediction
Linear regression finds the straight line, called the least squares
regression line or LSRL, that best represents observations in a
bivariate data set. Suppose Y is a dependent variable, and X is an
independent variable. The population regression line is:
Y = Β0 + Β1X
ŷ = b0 + b1x
In the unlikely event that you find yourself on a desert island without
a computer or a graphing calculator, you can solve for b0 and b1 "by
hand". Here are the equations.
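The standard "by hand" formulas are b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1x̄. A sketch using the aptitude-test data from the regression problem later in this chapter:

```python
xs = [95, 85, 80, 70, 60]   # aptitude test scores
ys = [85, 95, 70, 65, 70]   # statistics grades

n = len(xs)
mean_x = sum(xs) / n        # 78
mean_y = sum(ys) / n        # 77

# Slope and intercept of the least squares line
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x

print(round(b1, 3), round(b0, 2))  # 0.644 26.78
```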
The least squares regression line is the only straight line that has
all of these properties.
Standard Error
Problem 1
Solution
Regression Equation
Problem Statement
Last year, five randomly selected students took a math aptitude test
before they began their statistics course. The Statistics Department
has three questions.
In the table below, the xi column shows scores on the aptitude test.
Similarly, the yi column shows statistics grades. The last two rows
show sums and mean scores that we will use to conduct the
regression analysis.
      xi    yi   (xi - x̄)   (yi - ȳ)   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
1     95    85      17          8         289          64           136
2     85    95       7         18          49         324           126
3     80    70       2         -7           4          49           -14
4     70    65      -8        -12          64         144            96
5     60    70     -18         -7         324          49           126
Sum   390   385                           730         630           470
Mean   78    77
Whenever you use a regression equation, you should ask how well
the equation fits the data. One way to assess fit is to check the
coefficient of determination, which can be computed from the
following formula.
σx = sqrt [ Σ ( xi - x̄ )² / N ] = sqrt( 730/5 ) = sqrt(146) = 12.083
σy = sqrt [ Σ ( yi - ȳ )² / N ] = sqrt( 630/5 ) = sqrt(126) = 11.225
Regression Line
Multiple Regression
Nonlinear Regression
Residual
Residuals
Stepwise Regression
Transformation to Linearity
Residuals
Both the sum and the mean of the residuals are equal to zero. That
is, Σe = 0 and ē = 0.
Residual Plots
Below, the table on the left summarizes regression results from
the example presented in a previous lesson, and the chart on
the right displays those results as a residual plot. The residual plot
shows a non-random pattern - negative residuals on the low end of
the X axis and positive residuals on the high end. This indicates
that a non-linear model will provide a much better fit to the data. Or
it may be possible to "transform" the data to allow us to use a linear
model. We discuss linear transformations in the next lesson.
x    95      85      80      70      60
y    85      95      70      65      70
ŷ    87.95   81.51   78.29   71.84   65.41
e    -2.95   13.49   -8.29   -6.84    4.59
Below, the residual plots show three typical patterns. The first plot
shows a random pattern, indicating a good fit for a linear model.
The other plot patterns are non-random (U-shaped and inverted U),
suggesting a better fit for a non-linear model.
Outliers
Data points that diverge from the overall pattern and have large
residuals are called outliers.
Outliers limit the fit of the regression equation to the data. This is
illustrated in the scatterplots below. The coefficient of determination
is bigger when the outlier is not present.
Influential Points
Influential points are data points with extreme values that greatly
affect the slope of the regression line.
The charts below compare regression statistics for a data set with
and without an influential point. The chart on the right has a single
influential point, located at the high end of the X axis (where x =
24). As a result of that single influential point, the slope of the
regression line increases dramatically, from -2.5 to -1.6.
Note that this influential point, unlike the outliers discussed above,
did not reduce the coefficient of determination. In fact, the
coefficient of determination was bigger when the influential point
was present.
I. When the sum of the residuals is greater than zero, the model
is nonlinear.
II. Outliers reduce the coefficient of determination.
III. Influential points reduce the correlation coefficient.
(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III
Solution
Method                       Transformation(s)              Regression equation   Predicted value (ŷ)
Standard linear regression   None                           y = b0 + b1x          ŷ = b0 + b1x
Exponential model            Dependent variable = log(y)    log(y) = b0 + b1x     ŷ = 10^(b0 + b1x)
Quadratic model              Dependent variable = sqrt(y)   sqrt(y) = b0 + b1x    ŷ = (b0 + b1x)²
Note: The logarithmic model and the power model require the
ability to work with logarithms. Use a graphic calculator to obtain
the log of a number or to transform back from the logarithm to the
original number. If you need it, the Stat Trek glossary has a brief
refresher on logarithms.
A Transformation Example
Below, the table on the left shows data for independent and
dependent variables - x and y, respectively. When we apply a linear
regression to the raw data, the residual plot shows a non-random
pattern (a U-shaped curve), which suggests that the data are
nonlinear.
x   1   2   3   4    5    6    7    8    9
y   2   1   6   14   15   30   40   74   75

After the square-root transformation, y' = sqrt(y):

x    1      2      3      4      5      6      7      8      9
y'   1.41   1.00   2.45   3.74   3.87   5.48   6.32   8.60   8.66
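The transformation-and-refit procedure can be sketched end to end: take square roots of y, fit a straight line to (x, sqrt(y)), and square the fitted values to predict on the original scale. The fitted coefficients below are not quoted from the text; they simply fall out of the least-squares formulas.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 1, 6, 14, 15, 30, 40, 74, 75]

ty = [sqrt(y) for y in ys]   # transform the dependent variable

# Least-squares fit of ty against xs
n = len(xs)
mx, my = sum(xs) / n, sum(ty) / n
b1 = (sum((x - t_mean_diff) * 0 for x, t_mean_diff in [])  # placeholder removed
      if False else
      sum((x - mx) * (t - my) for x, t in zip(xs, ty))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx

# Back-transform: ŷ = (b0 + b1*x)²
preds = [(b0 + b1 * x) ** 2 for x in xs]
```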
Problem
(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III
Solution
ŷ = b0 + b1x
Estimation Requirements
Problem 1
What is the 99% confidence interval for the slope of the regression
line?
Solution
Y = Β0 + Β1X
Test Requirements
The test procedure consists of four steps: (1) state the hypotheses,
(2) formulate an analysis plan, (3) analyze sample data, and (4)
interpret results.
H0: Β1 = 0
Ha: Β1 ≠ 0
The null hypothesis states that the slope is equal to zero, and the
alternative hypothesis states that the slope is not equal to zero.
Using sample data, find the standard error of the slope, the slope of
the regression line, the degrees of freedom, the test statistic, and
the P-value associated with the test statistic. The approach
described in this section is illustrated in the sample problem at the
end of this lesson.
DF = n - 2
t = b1 / SE
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the
researcher rejects the null hypothesis. Typically, this involves
comparing the P-value to the significance level, and rejecting the
null hypothesis when the P-value is less than the significance level.
Problem
Solution
The solution to this problem takes four steps: (1) state the
hypotheses, (2) formulate an analysis plan, (3) analyze sample
data, and (4) interpret results. We work through those steps below:
We get the slope (b1) and the standard error (SE) from the
regression output.
b1 = 0.55 SE = 0.24
DF = n - 2 = 101 - 2 = 99
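The test statistic for this example can be sketched as:

```python
b1 = 0.55   # estimated slope from the regression output
se = 0.24   # standard error of the slope
n = 101

df = n - 2    # degrees of freedom
t = b1 / se   # test statistic

print(df, round(t, 2))  # 99 2.29
```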
Note: If you use this approach on an exam, you may also want to
mention that this approach is only appropriate when the standard
requirements for simple linear regression are satisfied.
In regression output, the standard error of the slope is often
labelled "SE Coeff", but other software packages might use a
different label, such as "StDev", "SE", or "Std Dev".
Coefficient of Determination
R²adj = 1 - (1 - R²)(N - 1)/(N - n - 1)
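Assuming N denotes the number of observations and n the number of predictors (the usual reading of this formula), a sketch:

```python
def adjusted_r2(r2, N, n):
    # N observations, n predictors
    return 1 - (1 - r2) * (N - 1) / (N - n - 1)

print(adjusted_r2(0.5, 12, 3))  # 0.3125
```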
Either the nature of the data, or the regression results, may suggest
further questions. For example, you may want to obtain means and
standard deviations or histograms of variables to check on their
distributions; or plot one variable against another, or obtain a matrix
of correlations, to check on first order relationships. Minitab does
some checking for you automatically, and reports if it finds "unusual
observations". If there are unusual observations, PLOT or
HISTOGRAM may tell you what the possible problems are. The
usual kinds of unusual observations are "outliers": points which lie
far from the main distributions or the main trends of one or more
variables.
The above brief paragraph does not exhaust what you can
say about a set of regression results. There may be features
of the data you should look at: "Unusual observations", for
example. Normally you will need to go on to discuss the
meaning of the trends you have described.
Always report what happened before moving on to its
significance: so R2adj values before F values, regression
coefficients before t values. Remember, descriptive statistics
are more important than significance tests.
Although Minitab will give a negative t value if the
corresponding regression coefficient is negative, you should
drop the negative sign when reporting the results.
Degrees of freedom for both F and t values must be given.
Usually they are written as subscripts. For F the numerator
degrees of freedom are given first. You can also put degrees
of freedom in parentheses, or report them explicitly, e.g.:
"F(3,12) = 4.32" or "F = 4.32, d. of f. = 3, 12".
Significance levels can either be reported exactly (e.g. p =
0.032) or in terms of conventional levels (e.g. p < 0.05).
There are arguments in favour of either, so it doesn't much
matter which you do. But you should be consistent in any
one report.
Beware of highly significant F or t values, whose significance
levels will be reported by statistics packages as, for
example, 0.0000. It is an act of statistical illiteracy to write p
= 0.0000; significance levels can never be exactly zero: there
is always some probability that the observed data could arise
if the null hypothesis were true. What the package means is
that this probability is so low it can't be represented with the
number of columns available. We should write it as p <
0.00005.
Beware of spurious precision, i.e. reporting coefficients etc
to huge numbers of significant figures when, on the basis
of the sample you have, you couldn't possibly expect them to
replicate to anything like that degree of precision if someone
repeated the study. F and t values are conventionally
reported to two decimal places, and R2adj values to the
nearest percentage point (sometimes to one additional
decimal place). For coefficients, you should be guided by the
sample size: with a sample size of 16, as in the example
used above, two significant figures is plenty, but even with
more realistic samples, in the range of 100 to 1000, three
significant figures is usually as far as you should go. This
means that you will usually have to round off the numbers
that Minitab will give you.
SOURCE DF SS MS F p
Regression 3 4065.4 1355.1 4.32 0.028
Error 12 3760.0 313.3
Total 15 7825.4
SOURCE DF SEQ SS
income 1 3940.5
m0f1 1 55.5
age 1 69.4
Continue? y
Unusual Observations
Obs. income depress Fit Stdev.Fit Residual St.Resid
7 730 12.00 -6.10 15.57 18.10 2.15RX
LINEAR REGRESSION
1. Introduction
o very often when 2 (or more) variables are observed,
relationship between them can be visualized
o predictions are always required in economics or
physical science from existing and historical data
o regression analysis is used to help formulate these
predictions and relationships
o linear regression is a special kind of regression
analysis in which 2 variables are studied and a
straight-line relationship is assumed
o linear regression is important because
1. there exist many relationships that are of this
form
2. it provides close approximations to complicated
relationships which would otherwise be difficult
to describe
o the 2 variables are divided into (i) independent
variable and (ii) dependent variable
o Dependent Variable is the variable that we want to
forecast
o Independent Variable is the variable that we use to
make the forecast
o e.g. Time vs. GNP (time is independent, GNP is
dependent)
o scatter diagrams are used to graphically present
the relationship between the 2 variables
o usually the independent variable is drawn on the
horizontal axis (X) and the dependent variable on
vertical axis (Y)
o the regression line is also called the regression line of
Y on X
2. Assumptions
o there is a linear relationship as determined (observed)
from the scatter diagram
o the dependent values (Y) are independent of each
other, i.e. if we obtain a large value of Y on the first
observation, the result of the second and subsequent
observations will not necessarily provide a large
value. In simple terms, there should not be auto-
correlation
o for each value of X the corresponding Y values are
normally distributed
o the standard deviations of the Y values for each value
of X are the same, i.e. homoscedasticity
3. Process
o observe and note what is happening in a systematic
way
o form some kind of theory about the observed facts
o draw a scatter diagram to visualize relationship
o generate the relationship by mathematical formula
o make use of the mathematical formula to predict
4. Method of Least Squares
o from a scatter diagram, there is virtually no limit as to
the number of lines that can be drawn to make a
linear relationship between the 2 variables
o the objective is to create a BEST FIT line to the data
concerned
o the criterion used is called the method of least squares
o i.e. the sum of squares of the vertical deviations from
the points to the line be a minimum (based on the fact
that the dependent variable is drawn on the vertical
axis)
o the linear relationship between the dependent
variable (Y) and the independent variable can be
written as Y = a + bX , where a and b are parameters
describing the vertical intercept and the slope of the
regression line respectively
5. Calculating a and b
An Example
Accounting Statistics
X Y X2 Y2 XY
1 74.00 81.00 5476.00 6561.00 5994.00
2 93.00 86.00 8649.00 7396.00 7998.00
3 55.00 67.00 3025.00 4489.00 3685.00
4 41.00 35.00 1681.00 1225.00 1435.00
5 23.00 30.00 529.00 900.00 690.00
6 92.00 100.00 8464.00 10000.00 9200.00
7 64.00 55.00 4096.00 3025.00 3520.00
8 40.00 52.00 1600.00 2704.00 2080.00
9 71.00 76.00 5041.00 5776.00 5396.00
10 33.00 24.00 1089.00 576.00 792.00
11 30.00 48.00 900.00 2304.00 1440.00
12 71.00 87.00 5041.00 7569.00 6177.00
Sum 687.00 741.00 45591.00 52525.00 48407.00
Mean 57.25 61.75 3799.25 4377.08 4033.92
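The a and b of the example can be computed directly from the column totals above. A minimal Python sketch; the X and Y lists are the Accounting and Statistics marks from the table:

```python
# Least-squares estimates of a and b for Y = a + bX, using the
# textbook formulas:
#   b = (sum(XY) - n*mean(X)*mean(Y)) / (sum(X^2) - n*mean(X)^2)
#   a = mean(Y) - b*mean(X)
X = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]   # Accounting
Y = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]  # Statistics
n = len(X)

mean_x = sum(X) / n                       # 57.25
mean_y = sum(Y) / n                       # 61.75
sum_xy = sum(x * y for x, y in zip(X, Y)) # 48407
sum_x2 = sum(x * x for x in X)            # 45591

b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
a = mean_y - b * mean_x   # regression line: Y = a + bX
```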
Interpretation/Conclusion
One way we can measure the error of our estimating line is to sum
all the individual differences, or errors, between the estimated
points and the observed points. Let Ŷ be the individual values of the
estimated points and Y be the raw data values.
Figure 1
Y Ŷ diff
8 6 2
1 5 -4
6 4 2
total error 0
Figure 2
Y Ŷ diff
8 2 6
1 5 -4
6 8 -2
total error 0
It also has a zero sum of error, as shown in the above table.
The problem with adding the individual errors is the canceling effect
of the positive and negative values. From this, we might deduce
that the proper criterion for judging the goodness of fit would be to
add the absolute values of each error. The following table shows a
comparison between the absolute values of Figure 1 and Figure 2.
Figure 1 Figure 2
Y Ŷ abs. diff Y Ŷ abs. diff
8 6 2 8 2 6
1 5 4 1 5 4
6 4 2 6 8 2
total error 8 total error 12
Since the absolute error for Figure 1 is smaller than that for Figure
2, we have confirmed our intuitive impression that the estimating
line in Figure 1 is the better fit.
Figure 3
Figure 4
Figure 3 Figure 4
Y Ŷ abs. diff Y Ŷ abs. diff
4 4 0 4 5 1
7 3 4 7 4 3
2 2 0 2 3 1
total error 4 total error 5
We have added the absolute values of the errors and found that the
estimating line in Figure 3 is a better fit than the line in Figure 4.
Intuitively, however, it appears that the line in Figure 4 is the better
fit line, because it has been moved vertically to take the middle
point into consideration. Figure 3, on the other hand, seems to
ignore the middle point completely.
This is because the sum of the absolute values does not stress the
magnitude of the error.
Figure 3 Figure 4
Y Ŷ abs diff square diff Y Ŷ abs diff square diff
4 4 0 0 4 5 1 1
7 3 4 16 7 4 3 9
2 2 0 0 2 3 1 1
sum of squares 16 sum of squares 11
Applying the Least Square Criterion to the Estimating Lines (Fig 3 &
4)
Since we are looking for the estimating line that minimizes the sum
of the squares of the errors, we call this the Least Squares
Method.
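The comparison between the two estimating lines can be reproduced in a few lines. A sketch in Python; the observed and fitted values are the ones read off Figures 3 and 4:

```python
# Sum of squared errors for the two estimating lines compared above.
Y = [4, 7, 2]      # observed values
fit3 = [4, 3, 2]   # estimated points from Figure 3's line
fit4 = [5, 4, 3]   # estimated points from Figure 4's line

def sse(observed, estimated):
    """Sum of squared vertical deviations from the line."""
    return sum((y - yhat) ** 2 for y, yhat in zip(observed, estimated))

sse3 = sse(Y, fit3)   # 16
sse4 = sse(Y, fit4)   # 11, so Figure 4's line wins under least squares
```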
4
STATISTICS - DISPERSION
Frequency distribution
Applications
Class Interval
Cross tabulation
You then determine the number of data points that reside within
each bin and construct the histogram. The bin size can be defined
by the user, by some common rule, or by software methods (such
as Minitab).
Geometric mean
Calculation
The geometric mean of a data set [a1, a2, ..., an] is given by
(a1 · a2 · ... · an)^(1/n)
Harmonic mean
The harmonic mean of a1, a2, ..., an is
n / (1/a1 + 1/a2 + ... + 1/an)
The harmonic mean is the special case (exponent -1) of the power mean
and is one of the Pythagorean means. In older literature, it is sometimes
called the subcontrary mean.
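As a quick illustration (the two-element data set here is hypothetical, chosen only because both means are easy to verify by hand; Python's statistics module computes both directly):

```python
import statistics

data = [2, 8]  # hypothetical data set for illustration

gmean = statistics.geometric_mean(data)  # (2*8)**(1/2) = 4.0
hmean = statistics.harmonic_mean(data)   # 2/(1/2 + 1/8) = 3.2
```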
The measures of variability for data that we look at are: the range,
the mean deviation and the standard deviation.
Example 4.4.1
55 69 93 59 68 75 62 78 97 83 .
55 59 62 68 69 75 78 83 93 97
Interquartile Ranges
55 59 62 68 68 75 78 83 83 97
Q1, Q2, and Q3 divide a body of data into four equal parts.
Q1 is called the lower quartile and contains the lower 25% of the
data.
Q2 is the median
Q3 is called the upper quartile and contains the upper 25% of the
data.
55 59 62 68 68 75 78 83 83 97
To find the value of Q1, first find its position. There are 10 data
items, so the position for Q1 is 10/4 = 2.5. Since this is between
items 2 and 3 we take the average of items 2 and 3.
Therefore, Q1 = (59 + 62)/2 = 60.5.
The position of Q3 from the upper end of the data is again 2.5, so we
average items 2 and 3 from the upper end. The average of 83 and 83
is 83, so Q3 = 83.
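The position method described above, sketched in Python (positions that fall between two items are resolved by averaging, as in the text):

```python
# Quartiles by the position method: position = n/4 from either end;
# when the position falls between two items, average them.
data = sorted([55, 59, 62, 68, 68, 75, 78, 83, 83, 97])
n = len(data)                    # 10, so the position is 10/4 = 2.5

q1 = (data[1] + data[2]) / 2     # average of items 2 and 3 -> 60.5
q3 = (data[-2] + data[-3]) / 2   # items 2 and 3 from the upper end -> 83.0
```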
The mean deviation for a set of data is the average of the absolute
values of all the deviations.
Example 4.4.2
Compute the mean deviation for the traveling times for Arnie in
Example 4.1.1. The answer works out to 10.4 minutes.
On the average, each travel time varies 10.4 minutes from the
mean travel time of 72.8 minutes.
The mean deviation tells us that, on average, Arnie’s commuting
time is 72.8 minutes and that, on average, each commuting time
deviates from the mean by 10.4 minutes.
If classes start at 8:00 a.m., can you suggest a time at which Arnie
should leave each day?
The Variance and Standard Deviation
The variance is the sum of the squares of the deviations from the
mean divided by n, where n is the number of items in the data.
The formulas for variance and for standard deviation are:
s2 = Σ(x - x̄)2/n
s = √(Σ(x - x̄)2/n)
Example 4.4.3
Compute the variance and standard deviation for Arnie's traveling
times as given in Example 4.4.1.
Solution:
Commuting Times - Arnie from Home to Fanshawe
x x - x̄ (x - x̄)2
55 -17.8 316.84
59 -13.8 190.44
62 -10.8 116.64
68 -4.8 23.04
68 -4.8 23.04
75 2.2 4.84
78 5.2 27.04
83 10.2 104.04
83 10.2 104.04
97 24.2 585.64
Sum of squared deviations = 1495.6
mean = 72.8
Sample Variance
s2 = 1495.6/10 = 149.56
Sample Standard Deviation
s = √149.56 ≈ 12.23
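A sketch of the computation in Python. Note that the text's definition divides by n, which corresponds to statistics.pvariance; statistics.variance divides by n - 1 instead:

```python
import statistics
from math import sqrt

# Arnie's commuting times from the example above.
times = [55, 59, 62, 68, 68, 75, 78, 83, 83, 97]

mean = statistics.mean(times)      # 72.8
# The text divides the sum of squared deviations by n, which is
# pvariance in Python's statistics module.
var = statistics.pvariance(times)  # 1495.6 / 10 = 149.56
sd = sqrt(var)                     # about 12.23
```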
Problem 4.4
The following data gives the number of home runs that Babe Ruth
hit in each of his 15 years with the New York Yankees baseball
team from 1920 to 1934:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
The following are the number of home runs that Roger Maris hit in
each of the ten years he played in the major leagues from 1957 on
[data are already arrayed]:
8 13 14 16 23 26 28 33 39 61
Calculate the mean and standard deviation for each player's data
and comment on the consistency of performance of each player.
Mean, median, and mode are three kinds of "averages". There are
many "averages" in statistics, but these are, I think, the three most
common, and are certainly the three you are most likely to
encounter in your pre-statistics courses, if the topic comes up at all.
The "mean" is the "average" you're used to, where you add up all
the numbers and then divide by the number of numbers. The
"median" is the "middle" value in the list of numbers. To find the
median, your numbers have to be listed in numerical order, so you
may have to rewrite your list first. The "mode" is the value that
occurs most often.
The "range" is just the difference between the largest and smallest
values.
Find the mean, median, mode, and range for the following
list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean isn't a value from the original list. This is
a common result. You should not assume that your mean
will be one of your original numbers.
The median is the middle value, so I'll have to rewrite the list
in order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be
the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th number, which is 14.
The largest value in the list is 21, and the smallest is 13, so
the range is 21 – 13 = 8.
mean: 15
median: 14
mode: 13
range: 8
Note: The formula for the place to find the median is "( [the number
of data points] + 1) ÷ 2", but you don't have to use this formula. You
can just count in from both ends of the list until you meet in the
middle, if you prefer. Either way will work.
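The first worked example, sketched in Python (the list of values is the one summed above):

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)      # add them up, divide by 9 -> 15
median = statistics.median(values)  # sorts internally; middle value is 14
mode = statistics.mode(values)      # 13 occurs most often
rng = max(values) - min(values)     # 21 - 13 = 8
```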
Find the mean, median, mode, and range for the following
list of values:
1, 2, 4, 7
The mode is the number that is repeated most often, but all
the numbers appear only once. Then there is no mode.
mean: 3.5
median: 3
mode: none
range: 6
The list values were whole numbers, but the mean was a decimal
value. Getting a decimal value for the mean (or for the median, if
you have an even number of data points) is perfectly okay; don't
round your answers to try to match the format of the other numbers.
Find the mean, median, mode, and range for the following
list of values:
(8 + 9 + 10 + 10 + 10 + 11 + 11 + 11 + 12 + 13) ÷ 10
= 105 ÷ 10 = 10.5
The mode is the number repeated most often. This list has
two values that are repeated three times.
mean: 10.5
median: 10.5
modes: 10 and 11
range: 5
While unusual, it can happen that two of the averages (the mean
and the median, in this case) will have the same value.
Note: Depending on your text or your instructor, the above data set
may be viewed as having no mode (rather than two modes), since
no single solitary number was repeated more often than any other.
I've seen books that go either way; there doesn't seem to be a
consensus on the "right" definition of "mode" in the above case. So
if you're not certain how you should answer the "mode" part of the
above example, ask your instructor before the next test.
About the only hard part of finding the mean, median, and mode is
keeping straight which "average" is which. Just remember the
following:
(In the above, I've used the term "average" rather casually. The
technical definition of "average" is the arithmetic mean: adding up
the values and then dividing by the number of values. Since you're
probably more familiar with the concept of "average" than with
"measure of central tendency", I used the more comfortable term.)
(87 + 95 + 76 + 88 + x) ÷ 5 = 85
87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79
5
STATISTICS - PROPORTION
Estimation
Introduction
Learning objectives
Types of estimation:
a) Point estimation
b) Interval estimation
Sample Proportions
If p̂ is the proportion of successes in a sample of size n drawn from a
population with true proportion p, then
1. the mean of p̂ is p
2. the standard deviation of p̂ is √(p(1 - p)/n)
Example
p = 0.8
Point Estimations
Example
Confidence Intervals
We are not only interested in finding the point estimate for the
mean, but also determining how accurate the point estimate is.
The Central Limit Theorem plays a key role here. We assume that
the sample standard deviation is close to the population standard
deviation (which will almost always be true for large samples).
Then the Central Limit Theorem tells us that the standard deviation
of the sampling distribution is
Example
Solution
-1.96 = (x̄ - 14)/(2/√n) = (x̄ - 14)/0.28
Hence
x̄ - 14 = -0.55
and the 95% confidence interval is
(13.45, 14.55)
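A sketch of the interval computation in Python (σ/√n = 0.28, as in the example above):

```python
# 95% confidence interval for a mean: sample mean 14, with
# sigma/sqrt(n) = 0.28 as given in the example.
xbar = 14.0
se = 0.28    # sigma / sqrt(n)
z = 1.96     # 95% critical value from the normal table

lo = xbar - z * se   # 13.45 (to two decimals)
hi = xbar + z * se   # 14.55 (to two decimals)
```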
Example
Solution
We get the interval
x̄ ± tc s/√n = 15 ± tc (5/√n), where
tc = 2.10
For large samples the corresponding interval uses
x̄ ± zc s/√n
Example
Solution:
We have
We calculate:
Example
Solution
Solving for n in
Margin of Error = E = zc σ/√n
we have
√n = zc σ/E
so that
n = (zc σ/E)2
Example
A Subaru dealer wants to find out the age of their customers (for
advertising purposes). They want the margin of error to be 3 years
old. If they want a 90% confidence interval, how many people do
they need to know about?
Solution:
We have
E = 3, zc = 1.65
but there is no way of finding sigma exactly. They use the following
reasoning: most car customers are between 16 and 68 years old
hence the range is
Range = 68 - 16 = 52
52/4 = 13
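Completing the example in Python (rounding the required sample size up, since a fraction of a person cannot be sampled):

```python
from math import ceil

# Required sample size for the Subaru example: margin of error E at
# confidence zc, with sigma estimated as range/4.
E = 3
zc = 1.65
sigma = (68 - 16) / 4            # 13

n = ceil((zc * sigma / E) ** 2)  # (1.65*13/3)^2 = 51.12..., so 52 people
```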
Example
Solution
E2 = zc2 p(1 - p)/n
Multiplying by n and solving for n, we get
n = zc2 p(1 - p)/E2
Estimating Differences
Theorem
The confidence interval for μ1 - μ2 is
x̄1 - x̄2 ± zc √(s12/n1 + s22/n2)
We have
x̄1 = 14, x̄2 = 12, s1 = 5, s2 = 4, n1 = 50, n2 = 70
Now calculate
x̄1 - x̄2 = 14 - 12 = 2
zc √(s12/n1 + s22/n2) = 1.96 √(25/50 + 16/70) ≈ 1.7
so the interval is 2 ± 1.7, or
[0.3, 3.7]
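The interval above, computed in Python (a 95% level, zc = 1.96, is assumed):

```python
from math import sqrt

# 95% CI for the difference of two means (large samples).
x1, x2 = 14, 12
s1, s2 = 5, 4
n1, n2 = 50, 70

diff = x1 - x2                         # 2
se = sqrt(s1**2 / n1 + s2**2 / n2)     # about 0.854
margin = 1.96 * se                     # about 1.7
lo, hi = diff - margin, diff + margin  # about [0.3, 3.7]
```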
When either sample size is small, we can still run the statistics
provided the distributions are approximately normal. If in addition
we know that the two standard deviations are approximately equal,
then we can pool the data together to produce a pooled standard
deviation. We have the following theorem.
Pooled Estimate of the Variance
Note
Example
Solution
We have
n1 = 11 n2 = 14
and
tc = 2.07
Note: in order for this to be valid, we need all four of the quantities
n1p1, n1q1, n2p2, and n2q2 to be greater than 5.
Example
300 men and 400 women were asked how they felt about taxing
Internet sales. 75 of the men and 90 of the women agreed with
having a tax. Find a confidence interval for the difference in
proportions.
Solution
We have
We can calculate
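The elided calculation can be sketched as follows (a 95% level is assumed, since the example does not state one):

```python
from math import sqrt

# CI for the difference in proportions (men vs. women favouring the tax).
n1, n2 = 300, 400
p1 = 75 / n1    # 0.25
p2 = 90 / n2    # 0.225

diff = p1 - p2                                  # 0.025
se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)          # about 0.0326
lo, hi = diff - 1.96 * se, diff + 1.96 * se     # 95% interval
```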
Solve:
As the owner of Custom Travel, you want to estimate the mean time
that it takes a travel agent to make the initial arrangements for a
vacation package. You have asked your office manager to take a
random sample of 40 vacation requests and to observe how long it
takes to complete the initial arrangements. The office manager
reported a mean time of 23.4 min. You want to estimate the true
mean time using a 95% confidence level. Previous time studies
indicate that the standard deviation of times is a relatively
constant 9.8 min.
Solve:
A machine produces components which have a standard deviation
of 1.6 cm in length. A random sample of 64 parts is selected from
the output and this sample has a mean length of 90 cm. The
customer will reject the part if it is either less than 88 cm or more
than 92 cm. Does the 95% confidence interval for the true mean
length of all components produced ensure acceptance by the
customer?
Solve:
6
STATISTICS - HYPOTHESIS
Hypothesis Testing
H1: μ > 3
or
H0: μ = 5
H1: μ ≠ 5
3. Collect data.
6. Decide to reject or fail to reject the null hypothesis.
Example
H0: μ = .08
H0: μ = 190
x̄ = 198
and
s = 15.
Rejection Regions
We call the blue areas the rejection region, since if the value of z
falls in these regions, we can say that the null hypothesis is very
unlikely, so we can reject the null hypothesis.
Example
Solution
We have
p-values
Example:
We have
H0: μ = 50
H1: μ ≠ 50
so that
p = (1 - .9992)(2) = .0016
since
Example
H0: μ = 110
H1: μ < 110
tc = 1.73
Since
We see that the test statistic does not fall in the critical region. We
fail to reject the null hypothesis and conclude that there is
insufficient evidence to suggest that the temperature required to
damage a computer is, on average, less than 110 degrees.
Example
Suppose that you interview 1000 exiting voters about who they
voted for governor. Of the 1000 voters, 550 reported that they
voted for the democratic candidate. Is there sufficient evidence to
suggest that the democratic candidate will win the election at the
.01 level?
H0: p =.5
H1: p>.5
Since it is a large sample we can use the central limit theorem to say
that the distribution of proportions is approximately normal. We
compute the test statistic:
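The test statistic can be sketched as follows (the 2.33 cut-off is the approximate one-tailed 1% critical value from a normal table):

```python
from math import sqrt

# One-proportion z test for the exit-poll example: 550 of 1000 voters.
n = 1000
p_hat = 550 / n   # 0.55
p0 = 0.5          # proportion under the null hypothesis

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # about 3.16
# At the .01 level (one-tailed) the critical value is about 2.33,
# so we reject H0.
reject = z > 2.33
```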
Example
Solution
The hypothesis is
H0: p = .1
H1: p > .1
We have that
Exercises
Solution
H0: μ1 - μ2 = 0
H1: μ1 - μ2 > 0
The test statistic is
z = (x̄1 - x̄2)/σ(x̄1 - x̄2) = 0.4/0.405 = 0.988
Since this is a one tailed test, the critical value is 1.645 and 0.988
does not lie in the critical region. We fail to reject the null
hypothesis and conclude that there is insufficient evidence to
conclude that workers perform better at work when the music is on.
Using the P-Value technique, we see that the P-value associated
with 0.988 is
P = 1 - 0.8389 = 0.1611
which is larger than 0.05. Yet another way of seeing that we fail to
reject the null hypothesis.
Note: It would have been slightly more accurate had we used the
t-table instead of the z-table. To calculate the degrees of freedom,
we can take the smaller of the two numbers n1 - 1 and n2 - 1. So in
this example, a better estimate would use 39 degrees of freedom.
The t-table gives a value of 1.690 for the t.95 value. Notice that
0.988 is still smaller than 1.690 and the result is the same. This is
an example that demonstrates that using the t-table and z-table for
large samples results in practically the same results.
Example
Solution
We write:
We have:
n1 = 9, n2 = 10
x1 = 11, x2 = 12
s1 = 2, s2 = 3
so that
and
where
p = (r1 + r2)/(n1 + n2)
and
q = 1 - p
Example
Is the severity of the drug problem in high school the same for boys
and girls? 85 boys and 70 girls were questioned and 34 of the boys
and 14 of the girls admitted to having tried some sort of drug. What
can be concluded at the .05 level?
Solution
H0: p1 - p2 = 0
H1: p1 - p2 ≠ 0
We have
P = 2(1 - .9963) = 0.0074
which is less than .05. Yet another way to see that we reject the null
hypothesis.
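A sketch of the computation in Python, using the pooled proportion p and q = 1 - p defined earlier:

```python
from math import sqrt

# Two-proportion z test for the drug problem: 34 of 85 boys and
# 14 of 70 girls admitted to having tried some sort of drug.
r1, n1 = 34, 85
r2, n2 = 14, 70
p1, p2 = r1 / n1, r2 / n2        # 0.4 and 0.2

p = (r1 + r2) / (n1 + n2)        # pooled proportion
q = 1 - p
se = sqrt(p * q * (1 / n1 + 1 / n2))
z = (p1 - p2) / se               # about 2.68
```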
Paired Differences
Example
The best such survey is one that investigates identical twins who
have been reared in two different environments, one that is
nurturing and one that is non-nurturing. We could measure the
difference in high school GPAs between each pair. This is better
than just pooling each group individually. Our hypotheses are
H0: μd = 0
H1: μd > 0
and the confidence interval is x̄d ± t sd/√n.
Example
Suppose that ten identical twins were reared apart and the mean
difference between the high school GPA of the twin brought up in
wealth and the twin brought up in poverty was 0.07. If the standard
deviation of the differences was 0.5, find a 95% confidence interval
for the difference.
Solution
We compute
or
[-0.29, 0.43]
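The interval can be reproduced as follows (the 2.262 critical value is the 95% two-tailed t value for 9 degrees of freedom, read from a t-table):

```python
from math import sqrt

# 95% CI for the mean paired difference in the twins example.
n = 10
xd = 0.07    # mean difference in high school GPA
sd = 0.5     # standard deviation of the differences
tc = 2.262   # t critical value, 9 degrees of freedom, 95%

margin = tc * sd / sqrt(n)         # about 0.358
lo, hi = xd - margin, xd + margin  # about [-0.29, 0.43]
```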
Chi-Square Test
For example, suppose that a cross between two pea plants yields a
population of 880 plants, 639 with green seeds and 241 with yellow
seeds. You are asked to propose the genotypes of the parents.
Your hypothesis is that the allele for green is dominant to the allele
for yellow and that the parent plants were both heterozygous for
this trait. If your hypothesis is true, then the predicted ratio of
offspring from this cross would be 3:1 (based on Mendel's laws) as
predicted from the results of the Punnett square (Figure B. 1).
The chi-square test will be used to test for the "goodness of fit"
between observed and expected data from several laboratory
investigations in this lab manual.
Table B.1
Calculating Chi-Square
Green Yellow
Observed (o) 639 241
Expected (e) 660 220
Deviation (o - e) -21 21
Deviation2 (d2) 441 441
d2/e 0.668 2.005
chi-square = Σ d2/e = 0.668 + 2.005 = 2.673
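The Table B.1 computation, sketched in Python (minor differences from the printed total are due to rounding):

```python
# Chi-square statistic for the pea-plant cross in Table B.1.
observed = [639, 241]
expected = [660, 220]   # the 3:1 ratio applied to 880 offspring

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 is about 2.67; with 1 degree of freedom the 0.05 critical
# value is 3.84, so the deviation is nonsignificant.
```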
Table B.2
Chi-Square Distribution
Degrees of Freedom (df)    Probability (p)
0.95 0.90 0.80 0.70 0.50 0.30 0.20 0.10 0.05 0.01 0.001
1 0.004 0.02 0.06 0.15 0.46 1.07 1.64 2.71 3.84 6.64 10.83
2 0.10 0.21 0.45 0.71 1.39 2.41 3.22 4.60 5.99 9.21 13.82
3 0.35 0.58 1.01 1.42 2.37 3.66 4.64 6.25 7.82 11.34 16.27
4 0.71 1.06 1.65 2.20 3.36 4.88 5.99 7.78 9.49 13.28 18.47
5 1.14 1.61 2.34 3.00 4.35 6.06 7.29 9.24 11.07 15.09 20.52
6 1.63 2.20 3.07 3.83 5.35 7.23 8.56 10.64 12.59 16.81 22.46
7 2.17 2.83 3.82 4.67 6.35 8.38 9.80 12.02 14.07 18.48 24.32
8 2.73 3.49 4.59 5.53 7.34 9.52 11.03 13.36 15.51 20.09 26.12
9 3.32 4.17 5.38 6.39 8.34 10.66 12.24 14.68 16.92 21.67 27.88
10 3.94 4.86 6.18 7.27 9.34 11.78 13.44 15.99 18.31 23.21 29.59
Values to the left of the 0.05 column are nonsignificant; values at
or beyond it are significant.
Chi-square test
Purpose:
Therefore, the hypothesis that the data are from a population with
the specified distribution is rejected if the test statistic exceeds
the critical value of the chi-square distribution.
There are many non-parametric and robust techniques that are not
based on strong distributional assumptions. By non-parametric, we
mean a technique, such as the sign test, that is not based on a
specific distributional assumption. By robust, we mean a statistical
technique that performs well under a wide range of distributional
assumptions. However, techniques based on specific distributional
assumptions are in general more powerful than these non-
parametric and robust techniques. By power, we mean the ability to
detect a difference when that difference actually exists. Therefore, if
the distributional assumption can be confirmed, the parametric
techniques are generally preferred.
If you are using a technique that makes a normality (or some other
type of distributional) assumption, it is important to confirm that this
assumption is in fact justified. If it is, the more powerful parametric
techniques can be used. If the distributional assumption is not
justified, a non-parametric or robust technique may be required.
Example
If you are studying one set of data: find the mean, median, standard
deviation, interquartile range (Q3-Q1), or calculate a proportion.
To compare one set of data to a hypothetical value: run a one-sample
t-test, a Wilcoxon test, or a x2 (chi-square) test.
To look for a linear relationship between 2 variables: run a linear
regression, a nonparametric linear regression, or a simple logistic
regression.
To look for a non-linear relationship between 2 variables: run a
power, exponential, or quadratic regression, or a nonparametric
power, exponential, or quadratic regression.
See your teacher for specific details for analyses required in your
particular class.
Use this test to compare the mean values (averages) of one set of
data to a predicted mean value.
Use this test to compare the mean values (averages) of two sets
of data.
Use this test to compare data from the same subjects under two
different conditions.
Use this test to compare the mean values (averages) of more than
two sets of data where there is more than one independent
variable but only one dependent variable. If you find that your
data differ significantly, this says only that at least two of the data
sets differ from one another, not that all of your tested data sets
differ from one another.
One assumption in the ANOVA test is that your data are normally
distributed (plot as a bell curve, approximately). If this is not true,
you must use the Kruskal-Wallis test below.
Statistical test you would use: Kruskal-Wallis Test. Use this test
to compare the mean values (averages) of more than two sets of
data where the data are chosen from some limited set of values or
if your data otherwise don't form a normal (bell-curve) distribution.
This example could also be done using a two-way chi-square
test.
Statistical test you would use: Wilcoxon Signed Rank Test. Use
this test to compare the mean values (averages) of two sets of
data, or the mean value of one data set to a hypothetical mean,
where the data are ranked from low to high (here, A = 1, B = 2, C =
3, D = 4, E =5 might be your rankings; you could also have set A =
5, B = 4, ...).
For this test, typically, you assume that all choices are equally likely
and test to find whether this assumption was true. You would
assume that, for 50 subjects tested, 10 chose each of the five
options listed in the example above. In this case, your observed
values (O) would be the number of subjects who chose each
response, and your expected values (E) would be 10.
Use this test when your data consist of a limited number of possible
values that your data can have. Example 2: you ask subjects which
toy they like best from a group of toys that are identical except that
they come in several different colors. Independent variable: toy
color; dependent variable: toy choice.
9. You load weights on four different lengths of the same type and
cross-sectional area of wood to see if the maximum weight a piece
of the wood can hold is directly dependent on the length of the
wood.
Fit a line to data having only one independent variable and one
dependent variable.
10. You load weights on four different lengths and four different
thicknesses of the same type of wood to see if the maximum weight
a piece of the wood can hold is directly dependent on the length
and thickness of the wood, and to find which is more important,
length or thickness.
Fit a line to data having two or more independent variables and one
dependent variable.
11. You load weights on strips of plastic trash bags to find how
much the plastic stretches from each weight. Research that you do
indicates that plastics stretch more and more as the weight placed
on them increases; therefore the data do not plot along a straight
line.
Fit a curve to data having only one independent variable and one
dependent variable.
Student's t Distribution
But sample sizes are sometimes small, and often we do not know
the standard deviation of the population. When either of these
problems occur, statisticians rely on the distribution of the t
statistic (also known as the t score), whose values are given by:
t = [ x - μ ] / [ s / sqrt( n ) ]
Degrees of Freedom
t = [ x - μ ] / [ s / sqrt( n ) ]
T Distribution Calculator
T Distribution
Calculator
Problem 1
Note: There are two ways to solve this problem, using the T
Distribution Calculator. Both approaches are presented below.
Solution A is the traditional approach. It requires you to compute
the t score, based on data presented in the problem description.
Then, you use the T Distribution Calculator to find the probability.
Solution B is easier. You simply enter the problem data into the T
Distribution Calculator. The calculator computes a t score "behind
the scenes", and displays the probability. Both approaches come
up with exactly the same answer.
Solution A
t = [ x - μ ] / [ s / sqrt( n ) ]
t = ( 290 - 300 ) / [ 50 / sqrt( 15) ] = -10 / 12.909945 = - 0.7745966
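Solution A's t score, sketched in Python:

```python
from math import sqrt

# t score for Solution A above: sample mean 290 against a
# hypothesized mean of 300, with s = 50 and n = 15.
xbar, mu = 290, 300
s, n = 50, 15

t = (xbar - mu) / (s / sqrt(n))   # about -0.7746
```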
Solution B:
This time, we will work directly with the raw data from the problem.
We will not compute the t score; the T Distribution Calculator will do
that work for us. Since we will work with the raw data, we select
"Sample mean" from the Random Variable dropdown box. Then,
we enter the following data:
Problem 2
Solution:
To solve this problem, we will work directly with the raw data from
the problem. We will not compute the t score; the T Distribution
Calculator will do that work for us. Since we will work with the raw
data, we select "Sample mean" from the Random Variable
dropdown box. Then, we enter the following data:
This is probably the most widely used statistical test of all time,
and certainly the most widely known. It is simple, straightforward,
easy to use, and adaptable to a broad range of situations.
No statistical toolbox should ever be without it.
Question Strategy
Question: Does the presence of a certain kind of mycorrhizal fungus
enhance the growth of a certain kind of plant?
Strategy: Begin with a "subject pool" of seeds of the type of plant in
question. Randomly sort them into two groups, A and B. Plant and
grow them under conditions that are identical in every respect
except one: namely, that the seeds of group A (the experimental
group) are grown in a soil that contains the fungus, while those of
group B (the control group) are grown in a soil that does not contain
the fungus. After some specified period of time, harvest the plants
of both groups and take the relevant measure of their respective
degrees of growth. If the presence of the fungus does enhance
growth, the average measure should prove greater for group A than
for group B.
Question: Do two types of music, type-I and type-II, have different
effects upon the ability of college students to perform a series of
mental tasks requiring concentration?
Strategy: Begin with a subject pool of college students, relatively
homogeneous with respect to age, record of academic achievement,
and other variables potentially relevant to the performance of such
a task. Randomly sort the subjects into two groups, A and B. Have
the members of each group perform the series of mental tasks under
conditions that are identical in every respect except one: namely,
that group A has music of type-I playing in the background, while
group B has music of type-II. (Note that the distinction between
experimental and control group does not apply in this example.)
Conclude by measuring how well the subjects perform on the series
of tasks under their respective conditions. Any difference between
the effects of the two types of music should show up as a difference
between the mean levels of performance for group A and group B.
Do two strains With this type of situation you are in effect starting
of mice, A out with two subject pools, one for strain A and
and B, differ one for strain B. Draw a random sample of size Na
with respect to from pool A and another of size Nb from pool B.
their ability to Run the members of each group through a
learn to avoid standard aversive-conditioning procedure,
an aversive measuring for each one how well and quickly the
stimulus? avoidance behavior is acquired. Any difference
between the avoidance-learning abilities of the
mca-5 452
Null Hypothesis

Under the null hypothesis, the expected value of Ma − Mb is zero, and the standard deviation of the sampling distribution of Ma − Mb is

    σ(M−M) = sqrt[ σ²source/Na + σ²source/Nb ]

(3) This, in turn, would allow us to test the null hypothesis for any
particular Ma − Mb difference by calculating the appropriate z-ratio

    z = (MXa − MXb) / σ(M−M)

In practice, however, σ(M−M) is not known and must be estimated from the samples, so the test statistic is instead the t-ratio

    t = (MXa − MXb) / est.σ(M−M)
(4) To help you keep track of where the particular numerical values
are coming from beyond this point, here again are the summary
statistics for our hypothetical experiment on the effects of two
types of music:
    Group A (music of type-I): Na = 15, Ma = 23.13, SSa = 119.73
    Group B (music of type-II): Nb = 15, Mb = 20.87, SSb = 175.73
    Ma − Mb = 2.26
(3) As indicated in Chapter 9, the variance of the source
population can be estimated as

    {s²p} = (SSa + SSb) / [(Na − 1) + (Nb − 1)]

    {s²p} = (119.73 + 175.73) / (14 + 14) = 10.55

and the estimated standard error of the difference between the
means as

    est.σ(M−M) = sqrt[ {s²p}/Na + {s²p}/Nb ]
               = sqrt[ 10.55/15 + 10.55/15 ] = ±1.19

(4) And with this estimated value of σ(M−M) in hand, we are then
able to calculate the appropriate t-ratio as

    t = (MXa − MXb) / est.σ(M−M)
      = (23.13 − 20.87) / 1.19 = +1.9
Inference

The same logic would have applied to the left tail of the
distribution if our initial research hypothesis had been in the
opposite direction.
Note that this test makes the following assumptions and can be
meaningfully applied only insofar as these assumptions are met:
That the two samples are independently and randomly drawn from
the source population(s).
That the scale of measurement for both samples has the properties
of an equal interval scale.
That the source population(s) can be reasonably supposed to have
a normal distribution.
    {s²p} = (SSa + SSb) / [(Na − 1) + (Nb − 1)]

    est.σ(M−M) = sqrt[ {s²p}/Na + {s²p}/Nb ]

Step 4. Calculate t as

    t = (MXa − MXb) / est.σ(M−M)
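The steps above can be sketched in Python. This is a minimal illustration using the chapter's own numbers; the helper name `pooled_t` is ours, not the text's:

```python
import math

def pooled_t(ma, mb, ssa, ssb, na, nb):
    """Two-sample t-ratio using the pooled variance estimate {s2p}."""
    s2p = (ssa + ssb) / ((na - 1) + (nb - 1))   # pooled variance
    se = math.sqrt(s2p / na + s2p / nb)         # est. sigma(M-M)
    return (ma - mb) / se

# The music experiment: Na = Nb = 15
t = pooled_t(23.13, 20.87, 119.73, 175.73, 15, 15)
print(round(t, 2))   # → 1.91
```

This reproduces the text's t of roughly +1.9; the small difference comes from carrying full precision instead of the rounded intermediate 1.19.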
5
INTRODUCTION TO STEADY-STATE QUEUEING THEORY
As you work hard in this class (assuming that you desire to obtain
that hard-to-get passing mark), you should check your skills by
solving homework problems.
Hint: Do many problems of all types before our 2-hour exams. This
pertains to queueing and anything else covered by the syllabus. You
control your destiny in this course; no whining is anticipated after
the exams. Imagine being a civil engineer who does not do a full job
on a bridge design and people die: partial solutions do not make my
day. So learn this material and learn it well. Then consider taking
ISE 704 later, as you will have a major head start on the graduate
students learning simulation for the first time in that class.
Perhaps, if you are really good, you can work on your MBA at
Otterbein later in your career.
Recall the last time that you had to wait at a supermarket checkout
counter, for a teller at your local bank, or to be served at a
fast-food restaurant. In these and many other waiting line
situations, the time spent waiting is undesirable. Adding more
checkout clerks, bank tellers, or servers is not always the most
economical strategy for improving service, so businesses need to
determine ways to keep waiting times within tolerable limits.
Distribution of Arrivals

[Figure: The Burger Dome waiting line system. Customer arrivals
join the waiting line; a single server performs order taking and
order filling; the customer leaves after the order is filled.]

    P(x) = (λ^x e^−λ) / x!    (14.1)

where

    x = the number of arrivals in the time period
    λ = the mean number of arrivals per time period

With a mean arrival rate of λ = 0.75 arrivals per minute, equation
(14.1) becomes

    P(x) = (0.75^x e^−0.75) / x!    (14.2)
The waiting line models that will be presented in Sections 14.2 and
14.3 use the Poisson probability distribution to describe the
customer arrivals at Burger Dome. In practice, you should record
the actual number of arrivals per time period for several days or
weeks, and compare the frequency distribution of the observed
number of arrivals to the Poisson probability distribution to
determine whether the Poisson probability distribution provides a
reasonable approximation of the arrival distribution.
    Number of Arrivals    Probability
    0                     0.4724
    1                     0.3543
    2                     0.1329
    3                     0.0332
    4                     0.0062
    5 or more             0.0010
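These probabilities follow directly from equation (14.2); a short Python sketch (the function name is ours):

```python
import math

def poisson_pmf(x, lam):
    """P(x) = lam**x * exp(-lam) / x!  -- equation (14.1)."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Burger Dome: lam = 0.75 arrivals per minute, equation (14.2)
for x in range(5):
    print(x, round(poisson_pmf(x, 0.75), 4))

# The probability of 5 or more arrivals is the remainder:
print("5 or more:", round(1 - sum(poisson_pmf(x, 0.75) for x in range(5)), 4))
```

Full precision gives 0.0011 for the last entry; the text's table truncates it to 0.0010.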
where

    μ = the mean number of units that can be served per time period
Suppose that Burger Dome studied the order-taking and order-filling
process and found that the single food server can process an
average of 60 customer orders per hour. On a one-minute basis, the
mean service rate would be μ = 60 customers/60 minutes = 1 customer
per minute. Service times can then be described by an exponential
probability distribution:

    P(service time ≤ t) = 1 − e^−μt    (14.3)

(A property of the exponential probability distribution is that
there is a 0.6321 probability that the random variable takes on a
value less than its mean.)

For example, with μ = 1, we can use equation (14.3) to compute
probabilities such as the probability an order can be processed in
1/2 minute or less, 1 minute or less, and 2 minutes or less. These
computations are

    P(service time ≤ 0.5 min.) = 1 − e^−1(0.5) = 1 − 0.6065 = 0.3935
    P(service time ≤ 1.0 min.) = 1 − e^−1(1.0) = 1 − 0.3679 = 0.6321
    P(service time ≤ 2.0 min.) = 1 − e^−1(2.0) = 1 − 0.1353 = 0.8647
Queue Discipline
In describing a waiting line system, we must define the manner in
which the waiting units are arranged for service. For the Burger
Dome waiting line, and in general for most customer oriented
waiting lines, the units waiting for service are arranged on a
first-come, first-served basis; this approach is referred to as an
FCFS queue discipline. However, some situations call for different
queue disciplines. For example, when people wait for an elevator,
the last one on the elevator is often the first one to complete
service (i.e., the first to leave the elevator). Other types of queue
disciplines assign priorities to the waiting units and then serve the
unit with the highest priority first. In this chapter we consider only
waiting lines based on a first-come, first-served queue discipline.
Steady-State Operation
When the Burger Dome restaurant opens in the morning, no
customers are in the restaurant. Gradually, activity builds up to a
normal or steady state. The beginning or start-up period is
referred to as the transient period. The transient period ends
when the system reaches the normal or steady-state operation.
The models described in this handout cover the steady-state
operating characteristics of a waiting line.
Operating Characteristics
The following formulas can be used to compute the steady-state
operating characteristics for a single-channel waiting line with
Poisson arrivals and exponential service times, where
    λ = the mean number of arrivals per time period (the mean arrival rate)
    μ = the mean number of services per time period (the mean service rate)

The probability that no units are in the system:

    P0 = 1 − λ/μ    (14.4)

The average number of units in the waiting line:

    Lq = λ² / [μ(μ − λ)]    (14.5)

The average number of units in the system:

    L = Lq + λ/μ    (14.6)

The average time a unit spends in the waiting line:

    Wq = Lq / λ    (14.7)

The average time a unit spends in the system:

    W = Wq + 1/μ    (14.8)
The values of the mean arrival rate λ and the mean service rate μ
are clearly important components in determining the operating
characteristics. Equation (14.9), Pw = λ/μ, shows that the ratio of
the mean arrival rate to the mean service rate provides the
probability that an arriving unit has to wait because the service
facility is in use. Hence, λ/μ often is referred to as the
utilization factor for the service facility.
    Number of Units in System (n)    Probability (Pn)
    0                                0.2500
    1                                0.1875
    2                                0.1406
    3                                0.1055
    4                                0.0791
    5                                0.0593
    6                                0.0445
    7 or more                        0.1335
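A minimal Python sketch of the single-channel formulas, using the Burger Dome rates λ = 0.75 and μ = 1 (the helper name `mm1` is ours):

```python
def mm1(lam, mu):
    """Steady-state characteristics of a single-channel waiting line
    with Poisson arrivals and exponential service times.
    Assumes lam < mu, so a steady state exists."""
    rho = lam / mu                      # utilization factor, lam/mu
    p0 = 1 - rho                       # (14.4) probability of an empty system
    lq = lam**2 / (mu * (mu - lam))    # (14.5) mean number waiting in line
    l = lq + rho                       # (14.6) mean number in the system
    wq = lq / lam                      # (14.7) mean wait in line
    w = wq + 1 / mu                    # (14.8) mean time in the system
    return p0, lq, l, wq, w

p0, lq, l, wq, w = mm1(0.75, 1.0)
print(p0, lq, l, wq, w)   # → 0.25 2.25 3.0 3.0 4.0
```

With these values, Pn = P0 (λ/μ)^n reproduces the table above, e.g. P1 = 0.25 × 0.75 = 0.1875.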
Are there alternatives that Burger Dome can use to increase the
service rate? If so, and if the mean service rate μ can be
identified for each alternative, equations (14.4) through (14.10)
can be used to determine the revised
operating characteristics and any improvements in the waiting line
system. The added cost of any proposed change can be compared
to the corresponding service improvements to help the manager
determine whether the proposed service improvements are
worthwhile.
Notes:
1. The assumption that arrivals follow a Poisson probability
distribution is equivalent to the assumption that the time
between arrivals has an exponential probability distribution.
For example, if the arrivals for a waiting line follow a Poisson
probability distribution with a mean of 20 arrivals per hour,
the time between arrivals will follow an exponential
probability distribution, with a mean time between arrivals of
1/20 or 0.05 hour.
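This equivalence is easy to illustrate with a small simulation (our sketch, not part of the original text): draw exponential interarrival gaps with mean 1/20 hour and count how many arrivals land in each hour.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def arrivals_in_one_hour(rate):
    """Count arrivals in one hour when interarrival gaps are
    exponentially distributed with the given rate (per hour)."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(rate)   # exponential gap, mean 1/rate
        if t > 1.0:
            return n
        n += 1

counts = [arrivals_in_one_hour(20) for _ in range(2000)]
print(sum(counts) / len(counts))   # close to 20, the Poisson mean
```

The hourly counts behave like a Poisson distribution with mean 20, exactly as the note states.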
[Figure: A two-channel waiting line system. A single queue feeds
two channels, with Server A in channel 1 and Server B in channel 2.]
    P0 = 1 / [ sum from n=0 to k−1 of (λ/μ)^n / n!
               + ((λ/μ)^k / k!) (kμ / (kμ − λ)) ]    (14.11)

    Lq = [ (λ/μ)^k λμ / ((k − 1)! (kμ − λ)²) ] P0    (14.12)

    L = Lq + λ/μ    (14.13)

    Wq = Lq / λ    (14.14)

    W = Wq + 1/μ    (14.15)

    Pw = (1/k!) (λ/μ)^k (kμ / (kμ − λ)) P0    (14.16)

    Pn = ((λ/μ)^n / n!) P0    for n ≤ k    (14.17)

    Pn = ((λ/μ)^n / (k! k^(n−k))) P0    for n > k    (14.18)
    Pw = (1/2!) (0.75)² [ 2(1) / (2(1) − 0.75) ] (0.4545) = 0.2045
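Equations (14.11) and (14.16) can be checked against the quoted values; a Python sketch (the helper names are ours):

```python
import math

def mmk_p0(lam, mu, k):
    """P0 for a k-channel waiting line with Poisson arrivals and
    exponential service times -- equation (14.11). Assumes lam < k*mu."""
    r = lam / mu
    total = sum(r**n / math.factorial(n) for n in range(k))
    total += (r**k / math.factorial(k)) * (k * mu / (k * mu - lam))
    return 1 / total

def mmk_pw(lam, mu, k):
    """Probability an arriving unit must wait -- equation (14.16)."""
    r = lam / mu
    return (r**k / math.factorial(k)) * (k * mu / (k * mu - lam)) * mmk_p0(lam, mu, k)

# Two-channel Burger Dome: lam = 0.75, mu = 1, k = 2
print(round(mmk_p0(0.75, 1.0, 2), 4))   # → 0.4545
print(round(mmk_pw(0.75, 1.0, 2), 4))   # → 0.2045
```

The first value matches the λ/μ = 0.75, k = 2 entry in the P0 table below, and the second matches the Pw computation above.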
Values of P0 for multiple-channel waiting lines with Poisson
arrivals and exponential service times (a blank entry means
λ/μ ≥ k, so no steady state exists):

    Ratio λ/μ    k = 2     k = 3     k = 4     k = 5
    0.15        0.8605    0.8607    0.8607    0.8607
    0.20        0.8182    0.8187    0.8187    0.8187
    0.25        0.7778    0.7788    0.7788    0.7788
    0.30        0.7391    0.7407    0.7408    0.7408
    0.35        0.7021    0.7046    0.7047    0.7047
    0.40        0.6667    0.6701    0.6703    0.6703
    0.45        0.6327    0.6373    0.6376    0.6376
    0.50        0.6000    0.6061    0.6065    0.6065
    0.55        0.5686    0.5763    0.5769    0.5769
    0.60        0.5385    0.5479    0.5487    0.5488
    0.65        0.5094    0.5209    0.5219    0.5220
    0.70        0.4815    0.4952    0.4965    0.4966
    0.75        0.4545    0.4706    0.4722    0.4724
    0.80        0.4286    0.4472    0.4491    0.4493
    0.85        0.4035    0.4248    0.4271    0.4274
    0.90        0.3793    0.4035    0.4062    0.4065
    0.95        0.3559    0.3831    0.3863    0.3867
    1.00        0.3333    0.3636    0.3673    0.3678
    1.20        0.2500    0.2941    0.3002    0.3011
    1.40        0.1765    0.2360    0.2449    0.2463
    1.60        0.1111    0.1872    0.1993    0.2014
    1.80        0.0526    0.1460    0.1616    0.1646
    2.00                  0.1111    0.1304    0.1343
    2.20                  0.0815    0.1046    0.1094
    2.40                  0.0562    0.0831    0.0889
    2.60                  0.0345    0.0651    0.0721
    2.80                  0.0160    0.0521    0.0581
    3.00                            0.0377    0.0466
    3.20                            0.0273    0.0372
    3.40                            0.0186    0.0293
    3.60                            0.0113    0.0228
    3.80                            0.0051    0.0174
    4.00                                      0.0130
    4.20                                      0.0093
    4.40                                      0.0063
    4.60                                      0.0038
    4.80                                      0.0017
    L = λW    (14.19)

    Lq = λWq    (14.20)

    Wq = Lq / λ    (14.21)

    W = Wq + 1/μ    (14.22)

Recall that we used equation (14.22) to provide the average time in
the system for both the single- and multiple-channel waiting line
models [see equations (14.8) and (14.15)].
The importance of Little's flow equations is that they apply to any
waiting line model regardless of whether arrivals follow the Poisson
probability distribution and regardless of whether service times
follow the exponential probability distribution. For example, in a
study of the grocery checkout counters at Murphy's Foodliner, an
analyst concluded that arrivals follow the Poisson probability
distribution with a mean arrival rate of 24 customers per hour, or
λ = 24/60 =
0.40 customers per minute. However, the analyst found that
service times follow a normal probability distribution rather than an
exponential probability distribution. The mean service rate was
found to be 30 customers per hour, or μ = 30/60 = 0.50 customers
per minute. A time study of actual customer waiting times showed
that, on average, a customer spends 4.5 minutes in the system
(waiting time plus checkout time); that is, W = 4.5. Using the
waiting line relationships discussed in this section, we can now
compute other operating characteristics for this waiting line. First,
using equation (14.22) and solving for Wq, we have
    Wq = W − 1/μ = 4.5 − 1/0.50 = 4.5 − 2.0 = 2.5 minutes
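A quick numeric check of these flow equations on the Foodliner figures (our sketch, not part of the original text):

```python
# Little's flow equations (14.19)-(14.22) applied to the Murphy's
# Foodliner checkout example: lam = 0.40/min, mu = 0.50/min, W = 4.5 min.
lam, mu, w = 0.40, 0.50, 4.5

wq = w - 1 / mu        # (14.22) rearranged: mean wait in line
l = lam * w            # (14.19): mean number in the system
lq = lam * wq          # (14.20): mean number in the waiting line

print(round(wq, 2), round(l, 2), round(lq, 2))   # → 2.5 1.8 1.0
```

Note that none of these steps assumed exponential service times; Little's flow equations hold regardless of the service-time distribution.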
Note: In waiting line systems where the length of the waiting line is
limited (e.g., a small waiting area), some arriving units will be
blocked from joining the waiting line and will be lost. In this case,
the blocked or lost arrivals will make the mean number of units
entering the system something less than the mean arrival rate. By
defining λ as the mean number of units joining the system, rather
than the mean arrival rate, the relationships discussed in this
section can be used to determine W, L, Wq, and Lq.
Figure 14.4 shows the general shape of the cost curves in the
economic analysis of waiting lines. The service cost increases as
the number of channels is increased. However, with more
channels, the service is better. As a result, waiting time and cost
decrease as the number of channels is increased. The number of
channels that will provide a good approximation of the minimum
total cost design can be found by evaluating the total cost for
several design alternatives.
[Figure 14.4: The general shape of the cost curves in the economic
analysis of waiting lines. Service cost per hour increases with the
number of channels, waiting cost decreases, and total cost is the
sum of the two.]
A/B/k
where
A denotes the probability distribution for the arrivals
B denotes the probability distribution for the service time
k denotes the number of channels
Queueing Theory

p0 = 1 − t and U = 1 − p0 = t, where t denotes the traffic
intensity (the server utilization).
Markov chains
Communication Delays
Before we proceed further, let's understand the different
components of delay in a messaging system. The total delay
experienced by messages can be classified into the following
categories:
Little's Theorem
We begin our analysis of queueing systems by understanding
Little's Theorem. Little's theorem states that:
A/S/n
Poisson Arrivals
M/M/1 queueing systems assume a Poisson arrival process. This
assumption is a very good approximation for arrival process in real
systems that meet the following rules:
Cars on a Highway
As you can see, these assumptions are fairly general, so they apply
to a large variety of systems. Let's consider the example of cars
entering a highway, and see if the above rules are met.
Telephony Arrivals
Again, if not all the rules are met, we cannot assume telephone
arrivals are Poisson. If the telephone exchange is a PABX catering
to a few subscribers, the total number of customers is small, so
we cannot assume that rules 1 and 2 apply. If rules 1 and 2 do
apply but telephone calls are being initiated due to some disaster,
calls cannot be considered independent of each other. This violates
rule 3.
In the figure below, we have just plotted the impact of one arrival
rate. If another graph was plotted after doubling the arrival rate (1
car every 5 seconds), the probability of not seeing a car in an
interval would fall much more steeply.
This result can be generalized in all cases where user sessions are
involved.
Transmission Delays
Single Server
With M/M/1 we have a single server for the queue. Suitability of
M/M/1 queueing is easy to identify from the server standpoint. For
example, a single transmit queue feeding a single link qualifies as a
single server and can be modeled as an M/M/1 queueing system.

M/M/1 Results
As we have seen earlier, M/M/1 can be applied to systems that
meet certain criteria. But if the system you are designing can be
modeled as an M/M/1 queueing system, you are in luck. The
equations describing an M/M/1 queueing system are fairly
straightforward and easy to use.
Lastly, we obtain the total waiting time (including the service
time).
Queuing theory
Given the importance of response time (and
throughput), we need a means of computing values
for these metrics.
Queuing theory
Elements of a queuing system:
Request & arrival rate
o This is a single "request for service".
o The rate at which requests are
generated is the arrival rate .
Queue
Queuing theory
Useful statistics
Length_queue, Time_queue
o These are the average length of the
queue and the average time a request
spends waiting in the queue.
Queuing theory
Useful statistics
Little's Law
o The mean number of tasks in the
system = arrival rate * mean
response time.
Server utilization
o This is just arrival rate * mean
service time, i.e., the fraction of
time the server is busy.
Queuing theory
Queue discipline
o This is the order in which
requests are delivered to the
server.
o (We assume FIFO here; this isn't a
class on queuing theory.)
Queuing theory
Example: Given:
Processor sends 10 disk I/O per second (which are
exponentially distributed).
Average disk service time is 20 ms.
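The arithmetic for this example is not carried out in the text; assuming the 10 I/O per second form the arrival rate and the 20 ms is the mean service time, an M/M/1 sketch gives:

```python
# M/M/1 back-of-the-envelope for the disk example:
# arrival rate = 10 requests/s, mean service time = 20 ms.
arrival_rate = 10.0          # requests per second
time_server = 0.020          # mean service time in seconds

utilization = arrival_rate * time_server                      # fraction of time disk is busy
time_queue = time_server * utilization / (1 - utilization)    # M/M/1 mean wait in queue
time_system = time_queue + time_server                        # total response time

print(round(utilization, 3))            # → 0.2
print(round(time_queue * 1000, 3))      # → 5.0 (ms)
print(round(time_system * 1000, 3))     # → 25.0 (ms)
```

So the disk is busy 20% of the time, and queueing adds 5 ms on top of the 20 ms service time.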
Queuing theory
Basic assumptions made about problems:
System is in equilibrium.
Interarrival time (time between two successive requests
arriving) is exponentially distributed.
Infinite number of requests.
Server does not need to delay between servicing requests.
No limit to the length of the queue and queue is FIFO.
All requests must be completed at some point.
o G = general service
distribution (i.e., not
exponential)
o 1 = server can serve 1 request
at a time
Benchmarks:
Transaction processing
o The purpose of these benchmarks is to
determine how many small (and
usually random) requests a system
can satisfy in a given period of time.
Self-scaling I/O
Number of processes.
This is varied to control
concurrent requests, e.g.,
the number of tasks
simultaneously issuing I/O
requests.