
Advanced Probability

Alexander Sokol
Anders Rønn-Nielsen

Department of Mathematical Sciences
University of Copenhagen
Universitetsparken 5
DK-2100 Copenhagen

Copyright 2013 Alexander Sokol & Anders Rønn-Nielsen

ISBN 978-87-7078-999-8
Contents

Preface v

1 Sequences of random variables 1


1.1 Measure-theoretic preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Convergence of sequences of random variables . . . . . . . . . . . . . . . . . . 3
1.3 Independence and Kolmogorov’s zero-one law . . . . . . . . . . . . . . . . . . 15
1.4 Convergence of sums of independent variables . . . . . . . . . . . . . . . . . . 21
1.5 The strong law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2 Ergodicity and stationarity 35


2.1 Measure preservation, invariance and ergodicity . . . . . . . . . . . . . . . . . 35
2.2 Criteria for measure preservation and ergodicity . . . . . . . . . . . . . . . . . 40
2.3 Stationary processes and the law of large numbers . . . . . . . . . . . . . . . 44
2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3 Weak convergence 59
3.1 Weak convergence and convergence of measures . . . . . . . . . . . . . . . . . 60
3.2 Weak convergence and distribution functions . . . . . . . . . . . . . . . . . . 67
3.3 Weak convergence and convergence in probability . . . . . . . . . . . . . . . . 69
3.4 Weak convergence and characteristic functions . . . . . . . . . . . . . . . . . 72
3.5 Central limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.6 Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.7 Higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4 Signed measures and conditioning 103


4.1 Decomposition of signed measures . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2 Conditional Expectations given a σ-algebra . . . . . . . . . . . . . . . . . . . 115

4.3 Conditional expectations given a random variable . . . . . . . . . . . . . . . . 124


4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5 Martingales 133
5.1 Introduction to martingale theory . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.2 Martingales and stopping times . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.3 The martingale convergence theorem . . . . . . . . . . . . . . . . . . . . . . . 145
5.4 Martingales and uniform integrability . . . . . . . . . . . . . . . . . . . . . . 151
5.5 The martingale central limit theorem . . . . . . . . . . . . . . . . . . . . . . . 164
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

6 The Brownian motion 191


6.1 Definition and existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.2 Continuity of the Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . 197
6.3 Variation and quadratic variation . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.4 The law of the iterated logarithm . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

7 Further reading 227

A Supplementary material 229


A.1 Limes superior and limes inferior . . . . . . . . . . . . . . . . . . . . . . . . . 229
A.2 Measure theory and real analysis . . . . . . . . . . . . . . . . . . . . . . . . . 233
A.3 Existence of sequences of random variables . . . . . . . . . . . . . . . . . . . 239
A.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

B Hints for exercises 241


B.1 Hints for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
B.2 Hints for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
B.3 Hints for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
B.4 Hints for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
B.5 Hints for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
B.6 Hints for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
B.7 Hints for Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

Bibliography 263
Preface

The purpose of this monograph is to present a detailed introduction to selected fundamentals of modern probability theory. The focus is in particular on discrete-time and continuous-time
processes, including the law of large numbers, Lindeberg’s central limit theorem, martingales,
the martingale convergence theorem and the martingale central limit theorem, as well as
basic results on Brownian motion. The reader is assumed to have a reasonable grasp of basic
analysis and measure theory, as can be obtained through Hansen (2009), Carothers (2000)
or Ash (1972), for example.

We have endeavoured throughout to present the material in a logical fashion, with detailed
proofs allowing the reader to perceive not only the big picture of the theory, but also to
understand the finer elements of the methods of proof used. Exercises are given at the
end of each chapter, with hints for the exercises given in the appendix. The exercises form
an important part of the monograph. We strongly recommend that any reader wishing to
acquire a sound understanding of the theory spends considerable time solving the exercises.

While we share the responsibility for the ultimate content of the monograph and in partic-
ular any mistakes therein, much of the material is based on books and lecture notes from
other individuals, in particular “Videregående sandsynlighedsregning” by Martin Jacobsen,
the lecture notes on weak convergence by Søren Tolver Jensen, “Sandsynlighedsregning på
Målteoretisk Grundlag” by Ernst Hansen as well as supplementary notes by Ernst Hansen,
in particular a note on the martingale central limit theorem. We are also indebted to Ketil
Biering Tvermosegaard, who diligently translated the lecture notes by Martin Jacobsen and
thus eased the migration of their contents to their present form in this monograph.

We would like to express our gratitude to our own teachers, particularly Ernst Hansen, Martin
Jacobsen and Søren Tolver Jensen, who taught us measure theory and probability theory.

Also, many warm thanks go to Henrik Nygaard Jensen, who meticulously read large parts of
the manuscript and gave many useful comments.

Alexander Sokol
Anders Rønn-Nielsen
København, August 2012

Since the previous edition of the book, a number of misprints and errors have been cor-
rected, and various other minor amendments have been made. We are grateful to the many
students who have contributed to the monograph by identifying mistakes and suggesting
improvements.

Alexander Sokol
Anders Rønn-Nielsen
København, June 2013
Chapter 1

Sequences of random variables

In this chapter, we will consider sequences of random variables and the basic results on such
sequences, in particular the strong law of large numbers, which formalizes the intuitive notion
that averages of independent and identically distributed events tend to the common mean.

We begin in Section 1.1 by reviewing the measure-theoretic preliminaries for our later results.
In Section 1.2, we discuss modes of convergence for sequences of random variables. The results
given in this section are fundamental to much of the remainder of this monograph, as well
as modern probability in general. In Section 1.3, we discuss the concept of independence
for families of σ-algebras, and as an application, we prove the Kolmogorov zero-one law,
which shows that for sequences of independent variables, events which, colloquially speaking,
depend only on the tail of the sequence either have probability zero or one. In Section 1.4,
we apply the results of the previous sections to prove criteria for the convergence of sums
of independent variables. Finally, in Section 1.5, we prove the strong law of large numbers,
arguably the most important result of this chapter.

1.1 Measure-theoretic preliminaries

As noted in the introduction, we assume given a level of familiarity with basic real analysis
and measure theory. Some of the main results assumed to be well-known in the following
are reviewed in Appendix A. In this section, we give an independent review of some basic
results, and review particular notation related to probability theory.

We begin by recalling some basic definitions. Let Ω be some set. A σ-algebra F on Ω is a set of subsets of Ω with the following three properties: Ω ∈ F; if F ∈ F, then F^c ∈ F as well; and if (Fn)n≥1 is a sequence of sets with Fn ∈ F for n ≥ 1, then ⋃_{n=1}^∞ Fn ∈ F as well. We refer to the second condition as F being stable under complements, and we refer to the third condition as F being stable under countable unions. From these stability properties, it also follows that if (Fn)n≥1 is a sequence of sets in F, then ⋂_{n=1}^∞ Fn ∈ F as well. We refer to the pair (Ω, F) as a measurable space, and we refer to the elements of F as events. A probability measure P on (Ω, F) is a mapping P : F → [0, 1] such that P(∅) = 0, P(Ω) = 1 and whenever (Fn) is a sequence of disjoint sets in F, it holds that Σ_{n=1}^∞ P(Fn) is convergent and P(⋃_{n=1}^∞ Fn) = Σ_{n=1}^∞ P(Fn). We refer to the latter property as the σ-additivity of the probability measure P. We refer to the triple (Ω, F, P) as a probability space.

Next, assume given a measurable space (Ω, F), and let H be a set of subsets of Ω. We may then form the set A of all σ-algebras on Ω containing H; this is a subset of the power set of the power set of Ω. We may then define σ(H) = ⋂_{F∈A} F, the intersection of all σ-algebras in A, that is, the intersection of all σ-algebras containing H. This is a σ-algebra as well, and it is the smallest σ-algebra on Ω containing H in the sense that for any σ-algebra G containing H, we have G ∈ A and therefore σ(H) = ⋂_{F∈A} F ⊆ G. We refer to σ(H) as the σ-algebra generated by H, and we say that H is a generating family for σ(H).
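On a finite set Ω, the generated σ-algebra can be computed explicitly by iterating the stability requirements. The sketch below is our own illustration, not part of the text: the helper name `generate_sigma_algebra` and the example sets are assumptions. It closes a collection under complements and pairwise unions until nothing new appears; on a finite set, countable unions reduce to finite ones, so this yields exactly σ(H).

```python
from itertools import combinations

def generate_sigma_algebra(omega, H):
    """Compute sigma(H) on a finite set omega by repeatedly closing the
    current collection under complements and pairwise unions until stable."""
    sets = {frozenset(omega), frozenset()} | {frozenset(h) for h in H}
    while True:
        new = set(sets)
        for a in sets:
            new.add(frozenset(omega) - a)  # stability under complements
        for a, b in combinations(sets, 2):
            new.add(a | b)                 # stability under (finite) unions
        if new == sets:
            return sets
        sets = new

omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}])
```

For H = {{1}} this produces the four sets ∅, {1}, {2, 3, 4} and Ω, matching the description of σ(H) as the smallest σ-algebra containing H.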

Using this construction, we may define a particular σ-algebra on the Euclidean spaces: The
Borel σ-algebra Bd on Rd for d ≥ 1 is the smallest σ-algebra containing all open sets in Rd .
We denote the Borel σ-algebra on R by B.

Next, let (Fn)n≥1 be a sequence of sets in F. If Fn ⊆ Fn+1 for all n ≥ 1, we say that (Fn)n≥1 is increasing. If Fn ⊇ Fn+1 for all n ≥ 1, we say that (Fn)n≥1 is decreasing. Assume that D is a set of subsets of Ω such that the following holds: Ω ∈ D; if F, G ∈ D with F ⊆ G, then G \ F ∈ D; and if (Fn)n≥1 is an increasing sequence of sets in D, then ⋃_{n=1}^∞ Fn ∈ D. If this is the case, we say that D is a Dynkin class. Furthermore, if H is a set of subsets of Ω such that whenever F, G ∈ H then F ∩ G ∈ H, then we say that H is stable under finite intersections. These two concepts combine in the following useful manner, known as Dynkin's lemma: Let D be a Dynkin class on Ω, and let H be a set of subsets of Ω which is stable under finite intersections. If H ⊆ D, then σ(H) ⊆ D.

Dynkin’s lemma is useful when we desire to show that some property holds for all sets F ∈ F.
A consequence of Dynkin’s lemma is that if P and Q are two probability measures on F which
are equal on a generating family for F which is stable under finite intersections, then P and
Q are equal on all of F.

Assume given a probability space (Ω, F, P). The probability measure satisfies that for any pair of events F, G ∈ F with F ⊆ G, P(G \ F) = P(G) − P(F). Also, if (Fn) is an increasing sequence in F, then P(⋃_{n=1}^∞ Fn) = lim_{n→∞} P(Fn), and if (Fn) is a decreasing sequence in F, then P(⋂_{n=1}^∞ Fn) = lim_{n→∞} P(Fn). These two properties are known as the upwards and downwards continuity of probability measures, respectively.
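Upwards continuity can be observed concretely for a measure we can compute with. The sketch below is our own illustration, not from the text: P({n}) = 2^(−n) defines a probability measure on the positive integers, the events F_n = {1, …, n} increase to all of N, and P(F_n) = 1 − 2^(−n) increases to P(N) = 1.

```python
def P(A):
    """P({n}) = 2^(-n) for n = 1, 2, ...; these weights sum to 1 over N."""
    return sum(2.0 ** -n for n in A)

# Increasing events F_n = {1, ..., n} whose union is all of N.
values = [P(range(1, n + 1)) for n in (1, 2, 5, 20, 50)]

# Upwards continuity: P(F_n) increases towards lim P(F_n) = 1 = P(N).
increasing = all(a < b for a, b in zip(values, values[1:]))
```

The sequence of values approaches 1 geometrically, as the closed form 1 − 2^(−n) predicts.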

Given a mapping X : Ω → R, we say that X is F-B measurable if it holds for all B ∈ B that X⁻¹(B) ∈ F, where we use the notation X⁻¹(B) = {ω ∈ Ω | X(ω) ∈ B}. Letting the σ-algebras involved be implicit, we may simply say that X is measurable. A measurable mapping X : Ω → R is referred to as a random variable. For convenience, we also write (X ∈ B) instead of X⁻¹(B) when B ⊆ R. Measurability of X ensures that whenever B ∈ B, the subset (X ∈ B) of Ω is F measurable, such that P(X ∈ B) is well-defined. Furthermore, the integral ∫ |X| dP is well-defined. In the case where it is finite, we say that X is integrable, and the integral ∫ X dP is then well-defined and finite. We refer to this as the mean of X and write EX = ∫ X dP. In the case where |X|^p is integrable for some p > 0, we say that X has p'th moment and write EX^p = ∫ X^p dP.

Also, if (Xi)i∈I is a family of variables, we denote by σ((Xi)i∈I) the σ-algebra generated by (Xi)i∈I, meaning the smallest σ-algebra on Ω making Xi measurable for all i ∈ I, or equivalently, the smallest σ-algebra containing H, where H is the class of subsets (Xi ∈ B) for i ∈ I and B ∈ B. Also, for families of variables, we write (Xi)i∈I and (Xi) interchangeably, understanding that the index set is implicit in the latter case.

1.2 Convergence of sequences of random variables

We are now ready to introduce sequences of random variables and consider their modes of
convergence. For the remainder of the chapter, we work within the context of a probability
space (Ω, F, P ).

Definition 1.2.1. A sequence of random variables (Xn )n≥1 is a sequence of mappings from
Ω to R such that each Xn is a random variable.

If (Xn )n≥1 is a sequence of random variables, we also refer to (Xn )n≥1 as a discrete-time
stochastic process, or simply a stochastic process. These names are interchangeable. For
brevity, we also write (Xn ) instead of (Xn )n≥1 . In Definition 1.2.1, all variables are assumed
to take values in R, in particular ruling out mappings taking the values ∞ or −∞ and ruling
out variables with values in Rd . This distinction is made solely for convenience, and if need
be, we will also refer to sequences of random variables with values in Rd or other measure
spaces as sequences of random variables.

A natural first question is when sequences of random variables exist with particular distributions. For example, does there exist a sequence of variables (Xn) such that X1, . . . , Xn
are independent for all n ≥ 1 and such that for each n ≥ 1, Xn has some particular given
distribution? Such questions are important, and will be relevant for our later construction of
examples and counterexamples, but are not our main concern here. For completeness, results
which will be sufficient for our needs are given in Appendix A.3.

The following fundamental definition outlines the various modes of convergence of random
variables to be considered in the following.

Definition 1.2.2. Let (Xn ) be a sequence of random variables, and let X be some other
random variable.

(1). Xn converges in probability to X if for all ε > 0, limn→∞ P (|Xn − X| ≥ ε) = 0.

(2). Xn converges almost surely to X if P (limn→∞ Xn = X) = 1.

(3). Xn converges in Lp to X for some p ≥ 1 if limn→∞ E|Xn − X|p = 0.

(4). Xn converges in distribution to X if for all bounded, continuous mappings f : R → R, limn→∞ Ef(Xn) = Ef(X).

In the affirmative, we write Xn −P→ X, Xn −a.s.→ X, Xn −Lp→ X and Xn −D→ X, respectively.

Definition 1.2.2 defines four modes of convergence: Convergence in probability, almost sure
convergence, convergence in Lp and convergence in distribution. Convergence in distribution
of random variables is also known as convergence in law. Note that convergence in Lp as
given in Definition 1.2.2 is equivalent to convergence in k · kp in the seminormed vector
space Lp (Ω, F, P ), see Section A.2. In the remainder of this section, we will investigate
the connections between these modes of convergence. A first question regards almost sure
convergence. The statement that P(limn→∞ Xn = X) = 1 is to be understood as meaning that the
set {ω ∈ Ω | Xn (ω) converges to X(ω)} has probability one. For this to make sense, it is
necessary that this set is measurable. The following lemma ensures that this is always the
case. For the proof of the lemma, we recall that for any family (Fi )i∈I of subsets of Ω, it
holds that

∩i∈I Fi = {ω ∈ Ω | ∀ i ∈ I : ω ∈ Fi } (1.1)
∪i∈I Fi = {ω ∈ Ω | ∃i ∈ I : ω ∈ Fi }, (1.2)

demonstrating the connection between set intersection and the universal quantifier and the
connection between set union and the existential quantifier.

Lemma 1.2.3. Let (Xn ) be a sequence of random variables, and let X be some other variable.
The subset F of Ω given by F = {ω ∈ Ω | Xn (ω) converges to X(ω)} is F measurable. In
particular, it holds that

F = ⋂_{m=1}^∞ ⋃_{n=1}^∞ ⋂_{k=n}^∞ (|Xk − X| ≤ 1/m). (1.3)

Proof. We first prove the equality (1.3), and to this end, we first show that for any sequence (xn) of real numbers and any real x, it holds that xn converges to x if and only if

∀ m ∈ N ∃ n ∈ N ∀ k ≥ n : |xk − x| ≤ 1/m. (1.4)

To this end, recall that xn converges to x if and only if

∀ ε > 0 ∃ n ∈ N ∀ k ≥ n : |xk − x| ≤ ε. (1.5)

It is immediate that (1.5) implies (1.4). We prove the converse implication. Therefore, assume that (1.4) holds. Let ε > 0 be given. Pick a natural m ≥ 1 so large that 1/m ≤ ε. Using (1.4), take a natural n ≥ 1 such that for all k ≥ n, |xk − x| ≤ 1/m. It then also holds that for k ≥ n, |xk − x| ≤ ε. Therefore, (1.5) holds, and so (1.5) and (1.4) are equivalent.

We proceed to prove (1.3). Using what we have already shown, we obtain

F = {ω ∈ Ω | Xn(ω) converges to X(ω)}
  = {ω ∈ Ω | ∀ ε > 0 ∃ n ∈ N ∀ k ≥ n : |Xk(ω) − X(ω)| ≤ ε}
  = {ω ∈ Ω | ∀ m ∈ N ∃ n ∈ N ∀ k ≥ n : |Xk(ω) − X(ω)| ≤ 1/m},

and applying (1.1) and (1.2), this yields

F = ⋂_{m=1}^∞ {ω ∈ Ω | ∃ n ∈ N ∀ k ≥ n : |Xk(ω) − X(ω)| ≤ 1/m}
  = ⋂_{m=1}^∞ ⋃_{n=1}^∞ {ω ∈ Ω | ∀ k ≥ n : |Xk(ω) − X(ω)| ≤ 1/m}
  = ⋂_{m=1}^∞ ⋃_{n=1}^∞ ⋂_{k=n}^∞ {ω ∈ Ω | |Xk(ω) − X(ω)| ≤ 1/m}
  = ⋂_{m=1}^∞ ⋃_{n=1}^∞ ⋂_{k=n}^∞ (|Xk − X| ≤ 1/m),
as desired. We have now proved (1.3). Next, as Xk and X are both F measurable mappings, |Xk − X| is F measurable as well, so the set (|Xk − X| ≤ 1/m) is in F. As a consequence, we obtain that ⋂_{m=1}^∞ ⋃_{n=1}^∞ ⋂_{k=n}^∞ (|Xk − X| ≤ 1/m) is an element of F. We conclude that F ∈ F, as desired.

Lemma 1.2.3 ensures that the definition of almost sure convergence given in Definition 1.2.2
is well-formed. A second immediate question regards convergence in probability: Does it
matter whether we consider the limit of P (|Xn − X| ≥ ε) or P (|Xn − X| > ε)? The following
lemma shows that this is not the case.

Lemma 1.2.4. Let (Xn ) be a sequence of random variables, and let X be some other variable.
It holds that Xn converges in probability to X if and only if it holds that for each ε > 0,
limn→∞ P (|Xn − X| > ε) = 0.

Proof. First assume that for each ε > 0, limn→∞ P(|Xn − X| > ε) = 0. We need to show that Xn converges in probability to X, meaning that for each ε > 0, limn→∞ P(|Xn − X| ≥ ε) = 0. To prove this, first fix ε > 0. We then obtain

lim sup_{n→∞} P(|Xn − X| ≥ ε) ≤ lim sup_{n→∞} P(|Xn − X| > ε/2) = 0,

so limn→∞ P(|Xn − X| ≥ ε) = 0, as desired. Conversely, if it holds for all ε > 0 that limn→∞ P(|Xn − X| ≥ ε) = 0, we find for any ε > 0 that

lim sup_{n→∞} P(|Xn − X| > ε) ≤ lim sup_{n→∞} P(|Xn − X| ≥ ε) = 0,

which proves the other implication.

Also, we show that limits for three of the modes of convergence considered are almost surely
unique.

Lemma 1.2.5. Let (Xn ) be a sequence of random variables and let X and Y be two other
variables. Assume that Xn converges both to X and to Y in probability, almost surely or in
Lp for some p ≥ 1. Then X and Y are almost surely equal.

Proof. First assume that Xn −P→ X and Xn −P→ Y. Fix ε > 0. Note that if |X − Xn| ≤ ε/2 and |Xn − Y| ≤ ε/2, we have |X − Y| ≤ ε. Therefore, we also find that |X − Y| > ε implies that either |X − Xn| > ε/2 or |Xn − Y| > ε/2. Hence, we obtain

P(|X − Y| ≥ ε) ≤ P(|X − Xn| + |Xn − Y| ≥ ε)
             ≤ P((|X − Xn| ≥ ε/2) ∪ (|Xn − Y| ≥ ε/2))
             ≤ P(|X − Xn| ≥ ε/2) + P(|Xn − Y| ≥ ε/2),

so that P(|X − Y| ≥ ε) ≤ lim sup_{n→∞} (P(|X − Xn| ≥ ε/2) + P(|Xn − Y| ≥ ε/2)) = 0. As (|X − Y| > 0) = ⋃_{n=1}^∞ (|X − Y| ≥ 1/n) and a countable union of null sets is again a null set, we conclude that (|X − Y| > 0) is a null set, such that X and Y are almost surely equal.

In the case where Xn −a.s.→ X and Xn −a.s.→ Y, the result follows since limits in R are unique. If Xn −Lp→ X and Xn −Lp→ Y, we obtain ‖X − Y‖_p ≤ lim sup_{n→∞} (‖X − Xn‖_p + ‖Xn − Y‖_p) = 0, so E|X − Y|^p = 0, yielding that X and Y are almost surely equal. Here, ‖ · ‖_p denotes the seminorm on Lp(Ω, F, P).

Having settled these preliminary questions, we next consider the question of whether some
of the modes of convergence imply another mode of convergence. Before proving our basic
theorem on this, we show a few lemmas of independent interest. In the following lemma,
f (X) denotes the random variable defined by f (X)(ω) = f (X(ω)).

Lemma 1.2.6. Let (Xn ) be a sequence of random variables, and let X be some other variable.
Let f : R → R be a continuous function. If Xn converges almost surely to X, then f (Xn )
converges almost surely to f (X). If Xn converges in probability to X, then f (Xn ) converges
in probability to f (X).

Proof. We first consider the case of almost sure convergence. Assume that Xn converges
almost surely to X. As f is continuous, we find for each ω that if Xn (ω) converges to X(ω),
f (Xn (ω)) converges to f (X(ω)) as well. Therefore,

P (f (Xn ) converges to f (X)) ≥ P (Xn converges to X) = 1,

proving the result. Next, we turn to the more difficult case of convergence in probability. Assume that Xn converges in probability to X; we need to prove that f(Xn) converges in probability to f(X). Let ε > 0; we thus need to show limn→∞ P(|f(Xn) − f(X)| > ε) = 0. To this end, let m ≥ 1. As [−(m + 1), m + 1] is compact, f is uniformly continuous on this set. Choose δ > 0 corresponding to ε in the definition of uniform continuity of f on this interval. We may assume without loss of generality that δ ≤ 1. We then have that for x and y in [−(m + 1), m + 1], |x − y| ≤ δ implies |f(x) − f(y)| ≤ ε. Now assume that |f(x) − f(y)| > ε. If |x − y| ≤ δ and |x| ≤ m, we obtain x, y ∈ [−(m + 1), m + 1] and thus a contradiction with |f(x) − f(y)| > ε.
Therefore, when |f(x) − f(y)| > ε, it must hold that either |x − y| > δ or |x| > m. This yields

P(|f(Xn) − f(X)| > ε) ≤ P((|Xn − X| > δ) ∪ (|X| > m))
                      ≤ P(|Xn − X| > δ) + P(|X| > m).

Note that while δ depends on m, neither δ nor m depends on n. Therefore, as Xn converges in probability to X, the above estimate allows us to conclude

lim sup_{n→∞} P(|f(Xn) − f(X)| > ε) ≤ lim sup_{n→∞} (P(|Xn − X| > δ) + P(|X| > m)) = P(|X| > m).

As m was arbitrary, we then finally obtain

lim sup_{n→∞} P(|f(Xn) − f(X)| > ε) ≤ lim_{m→∞} P(|X| > m) = 0,

by downwards continuity. This shows that f(Xn) converges in probability to f(X).

Lemma 1.2.7. Let X be a random variable, let p > 0 and let ε > 0. It then holds that P(|X| ≥ ε) ≤ ε^(−p) E|X|^p.

Proof. We simply note that P(|X| ≥ ε) = E 1_(|X|≥ε) ≤ ε^(−p) E|X|^p 1_(|X|≥ε) ≤ ε^(−p) E|X|^p, which yields the result.
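Lemma 1.2.7, a Markov-type inequality, can be checked by simulation. The sketch below is our own illustration with assumed parameters (a standard normal sample, ε = 1.5 and p = 2, in which case the bound is the classical Chebyshev inequality); it estimates both sides by Monte Carlo and confirms the ordering.

```python
import random

random.seed(0)
N = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]  # sample from a standard normal X

eps, p = 1.5, 2
tail = sum(1 for x in xs if abs(x) >= eps) / N         # estimates P(|X| >= eps)
bound = sum(abs(x) ** p for x in xs) / (N * eps ** p)  # estimates eps^(-p) E|X|^p

# Lemma 1.2.7: P(|X| >= eps) <= eps^(-p) E|X|^p, so tail should not exceed bound.
```

Here E|X|^2 = 1, so the bound is 1/ε² ≈ 0.44, while the true tail probability P(|X| ≥ 1.5) is well below it; the inequality is far from tight, but always valid.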

Theorem 1.2.8. Let (Xn ) be a sequence of random variables, and let X be some other
variable. If Xn converges in Lp to X for some p ≥ 1, or if Xn converges almost surely to
X, then Xn also converges in probability to X. If Xn converges in probability to X, then Xn
also converges in distribution to X.

Proof. We need to prove three implications. First assume that Xn converges in Lp to X for some p ≥ 1; we want to show that Xn converges in probability to X. By Lemma 1.2.7, it holds for any ε > 0 that

lim sup_{n→∞} P(|Xn − X| ≥ ε) ≤ lim sup_{n→∞} ε^(−p) E|Xn − X|^p = 0,

so P(|Xn − X| ≥ ε) converges as n tends to infinity, and the limit is zero. Therefore, Xn converges in probability to X. Next, assume that Xn converges almost surely to X. Again, we wish to show that Xn converges in probability to X. Fix ε > 0. Using that
(|Xn − X| ≥ ε) ⊆ ⋃_{k=n}^∞ (|Xk − X| ≥ ε) and that the sequence (⋃_{k=n}^∞ (|Xk − X| ≥ ε))_{n≥1} is decreasing, we find

lim sup_{n→∞} P(|Xn − X| ≥ ε) ≤ lim_{n→∞} P(⋃_{k=n}^∞ (|Xk − X| ≥ ε))
                              = P(⋂_{n=1}^∞ ⋃_{k=n}^∞ (|Xk − X| ≥ ε))
                              ≤ P(Xn does not converge to X) = 0,

so Xn converges in probability to X, as desired. Finally, we need to show that if Xn converges


in probability to X, then Xn also converges in distribution to X. Assume that Xn converges
in probability to X, and let f : R → R be bounded and continuous. Let c ≥ 0 be such that
|f (x)| ≤ c for all x. Applying the triangle inequality, |f (Xn )−f (X)| ≤ |f (Xn )|+|f (X)| ≤ 2c,
and so we obtain for any ε > 0 that

|Ef(Xn) − Ef(X)| ≤ E|f(Xn) − f(X)|
 = E|f(Xn) − f(X)| 1_(|f(Xn)−f(X)|>ε) + E|f(Xn) − f(X)| 1_(|f(Xn)−f(X)|≤ε)
 ≤ 2c P(|f(Xn) − f(X)| > ε) + ε. (1.6)

By Lemma 1.2.6, f (Xn ) converges in probability to f (X). Therefore, (1.6) shows that
lim supn→∞ |Ef (Xn ) − Ef (X)| ≤ ε. As ε > 0 was arbitrary, this allows us to conclude
lim supn→∞ |Ef (Xn ) − Ef (X)| = 0, and as a consequence, limn→∞ Ef (Xn ) = Ef (X). This
proves the desired convergence in distribution of Xn to X.

Theorem 1.2.8 shows that among the four modes of convergence defined in Definition 1.2.2,
convergence in Lp and almost sure convergence are the strongest, convergence in probability
is weaker than both, and convergence in distribution is weaker still. There is no general
simple relationship between convergence in Lp and almost sure convergence. Note also an essential difference between convergence in distribution and the other three modes of convergence: while convergence in Lp, almost sure convergence and convergence in probability all depend on the joint distribution of (Xn, X), convergence in distribution depends only on the marginal laws of Xn and X. For this reason, the theory for convergence in distribution is somewhat different from the theory for the other three modes of convergence.
In the remainder of this chapter and the next, we only consider the other three modes of
convergence.
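To see that convergence in probability (or in Lp) need not imply almost sure convergence, a standard example — not discussed in the text, and simulated here only as an illustration — takes independent Xn with P(Xn = 1) = 1/n and P(Xn = 0) = 1 − 1/n. Then P(|Xn| ≥ ε) = 1/n → 0 and E|Xn|^p = 1/n → 0, yet by a converse of the Borel-Cantelli lemma for independent events, a typical sample path takes the value 1 at arbitrarily late times, so Xn does not converge almost surely.

```python
import random

random.seed(1)

def path(n_max):
    """One sample path of X_n with P(X_n = 1) = 1/n, independently in n."""
    return [1 if random.random() < 1.0 / n else 0 for n in range(1, n_max + 1)]

xs = path(100_000)

# In probability: ones become rare late in the path, matching P(X_n = 1) = 1/n -> 0.
late_ones = sum(xs[50_000:])

# Almost sure failure: on a typical path, ones still occur at late indices.
last_one = max(n for n, x in enumerate(xs, start=1) if x == 1)
```

Note that X1 = 1 with probability one, so the path always contains at least one 1, while the density of ones in any late window is close to zero.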

Example 1.2.9. Let ξ ∈ R, let σ > 0 and let (Xn) be a sequence of random variables such that for all n ≥ 1, Xn is normally distributed with mean ξ and variance σ². Assume furthermore that X1, . . . , Xn are independent for all n ≥ 1. Put ξ̂n = (1/n) Σ_{k=1}^n Xk. We claim that ξ̂n converges in Lp to ξ for all p ≥ 1.

To prove this, note that by the properties of normal distributions, (1/n) Σ_{k=1}^n Xk is normally distributed with mean ξ and variance σ²/n. Therefore, √n σ⁻¹(ξ − (1/n) Σ_{k=1}^n Xk) is standard normally distributed. With mp denoting the p'th absolute moment of the standard normal distribution, we thus obtain

E|ξ − ξ̂n|^p = E|ξ − (1/n) Σ_{k=1}^n Xk|^p = (σ^p / n^(p/2)) E|√n σ⁻¹(ξ − (1/n) Σ_{k=1}^n Xk)|^p = σ^p mp / n^(p/2),

which converges to zero, proving that ξ̂n −Lp→ ξ for all p ≥ 1. ◦
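For p = 2 the rate in Example 1.2.9 is explicit: m2 = 1, so E|ξ − ξ̂n|² = σ²/n. The Monte Carlo sketch below uses our own parameter choices (ξ = 3, σ = 2, n = 50, which are not from the text) to estimate this mean squared error and compare it with σ²/n.

```python
import random

random.seed(2)
xi, sigma, n, reps = 3.0, 2.0, 50, 20_000

total = 0.0
for _ in range(reps):
    # One realization of xi_hat_n, the average of n i.i.d. N(xi, sigma^2) draws.
    xbar = sum(random.gauss(xi, sigma) for _ in range(n)) / n
    total += (xbar - xi) ** 2
mse = total / reps          # Monte Carlo estimate of E|xi - xi_hat_n|^2
theory = sigma ** 2 / n     # sigma^p * m_p / n^(p/2) with p = 2 and m_2 = 1
```

With these parameters, theory = 4/50 = 0.08, and the simulated mean squared error matches it to within Monte Carlo error.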

The following lemma shows that almost sure convergence and convergence in probability
enjoy strong stability properties.

Lemma 1.2.10. Let (Xn ) and (Yn ) be sequences of random variables, and let X and Y
be two other random variables. If Xn converges in probability to X and Yn converges in
probability to Y , then Xn + Yn converges in probability to X + Y , and Xn Yn converges in
probability to XY . Also, if Xn converges almost surely to X and Yn converges almost surely
to Y , then Xn + Yn converges almost surely to X + Y , and Xn Yn converges almost surely to
XY .

Proof. We first show the claims for almost sure convergence. Assume that Xn converges
almost surely to X and that Yn converges almost surely to Y . Note that as addition is
continuous, we have that whenever Xn (ω) converges to X(ω) and Yn (ω) converges to Y (ω),
it also holds that Xn (ω) + Yn (ω) converges to X(ω) + Y (ω). Therefore,

P (Xn + Yn converges to X + Y ) ≥ P ((Xn converges to X) ∩ (Yn converges to Y )) = 1,

and by a similar argument, we find

P (Xn Yn converges to XY ) ≥ P ((Xn converges to X) ∩ (Yn converges to Y )) = 1,

since the intersection of two almost sure sets also is an almost sure set. This proves the
claims on almost sure convergence. Next, assume that Xn converges in probability to X and
that Yn converges in probability to Y . We first show that Xn + Yn converges in probability
to X + Y . Let ε > 0 be given. We then obtain

lim sup_{n→∞} P(|Xn + Yn − (X + Y)| ≥ ε) ≤ lim sup_{n→∞} P((|Xn − X| ≥ ε/2) ∪ (|Yn − Y| ≥ ε/2))
 ≤ lim sup_{n→∞} (P(|Xn − X| ≥ ε/2) + P(|Yn − Y| ≥ ε/2)) = 0,

proving the claim. Finally, we show that Xn Yn converges in probability to XY. This will follow if we show that Xn Yn − XY converges in probability to zero. To this end, we note the
relationship Xn Yn − XY = (Xn − X)(Yn − Y ) + (Xn − X)Y + (Yn − Y )X, so by what we


already have shown, it suffices to show that each of these three terms converge in probability
to zero. For the first term, recall that for all x, y ∈ R, x2 +2xy +y 2 ≥ 0 and x2 −2xy +y 2 ≥ 0,
so that |xy| ≤ 21 x2 + 12 y 2 . Therefore, we obtain for all ε > 0 that

P (|(Xn − X)(Yn − Y )| ≥ ε) ≤ P ( 12 (Xn − X)2 + 21 (Yn − Y )2 ≥ ε)


≤ P ( 12 (Xn − X)2 ≥ 12 ε) + P ( 12 (Yn − Y )2 ≥ 12 ε)
√ √
= P (|Xn − X| ≥ ε) + P (|Yn − Y | ≥ ε).

Taking the limes superior, we conclude limn→∞ P (|(Xn − X)(Yn − Y )| ≥ ε) = 0, and thus
(Xn − X)(Yn − Y ) converges in probability to zero. Next, we show that (Xn − X)Y converges
in probability to zero. Again, let ε > 0. Consider also some m ≥ 1. We then obtain

P (|(Xn − X)Y | ≥ ε)
= P ((|(Xn − X)Y | ≥ ε) ∩ (|Y | ≤ m)) + P ((|(Xn − X)Y | ≥ ε) ∩ (|Y | > m))
1
≤ P (m|Xn − X| ≥ ε) + P (|Y | > m) = P (|Xn − X| ≥ m ε) + P (|Y | > m).

Therefore, we obtain lim sup_{n→∞} P(|(Xn − X)Y| ≥ ε) ≤ P(|Y| > m) for all m ≥ 1, from
which we conclude lim sup_{n→∞} P(|(Xn − X)Y| ≥ ε) ≤ lim_{m→∞} P(|Y| > m) = 0, by downwards
continuity. This shows that (Xn − X)Y converges in probability to zero. By a similar
argument, we also conclude that (Yn − Y)X converges in probability to zero. Combining our
results, we conclude that Xn Yn converges in probability to XY, as desired.
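
As a numerical aside (not part of the text; the noise model, tolerance and sample sizes are arbitrary choices of ours), the stability properties of Lemma 1.2.10 are easy to observe by Monte Carlo: with Xn = X + Zn/n converging in probability to X, and similarly for Yn, the estimated probability P(|Xn Yn − XY| ≥ ε) shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_violation(n, eps=0.1, samples=10_000):
    # Xn = X + noise/n and Yn = Y + noise/n converge in probability to X
    # and Y, so by Lemma 1.2.10 the product Xn * Yn converges in
    # probability to X * Y. We estimate P(|Xn Yn - XY| >= eps) empirically.
    X = rng.standard_normal(samples)
    Y = rng.standard_normal(samples)
    Xn = X + rng.standard_normal(samples) / n
    Yn = Y + rng.standard_normal(samples) / n
    return np.mean(np.abs(Xn * Yn - X * Y) >= eps)

p_early, p_late = estimate_violation(2), estimate_violation(1000)
print(p_early, p_late)
```

With these choices, the estimated violation probability is large for n = 2 but essentially zero for n = 1000.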

Lemma 1.2.10 could also have been proven using a multidimensional version of Lemma 1.2.6
and the continuity of addition and multiplication. Our next goal is to prove another connection
between two of the modes of convergence, namely that convergence in probability
implies almost sure convergence along a subsequence, and to use this to show completeness
properties of each of our three modes of convergence: we wish to argue that Cauchy sequences
are convergent for convergence in Lp, for almost sure convergence, and for convergence in
probability.

We begin by showing the Borel-Cantelli lemma, a general result which will be useful in several
contexts. Let (Fn ) be a sequence of events. We then define

(Fn i.o.) = {ω ∈ Ω | ω ∈ Fn infinitely often},
(Fn evt.) = {ω ∈ Ω | ω ∈ Fn eventually}.

Note that ω ∈ Fn for infinitely many n if and only if for each n ≥ 1, there exists k ≥ n
such that ω ∈ Fk. Likewise, it holds that ω ∈ Fn eventually if and only if there exists n ≥ 1
such that for all k ≥ n, ω ∈ Fk. Therefore, we also have (Fn i.o.) = ∩_{n=1}^∞ ∪_{k=n}^∞ Fk and
(Fn evt.) = ∪_{n=1}^∞ ∩_{k=n}^∞ Fk. This shows in particular that the sets (Fn i.o.) and (Fn evt.) are
measurable. Also, we obtain the equality (Fn i.o.)^c = (Fn^c evt.). It is customary also to write
lim sup_{n→∞} Fn for (Fn i.o.) and lim inf_{n→∞} Fn for (Fn evt.), although this is not a notation
which we will be using. The main useful result about events occurring infinitely often is the
following.
Lemma 1.2.11 (Borel-Cantelli). Let (Fn) be a sequence of events. If Σ_{n=1}^∞ P(Fn) is finite,
then P(Fn i.o.) = 0.

Proof. As the sequence of sets (∪_{k=n}^∞ Fk)_{n≥1} is decreasing, we obtain by the downward
continuity of probability measures that

P(Fn i.o.) = P(∩_{n=1}^∞ ∪_{k=n}^∞ Fk) = lim_{n→∞} P(∪_{k=n}^∞ Fk) ≤ lim_{n→∞} Σ_{k=n}^∞ P(Fk) = 0,

with the final equality holding since the tail sum of a convergent series always tends to
zero.
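
The lemma is easy to illustrate by simulation (the events and constants below are our own illustrative choices, not from the text). With independent events Fn = (Un < 1/n²), the series Σ_{n=1}^∞ P(Fn) = π²/6 is finite, so almost every path should see only finitely many of the Fn occur, and events far out in the sequence should be rare.

```python
import numpy as np

rng = np.random.default_rng(1)

# F_n = (U_n < 1/n^2) for independent uniforms U_n: the series
# sum_n P(F_n) = pi^2 / 6 is finite, so P(F_n i.o.) = 0 by Borel-Cantelli.
n_max, paths = 5000, 2000
n = np.arange(1, n_max + 1)
events = rng.random((paths, n_max)) < 1.0 / n**2

total = events.sum(axis=1)           # number of events occurring on each path
late = events[:, 100:].any(axis=1)   # any event with index above 100?
print(total.mean(), late.mean())
```

The mean count stays near Σ 1/n² ≈ 1.64, and only about one path in a hundred sees any event beyond index 100, in line with the bound Σ_{n>100} 1/n² ≈ 0.01.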

Lemma 1.2.12. Let (Xn) be a sequence of random variables, and let X be some other
variable. Assume that for all ε > 0, Σ_{n=1}^∞ P(|Xn − X| ≥ ε) is finite. Then Xn converges
almost surely to X.

Proof. Recalling (1.3), it suffices to show that

P(∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xk − X| ≤ 1/m)) = 1.

Fix ε > 0. By Lemma 1.2.11, we find that the set (|Xn − X| ≥ ε i.o.) has probability zero.
As (|Xn − X| < ε evt.)^c = (|Xn − X| ≥ ε i.o.), we obtain P(|Xn − X| < ε evt.) = 1. As
ε > 0 was arbitrary, we in particular obtain P(|Xn − X| ≤ 1/m evt.) = 1 for all m ≥ 1. As the
intersection of a countable family of almost sure events again is an almost sure event, this
yields

P(∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xk − X| ≤ 1/m)) = P(∩_{m=1}^∞ (|Xn − X| ≤ 1/m evt.)) = 1,

as desired.

Lemma 1.2.13. Let (Xn) be a sequence of random variables, and let X be some other
variable. Assume that Xn converges in probability to X. Then there is a subsequence (Xn_k)
converging almost surely to X.

Proof. Let (εk)_{k≥1} be a sequence of positive numbers decreasing to zero. For each k, it holds
that lim_{n→∞} P(|Xn − X| ≥ εk) = 0. In particular, for any k, n* ≥ 1, we may always pick
n > n* such that P(|Xn − X| ≥ εk) ≤ 2^{−k}. Therefore, we may recursively define a strictly
increasing sequence of indices (n_k)_{k≥1} such that for each k, P(|Xn_k − X| ≥ εk) ≤ 2^{−k}. We
claim that the sequence (Xn_k)_{k≥1} satisfies the criterion of Lemma 1.2.12. To see this, let
ε > 0. As (εk)_{k≥1} decreases to zero, there is m such that for k ≥ m, εk ≤ ε. We then obtain

Σ_{k=m}^∞ P(|Xn_k − X| ≥ ε) ≤ Σ_{k=m}^∞ P(|Xn_k − X| ≥ εk) ≤ Σ_{k=m}^∞ 2^{−k},

which is finite. Hence, Σ_{k=1}^∞ P(|Xn_k − X| ≥ ε) is also finite, and Lemma 1.2.12 then shows
that Xn_k converges almost surely to X.
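
The passage to a subsequence in Lemma 1.2.13 cannot in general be avoided. The classical "typewriter" sequence on [0, 1) (a standard example, sketched here in code as our own illustration) converges in probability to zero while converging at no single point ω, yet the subsequence along powers of two converges at every ω > 0.

```python
# For n = 2^k + j with 0 <= j < 2^k, let X_n be the indicator of the dyadic
# interval [j 2^{-k}, (j+1) 2^{-k}) on the probability space [0, 1) with
# Lebesgue measure. Then P(X_n >= eps) = 2^{-k} -> 0, so X_n -> 0 in
# probability, yet X_n(omega) = 1 exactly once on each dyadic level, so the
# sequence X_n(omega) oscillates between 0 and 1 forever.
def X(n, omega):
    k = n.bit_length() - 1        # the dyadic level: n = 2^k + j
    j = n - (1 << k)
    return 1.0 if j * 2.0**-k <= omega < (j + 1) * 2.0**-k else 0.0

omega = 0.3
path = [X(n, omega) for n in range(1, 2**10)]   # levels k = 0, ..., 9 in full
hits = sum(path)                                # one hit per level: 10 hits

# The subsequence X_{2^k} = 1_{[0, 2^{-k})} converges to 0 at every omega > 0.
sub = [X(2**k, omega) for k in range(1, 10)]
print(hits, sub)
```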

We are now almost ready to introduce the concept of being Cauchy with respect to each of
our three modes of convergence and show that being Cauchy implies convergence.
Lemma 1.2.14. Let (Xn) be a sequence of random variables. It then holds that

(Xn is Cauchy) = ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xn − Xk| ≤ 1/m),  (1.7)

and in particular, (Xn is Cauchy) is measurable.

Proof. Recall that a sequence (xn) in R is Cauchy if and only if

∀ ε > 0 ∃ n ∈ N ∀ k, i ≥ n : |xk − xi| ≤ ε.  (1.8)

We will first argue that this is equivalent to

∀ m ∈ N ∃ n ∈ N ∀ k ≥ n : |xk − xn| ≤ 1/m.  (1.9)

To this end, first assume that (1.8) holds. Let m ∈ N be given and choose ε > 0 so small
that ε ≤ 1/m. Using (1.8), take n ∈ N so that |xk − xi| ≤ ε whenever k, i ≥ n. Then it holds in
particular that |xk − xn| ≤ 1/m for all k ≥ n. Thus, (1.9) holds. To prove the converse implication, assume
that (1.9) holds. Let ε > 0 be given and take m ∈ N so large that 1/m ≤ ε/2. Using (1.9),
take n ∈ N so that for all k ≥ n, |xk − xn| ≤ 1/m. We then obtain that for all k, i ≥ n, it holds
that |xk − xi| ≤ |xk − xn| + |xi − xn| ≤ 2/m ≤ ε. We conclude that (1.8) holds. We have now
shown that (1.8) and (1.9) are equivalent.

Using this result, we obtain

(Xn is Cauchy) = {ω ∈ Ω | ∀ ε > 0 ∃ n ∈ N ∀ k, i ≥ n : |Xk(ω) − Xi(ω)| ≤ ε}
= {ω ∈ Ω | ∀ m ∈ N ∃ n ∈ N ∀ k ≥ n : |Xk(ω) − Xn(ω)| ≤ 1/m}
= ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ (|Xn − Xk| ≤ 1/m),

as desired. As a consequence, the set (Xn is Cauchy) is F measurable.

We are now ready to define what it means to be Cauchy with respect to each of our modes
of convergence. In the definition, we use the convention that a double sequence (xnm )n,m≥1
converges to x as n and m tend to infinity if it holds that for all ε > 0, there is k ≥ 1 such
that |xnm − x| ≤ ε whenever n, m ≥ k. In particular, a sequence (xn )n≥1 is Cauchy if and
only if |xn − xm | tends to zero as n and m tend to infinity.

Definition 1.2.15. Let (Xn ) be a sequence of random variables. We say that Xn is Cauchy
in probability if it holds for any ε > 0 that P (|Xn − Xm | ≥ ε) tends to zero as m and n tend
to infinity. We say that Xn is almost surely Cauchy if P ((Xn ) is Cauchy) = 1. Finally, we
say that Xn is Cauchy in Lp for some p ≥ 1 if E|Xn − Xm |p tends to zero as m and n tend
to infinity.

Note that Lemma 1.2.14 ensures that the definition of being almost surely Cauchy is well-
formed, since (Xn is Cauchy) is measurable.

Theorem 1.2.16. Let (Xn ) be a sequence of random variables. If Xn is Cauchy in proba-


bility, there exists a random variable X such that Xn converges in probability to X. If Xn is
almost surely Cauchy, there exists a random variable X such that Xn converges almost surely
to X. If (Xn ) is a sequence in Lp which is Cauchy in Lp , there exists a random variable X
in Lp such that Xn converges to X in Lp .

Proof. The result on sequences which are Cauchy in Lp is immediate from Fischer’s com-
pleteness theorem, so we merely need to show the results for being Cauchy in probability and
being almost surely Cauchy.

Consider the case where Xn is almost surely Cauchy. As R equipped with the Euclidean
metric is complete, (Xn is convergent) = (Xn is Cauchy), so in particular, by Lemma 1.2.14,
the former is a measurable almost sure set. Define X by letting X = limn→∞ Xn when the
limit exists and zero otherwise. Then X is measurable, and we have

P(lim_{n→∞} Xn = X) = P(Xn is Cauchy) = 1,

so Xn converges almost surely to X, proving the result for being almost surely Cauchy.

Finally, assume that Xn is Cauchy in probability. For each k, P(|Xn − Xm| ≥ 2^{−k}) tends
to zero as m and n tend to infinity. In particular, we find that for each k, there is n*
such that for n, m ≥ n*, it holds that P(|Xn − Xm| ≥ 2^{−k}) ≤ 2^{−k}. Therefore, we may
pick a sequence of strictly increasing indices (n_k) such that P(|Xn − Xm| ≥ 2^{−k}) ≤ 2^{−k}
for n, m ≥ n_k. We then obtain in particular that P(|Xn_{k+1} − Xn_k| ≥ 2^{−k}) ≤ 2^{−k} for all
k ≥ 1. From this, we find that Σ_{k=1}^∞ P(|Xn_{k+1} − Xn_k| ≥ 2^{−k}) is finite, so by Lemma 1.2.11,
P(|Xn_{k+1} − Xn_k| ≥ 2^{−k} i.o.) = 0, leading to P(|Xn_{k+1} − Xn_k| < 2^{−k} evt.) = 1. In particular,
it holds almost surely that Σ_{k=1}^∞ |Xn_{k+1} − Xn_k| is finite. For any k > i ≥ 1, we have

|Xn_k − Xn_i| ≤ Σ_{j=i}^{k−1} |Xn_{j+1} − Xn_j| ≤ Σ_{j=i}^∞ |Xn_{j+1} − Xn_j|.

Now, as the tail sums of convergent sums tend to zero, the above shows that on the almost
sure set where Σ_{k=1}^∞ |Xn_{k+1} − Xn_k| is finite, (Xn_k)_{k≥1} is Cauchy. In particular, (Xn_k)_{k≥1}
is almost surely Cauchy, so by what was already shown, there exists a variable X such that
Xn_k converges almost surely to X. In order to complete the proof, we will argue that (Xn)
converges in probability to X. To this end, fix ε > 0 and let δ > 0. As (Xn) is Cauchy in
probability, there is n* such that for m, n ≥ n*, P(|Xn − Xm| ≥ ε/2) ≤ δ. And as Xn_k
converges almost surely to X, Xn_k also converges in probability to X by Theorem 1.2.8.
Therefore, for k large enough, P(|Xn_k − X| ≥ ε/2) ≤ δ. Let k be so large that this holds and
simultaneously so large that n_k ≥ n*. We then obtain for n ≥ n* that

P(|Xn − X| ≥ ε) ≤ P(|Xn − Xn_k| + |Xn_k − X| ≥ ε)
≤ P(|Xn − Xn_k| ≥ ε/2) + P(|Xn_k − X| ≥ ε/2) ≤ 2δ.

Thus, for n large enough, P(|Xn − X| ≥ ε) ≤ 2δ. As δ was arbitrary, we conclude that
lim_{n→∞} P(|Xn − X| ≥ ε) = 0, showing that Xn converges in probability to X. This concludes
the proof.

This concludes our preliminary investigation of convergence of sequences of random variables.

1.3 Independence and Kolmogorov’s zero-one law

In this section, we generalize the classical notion of independence of random variables and
events to a notion of independence of σ-algebras. This general notion of independence en-
compasses all types of independence which will be relevant to us.

Definition 1.3.1. Let I be some set and let (Fi)_{i∈I} be a family of σ-algebras. We say that
the family of σ-algebras is independent if it holds for any finite sequence of distinct indices
i1, . . . , in ∈ I and any F1 ∈ Fi1 , . . . , Fn ∈ Fin that

P(∩_{k=1}^n Fk) = Π_{k=1}^n P(Fk).  (1.10)

The abstract definition in Definition 1.3.1 will allow us considerable convenience as regards
matters of independence. The following lemma shows that when we wish to prove indepen-
dence, it suffices to prove the equality (1.10) for generating families which are stable under
finite intersections.

Lemma 1.3.2. Let I be some set and let (Fi)_{i∈I} be a family of σ-algebras. Assume that for
each i, Fi = σ(Hi), where Hi is a set family which is stable under finite intersections. If it
holds for any finite sequence of distinct indices i1, . . . , in ∈ I and any F1 ∈ Hi1 , . . . , Fn ∈ Hin
that P(∩_{k=1}^n Fk) = Π_{k=1}^n P(Fk), then (Fi)_{i∈I} is independent.

Proof. We apply Dynkin's lemma and an induction proof. We wish to show that for each n,
it holds for all sequences of n distinct indices i1, . . . , in ∈ I and all finite sequences of sets
F1 ∈ Fi1 , . . . , Fn ∈ Fin that P(∩_{k=1}^n Fk) = Π_{k=1}^n P(Fk). The induction start is trivial, so it
suffices to show the induction step. Assume that the result holds for n; we wish to prove it
for n + 1. Fix a finite sequence of n + 1 distinct indices i1, . . . , in+1 ∈ I. We wish to show
that

P(∩_{k=1}^{n+1} Fk) = Π_{k=1}^{n+1} P(Fk)  (1.11)

for F1 ∈ Fi1 , . . . , Fn+1 ∈ Fin+1 . To this end, let k ≤ n + 1, and let Fj ∈ Fij for j ≠ k. Define

D = { Fk ∈ Fik | P(∩_{j=1}^{n+1} Fj) = Π_{j=1}^{n+1} P(Fj) }.  (1.12)

We claim that D is a Dynkin class. To see this, we need to prove that Ω ∈ D, that B \ A ∈ D
whenever A ⊆ B and A, B ∈ D, and that whenever (An) is an increasing sequence in D,
∪_{n=1}^∞ An ∈ D as well. By our induction assumption, Ω ∈ D. Let A, B ∈ D with A ⊆ B. We
then obtain

P((B \ A) ∩ ∩_{j≠k} Fj) = P(B ∩ A^c ∩ ∩_{j≠k} Fj) = P((B ∩ ∩_{j≠k} Fj) ∩ A^c)
= P((B ∩ ∩_{j≠k} Fj) ∩ (A ∩ ∩_{j≠k} Fj)^c)
= P(B ∩ ∩_{j≠k} Fj) − P(A ∩ ∩_{j≠k} Fj)
= P(B) Π_{j≠k} P(Fj) − P(A) Π_{j≠k} P(Fj) = P(B \ A) Π_{j≠k} P(Fj),
so that B \ A ∈ D. Finally, let (An) be an increasing sequence of sets in D. We then obtain

P((∪_{n=1}^∞ An) ∩ ∩_{j≠k} Fj) = P(∪_{n=1}^∞ (An ∩ ∩_{j≠k} Fj)) = lim_{n→∞} P(An ∩ ∩_{j≠k} Fj)
= lim_{n→∞} P(An) Π_{j≠k} P(Fj) = P(∪_{n=1}^∞ An) Π_{j≠k} P(Fj),

so ∪_{n=1}^∞ An ∈ D. This shows that D is a Dynkin class.

We are now ready to argue that (1.11) holds. Note that by our assumption, we know that
(1.11) holds for F1 ∈ Hi1 , . . . , Fn+1 ∈ Hin+1 . Consider F2 ∈ Hi2 , . . . , Fn+1 ∈ Hin+1 . The
family D as defined in (1.12) then contains Hi1 , and so Dynkin’s lemma yields Fi1 = σ(Hi1 ) ⊆
D. This shows that (1.11) holds when F1 ∈ Fi1 and F2 ∈ Hi2 , . . . , Fn+1 ∈ Hin+1 . Next, let
F1 ∈ Fi1 and consider a finite sequence of sets F3 ∈ Hi3 , . . . , Fn+1 ∈ Hin+1 . Then D as defined
in (1.12) contains Hi2 , and therefore by Dynkin’s lemma contains σ(Hi2 ) = Fi2 , proving that
(1.11) holds when F1 ∈ Fi1 , F2 ∈ Fi2 and F3 ∈ Hi3 , . . . , Fn+1 ∈ Hin+1 . By a finite induction
argument, we conclude that (1.11) in fact holds when F1 ∈ Fi1 , . . . , Fn+1 ∈ Fin+1 , as desired.
This proves the induction step and thus concludes the proof.

The following definition shows how we may define independence between families of variables
and families of events from Definition 1.3.1.

Definition 1.3.3. Let I be some set and let (Xi )i∈I be a family of random variables. We
say that the family is independent when the family of σ-algebras (σ(Xi ))i∈I is independent.
Also, if (Fi )i∈I is a family of events, we say that the family is independent when the family
of σ-algebras (σ(1Fi ))i∈I is independent.

Next, we show that Definition 1.3.3 agrees with our usual definitions of independence.

Lemma 1.3.4. Let I be some set and let (Xi)_{i∈I} be a family of random variables. The family
is independent if and only if it holds for any finite sequence of distinct indices i1, . . . , in ∈ I
and any A1, . . . , An ∈ B that P(∩_{k=1}^n (Xik ∈ Ak)) = Π_{k=1}^n P(Xik ∈ Ak).

Proof. From Definition 1.3.3, we have that (Xi)_{i∈I} is independent if and only if (σ(Xi))_{i∈I} is
independent, which by Definition 1.3.1 is the case if and only if for any finite sequence
of distinct indices i1, . . . , in ∈ I and any F1 ∈ σ(Xi1), . . . , Fn ∈ σ(Xin) it holds that
P(∩_{k=1}^n Fk) = Π_{k=1}^n P(Fk). However, we have σ(Xi) = {(Xi ∈ A) | A ∈ B} for all i ∈ I, so
the condition is equivalent to requiring that P(∩_{k=1}^n (Xik ∈ Ak)) = Π_{k=1}^n P(Xik ∈ Ak) for
any finite sequence of distinct indices i1, . . . , in ∈ I and any A1, . . . , An ∈ B. This proves
the claim.

Lemma 1.3.5. Let I be some set and let (Fi)_{i∈I} be a family of events. The family is
independent if and only if it holds for any finite sequence of distinct indices i1, . . . , in ∈ I
that P(∩_{k=1}^n Fik) = Π_{k=1}^n P(Fik).

Proof. From Definition 1.3.3, (Fi)_{i∈I} is independent if and only if (σ(1_{Fi}))_{i∈I} is independent.
Note that for all i ∈ I, σ(1_{Fi}) = {Ω, ∅, Fi, Fi^c}, so σ(1_{Fi}) is generated by {Fi}. Therefore,
Lemma 1.3.2 yields the conclusion.
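
On a finite product space, the defining product formula can be checked exactly rather than statistically. The sketch below (our own illustration; the dice model is an arbitrary choice, not from the text) enumerates the product measure for three independent fair dice and verifies the equality of Lemma 1.3.5 for events depending on distinct coordinates.

```python
from itertools import product
from fractions import Fraction

# The product measure on {1, ..., 6}^3 models three independent fair dice;
# every outcome carries mass 1/216.
outcomes = list(product(range(1, 7), repeat=3))
p = Fraction(1, len(outcomes))

def prob(event):
    # Exact probability of an event, as a sum of point masses.
    return sum((p for w in outcomes if event(w)), Fraction(0))

# Events depending on distinct coordinates, hence independent:
F1 = lambda w: w[0] % 2 == 0     # first die is even
F2 = lambda w: w[1] >= 5         # second die shows 5 or 6
F3 = lambda w: w[2] == 1         # third die shows 1

lhs = prob(lambda w: F1(w) and F2(w) and F3(w))
rhs = prob(F1) * prob(F2) * prob(F3)
print(lhs, rhs)   # both equal 1/36
```

Using exact `Fraction` arithmetic rather than floats makes the verification an identity instead of an approximation.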

We will also have need of the following properties of independence.


Lemma 1.3.6. Let I be some set and let (Fi )i∈I be a family of σ-algebras. Let (Gi )i∈I be
another family of σ-algebras, and assume that Gi ⊆ Fi for all i ∈ I. If (Fi )i∈I is independent,
so is (Gi )i∈I .

Proof. This follows immediately from Definition 1.3.1.

Lemma 1.3.7. Let I be some set and let (Xi )i∈I be a family of independent variables. For
each i, let ψi : R → R be some measurable mapping. Then (ψi (Xi ))i∈I is also independent.

Proof. As σ(ψi (Xi )) ⊆ σ(Xi ), this follows from Lemma 1.3.6.

Lemma 1.3.8. Let I be some set and let (Fi)_{i∈I} be an independent family of σ-algebras.
Let J, J′ ⊆ I and assume that J and J′ are disjoint. Then, the σ-algebras σ((Fi)_{i∈J}) and
σ((Fi)_{i∈J′}) are independent.

Proof. Let G = σ((Fi)_{i∈J}) and G′ = σ((Fi)_{i∈J′}). We define

H = {∩_{k=1}^n Fk | n ≥ 1, i1, . . . , in ∈ J and F1 ∈ Fi1 , . . . , Fn ∈ Fin}
H′ = {∩_{k=1}^{n′} Gk | n′ ≥ 1, i′1, . . . , i′n′ ∈ J′ and G1 ∈ Fi′1 , . . . , Gn′ ∈ Fi′n′}.

Then H and H′ are generating families for G and G′, respectively, stable under finite intersections.
Now let F ∈ H and G ∈ H′. Then, there exist n, n′ ≥ 1, i1, . . . , in ∈ J,
i′1, . . . , i′n′ ∈ J′, F1 ∈ Fi1 , . . . , Fn ∈ Fin and G1 ∈ Fi′1 , . . . , Gn′ ∈ Fi′n′ such that
F = ∩_{k=1}^n Fk and G = ∩_{k=1}^{n′} Gk. Since J and J′ are disjoint, the sequence i1, . . . , in, i′1, . . . , i′n′
consists of distinct indices. As (Fi)_{i∈I} is independent, we then obtain

P(F ∩ G) = P((∩_{k=1}^n Fk) ∩ (∩_{k=1}^{n′} Gk)) = (Π_{k=1}^n P(Fk)) (Π_{k=1}^{n′} P(Gk)) = P(F)P(G).

Therefore, Lemma 1.3.2 shows that G and G′ are independent, as desired.

Before ending the section, we show some useful results where independence is involved.

Definition 1.3.9. Let (Xn) be a sequence of random variables. The tail σ-algebra of (Xn)
is defined as the σ-algebra ∩_{n=1}^∞ σ(Xn, Xn+1, . . .).

Colloquially speaking, the tail σ-algebra of (Xn ) consists of events which only depend on the
tail properties of (Xn ). For example, as we will see shortly, the set where (Xn ) is convergent
is an element in the tail σ-algebra.

Theorem 1.3.10 (Kolmogorov’s zero-one law). Let (Xn ) be a sequence of independent vari-
ables. Let J be the tail σ-algebra of (Xn ). For each F ∈ J , it holds that either P (F ) = 0 or
P (F ) = 1.

Proof. Let F ∈ J and define D = {G ∈ F | P(G ∩ F) = P(G)P(F)}, the family of sets in F
independent of F. We claim that D contains σ(X1, X2, . . .). To prove this, we use Dynkin's
lemma. We first show that D is a Dynkin class. Clearly, Ω ∈ D. If A, B ∈ D with A ⊆ B,
we obtain

P((B \ A) ∩ F) = P(B ∩ A^c ∩ F) = P((B ∩ F) ∩ (A ∩ F)^c)
= P(B ∩ F) − P(A ∩ F) = P(B)P(F) − P(A)P(F) = P(B \ A)P(F),

so B \ A ∈ D as well. And if (Bn) is an increasing sequence in D, we obtain

P((∪_{n=1}^∞ Bn) ∩ F) = P(∪_{n=1}^∞ (Bn ∩ F)) = lim_{n→∞} P(Bn ∩ F)
= lim_{n→∞} P(Bn)P(F) = P(∪_{n=1}^∞ Bn)P(F),

proving that ∪_{n=1}^∞ Bn ∈ D. We have now shown that D is a Dynkin class. Now fix n ≥ 2.
As F ∈ J, it holds that F ∈ σ(Xn, Xn+1, . . .). Since the sequence (Xn) is independent,
Lemma 1.3.8 shows that σ(Xn, Xn+1, . . .) is independent of σ(X1, . . . , Xn−1). Therefore,
σ(X1, . . . , Xn−1) ⊆ D for all n ≥ 2. As the family ∪_{n=1}^∞ σ(X1, . . . , Xn) is a generating family
for σ(X1, X2, . . .) which is stable under finite intersections, Dynkin's lemma allows us to
conclude σ(X1, X2, . . .) ⊆ D. From this, we obtain J ⊆ D, so F ∈ D. Thus, for any F ∈ J,
it holds that P(F) = P(F ∩ F) = P(F)², yielding that P(F) = 0 or P(F) = 1.

Example 1.3.11. Let (Xn) be a sequence of independent variables. Recalling Lemma 1.2.14,
we have for any k ≥ 1 that

((Xn)_{n≥1} is convergent) = ((Xn)_{n≥1} is Cauchy) = ((Xn)_{n≥k} is Cauchy)
= ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{i=n}^∞ (|X_{k+n−1} − X_{k+i−1}| ≤ 1/m),

which is in σ(Xk, Xk+1, . . .). As k was arbitrary, we find that ((Xn)_{n≥1} is convergent) is in
the tail σ-algebra of (Xn). Thus, Theorem 1.3.10 allows us to conclude that the probability
of (Xn) being convergent is either zero or one. ◦

Combining Theorem 1.3.10 and Lemma 1.2.11, we obtain the following useful result.

Lemma 1.3.12 (Second Borel-Cantelli). Let (Fn) be a sequence of independent events. Then
P(Fn i.o.) is either zero or one, and the probability is zero if and only if Σ_{n=1}^∞ P(Fn) is finite.

Proof. Let J be the tail σ-algebra of the sequence (1_{Fn}) of variables; Theorem 1.3.10 then
shows that J only contains sets of probability zero or one. Note that for any m ≥ 1, we
have (Fn i.o.) = ∩_{n=1}^∞ ∪_{k=n}^∞ Fk = ∩_{n=m}^∞ ∪_{k=n}^∞ Fk, so (Fn i.o.) is in J. Hence, Theorem 1.3.10
shows that P(Fn i.o.) is either zero or one.

As regards the criterion for the probability to be zero, note that from Lemma 1.2.11, we
know that if Σ_{n=1}^∞ P(Fn) is finite, then P(Fn i.o.) = 0. We need to show the converse,
namely that if P(Fn i.o.) = 0, then Σ_{n=1}^∞ P(Fn) is finite. This is equivalent to showing that
if Σ_{n=1}^∞ P(Fn) is infinite, then P(Fn i.o.) ≠ 0. And to prove this, it suffices to show that if
Σ_{n=1}^∞ P(Fn) is infinite, then P(Fn i.o.) = 1.

Assume that Σ_{n=1}^∞ P(Fn) is infinite. As it holds that (Fn i.o.)^c = (Fn^c evt.), it suffices to
show P(Fn^c evt.) = 0. To do so, we note that since the sequence (Fn) is independent, Lemma
1.3.7 shows that the sequence (Fn^c) is independent as well. Therefore,

P(Fn^c evt.) = P(∪_{n=1}^∞ ∩_{k=n}^∞ Fk^c) = lim_{n→∞} P(∩_{k=n}^∞ Fk^c)
= lim_{n→∞} lim_{i→∞} P(∩_{k=n}^i Fk^c) = lim_{n→∞} lim_{i→∞} Π_{k=n}^i P(Fk^c) = lim_{n→∞} Π_{k=n}^∞ P(Fk^c),

since the sequence (∩_{k=n}^i Fk^c)_{i≥1} is decreasing. Next, note that for x ≥ 0, we have

−x = ∫_0^x (−1) dy ≤ ∫_0^x (−exp(−y)) dy = ∫_0^x (d/dy) exp(−y) dy = exp(−x) − 1,

which implies 1 − x ≤ exp(−x). This allows us to conclude

lim_{n→∞} Π_{k=n}^∞ P(Fk^c) = lim_{n→∞} Π_{k=n}^∞ (1 − P(Fk)) ≤ lim_{n→∞} Π_{k=n}^∞ exp(−P(Fk))
= lim_{n→∞} exp(−Σ_{k=n}^∞ P(Fk)) = 0,

finally yielding P(Fn^c evt.) = 0 and so P(Fn i.o.) = 1, as desired.
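
The divergent case of Lemma 1.3.12 can be seen numerically through an exact finite-window computation (the model below is our own illustration, not from the text). For independent Fn with P(Fn) = 1/n, the telescoping product gives P(no event in {N+1, . . . , 4N}) = Π_{n=N+1}^{4N} (1 − 1/n) = N/(4N) = 1/4, so a window arbitrarily far out still contains an event with probability 3/4, in line with P(Fn i.o.) = 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def window_hit_fraction(N, paths=4000):
    # Independent events F_n with P(F_n) = 1/n, restricted to the window
    # {N+1, ..., 4N}. By telescoping, P(no event in the window) = N/(4N),
    # so at least one event occurs with probability 3/4 for every N.
    n = np.arange(N + 1, 4 * N + 1)
    events = rng.random((paths, n.size)) < 1.0 / n
    return events.any(axis=1).mean()

f_small, f_large = window_hit_fraction(10), window_hit_fraction(1000)
print(f_small, f_large)   # both near 0.75
```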

1.4 Convergence of sums of independent variables

In this section, we consider a sequence of independent variables (Xn) and investigate when
the sum Σ_{k=1}^n Xk converges as n tends to infinity. During the course of this section, we will
encounter sequences (xn) such that Σ_{k=1}^n xk converges, while Σ_{k=1}^n |xk| may not converge,
that is, series which are convergent but not absolutely convergent. In such cases, Σ_{k=1}^∞ xk
is not always well-defined. However, for notational convenience, we will apply the following
convention: for a sequence (xn), we say that Σ_{k=1}^∞ xk converges when lim_{n→∞} Σ_{k=1}^n xk
exists, and say that Σ_{k=1}^∞ xk diverges when lim_{n→∞} Σ_{k=1}^n xk does not exist; in the
latter case, Σ_{k=1}^∞ xk is undefined. With these conventions, we can say that in this section
we seek to understand when Σ_{n=1}^∞ Xn converges for a sequence (Xn) of independent variables.

Our first result is an example of a maximal inequality, that is, an inequality which yields
bounds on the distribution of a maximum of random variables. We will use this result to
prove a sufficient criterion for a sum of variables to converge almost surely and in L2. Note
that in the following, just as we write EX for the expectation of a random variable X, we
write V X for the variance of X.

Theorem 1.4.1 (Kolmogorov's maximal inequality). Let (Xk)_{1≤k≤n} be a finite sequence of
independent random variables with mean zero and finite variance. It then holds that

P(max_{1≤k≤n} |Σ_{i=1}^k Xi| ≥ ε) ≤ (1/ε²) Σ_{k=1}^n V Xk.

Proof. Define Sk = Σ_{i=1}^k Xi; we may then state the desired inequality as

P(max_{1≤k≤n} |Sk| ≥ ε) ≤ ε^{−2} V Sn.  (1.13)
22 Sequences of random variables

Let T = min{1 ≤ k ≤ n | |Sk| ≥ ε}, with the convention that the minimum of the empty
set is ∞. Colloquially speaking, T is the first time where the sequence (Sk)_{1≤k≤n} takes an
absolute value equal to or greater than ε. Note that T takes its values in {1, . . . , n} ∪ {∞}.
And for each k ≤ n, it holds that (T ≤ k) = ∪_{i=1}^k (|Si| ≥ ε), so in particular T is measurable.
Now, (max_{1≤k≤n} |Sk| ≥ ε) = ∪_{k=1}^n (|Sk| ≥ ε) = (T ≤ n). Also, whenever T is finite, it holds
that |S_T| ≥ ε, so that 1 ≤ ε^{−2} S_T². Therefore, we obtain

P(max_{1≤k≤n} |Sk| ≥ ε) = P(T ≤ n) = E 1_{(T≤n)} ≤ ε^{−2} E S_T² 1_{(T≤n)}
= ε^{−2} E S_{T∧n}² 1_{(T≤n)} ≤ ε^{−2} E S_{T∧n}² = ε^{−2} E (Σ_{k=1}^n Xk 1_{(T≥k)})².  (1.14)

Expanding the square, we obtain

E (Σ_{k=1}^n Xk 1_{(T≥k)})² = Σ_{k=1}^n E Xk² 1_{(T≥k)} + 2 Σ_{k=1}^{n−1} Σ_{i=k+1}^n E Xk Xi 1_{(T≥k)} 1_{(T≥i)}
≤ Σ_{k=1}^n E Xk² + 2 Σ_{k=1}^{n−1} Σ_{i=k+1}^n E Xk Xi 1_{(T≥k)} 1_{(T≥i)}.  (1.15)

Now, as (T ≥ k) = (T > k − 1) = (T ≤ k − 1)^c = ∩_{i=1}^{k−1} (|Si| ≥ ε)^c for any 2 ≤ k ≤ n,
we find that (T ≥ k) is σ(X1, . . . , Xk−1) measurable. In particular, for 1 ≤ k ≤ n − 1 and
k + 1 ≤ i ≤ n, we obtain that Xk, (T ≥ k) and (T ≥ i) are all σ(X1, . . . , Xi−1) measurable.
As σ(X1, . . . , Xi−1) is independent of σ(Xi), this allows us to conclude

E Xk Xi 1_{(T≥k)} 1_{(T≥i)} = E(Xi) E Xk 1_{(T≥k)} 1_{(T≥i)} = 0,  (1.16)

since Xi has mean zero. Collecting our conclusions from (1.14), (1.15) and (1.16), we obtain
P(max_{1≤k≤n} |Sk| ≥ ε) ≤ ε^{−2} Σ_{k=1}^n E Xk² = ε^{−2} Σ_{k=1}^n V Xk = ε^{−2} V Sn, as desired.
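
The bound is easy to test by simulation (our own sketch; the uniform distribution and the constants are arbitrary choices, not from the text). For i.i.d. uniforms on [−1, 1] we have V Xk = 1/3, so the inequality reads P(max_{1≤k≤n} |Sk| ≥ ε) ≤ n/(3ε²).

```python
import numpy as np

rng = np.random.default_rng(3)

# n i.i.d. variables, uniform on [-1, 1]: mean zero and V X_k = 1/3.
n, paths, eps = 50, 10_000, 6.0
X = rng.uniform(-1.0, 1.0, size=(paths, n))
S = np.cumsum(X, axis=1)                      # partial sums S_1, ..., S_n

lhs = (np.abs(S).max(axis=1) >= eps).mean()   # empirical P(max_k |S_k| >= eps)
rhs = n * (1.0 / 3.0) / eps**2                # the bound V S_n / eps^2
print(lhs, rhs)
```

With these constants the Kolmogorov bound equals 50/108 ≈ 0.46, and the empirical probability stays strictly between zero and the bound.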

Theorem 1.4.2 (Khinchin-Kolmogorov convergence theorem). Let (Xn) be a sequence of
independent variables with mean zero and finite variances. If it holds that Σ_{n=1}^∞ V Xn is
finite, then Σ_{n=1}^∞ Xn converges almost surely and in L2.

Proof. For any sequence (xn) in R, it holds that (xn) is Cauchy if and only if for each m ≥ 1,
there is n ≥ 1 such that whenever k ≥ n + 1, it holds that |xk − xn| < 1/m. Put Sn = Σ_{k=1}^n Xk.
We show that Sn is almost surely convergent. We have

P(Sn is convergent) = P(Sn is Cauchy)
= P(∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n+1}^∞ (|Sk − Sn| ≤ 1/m))
= P(∩_{m=1}^∞ ∪_{n=1}^∞ (sup_{k≥n+1} |Sk − Sn| ≤ 1/m)).
1.4 Convergence of sums of independent variables 23

As the intersection of a countable family of almost sure sets again is an almost sure set,
we find that in order to show almost sure convergence of Sn, it suffices to show that for
each m ≥ 1, ∪_{n=1}^∞ (sup_{k≥n+1} |Sk − Sn| ≤ 1/m) is an almost sure set. However, we have
P(∪_{n=1}^∞ (sup_{k≥n+1} |Sk − Sn| ≤ 1/m)) ≥ P(sup_{k≥i+1} |Sk − Si| ≤ 1/m) for all i ≥ 1, yielding
P(∪_{n=1}^∞ (sup_{k≥n+1} |Sk − Sn| ≤ 1/m)) ≥ lim inf_{n→∞} P(sup_{k≥n+1} |Sk − Sn| ≤ 1/m). Combining
our conclusions, we find that in order to show the desired almost sure convergence of Sn, it
suffices to show lim_{n→∞} P(sup_{k≥n+1} |Sk − Sn| ≤ 1/m) = 1 for all m ≥ 1, which is equivalent
to showing

lim_{n→∞} P(sup_{k≥n+1} |Sk − Sn| > 1/m) = 0  (1.17)

for all m ≥ 1. We wish to apply Theorem 1.4.1 to show (1.17). To do so, we first note that

P(sup_{k≥n+1} |Sk − Sn| > 1/m) = P(∪_{k=n+1}^∞ (max_{n+1≤i≤k} |Si − Sn| > 1/m))
= lim_{k→∞} P(max_{n+1≤i≤k} |Si − Sn| > 1/m),  (1.18)

since the sequence (max_{n+1≤i≤k} |Si − Sn| > 1/m)_{k≥n+1} is increasing in k. Applying Theorem
1.4.1 to the independent variables Xn+1, . . . , Xk with mean zero, we find, for k ≥ n + 1,

P(max_{n+1≤i≤k} |Si − Sn| > 1/m) = P(max_{n+1≤i≤k} |Σ_{j=n+1}^i Xj| > 1/m) ≤ (1/m)^{−2} Σ_{i=n+1}^k V Xi.

Therefore, recalling (1.18) and using independence, we conclude

P(sup_{k≥n+1} |Sk − Sn| > 1/m) ≤ lim_{k→∞} (1/m)^{−2} Σ_{i=n+1}^k V Xi = (1/m)^{−2} Σ_{i=n+1}^∞ V Xi.

As the series Σ_{n=1}^∞ V Xn is assumed convergent, the tail sums converge to zero and we finally
obtain lim_{n→∞} P(sup_{k≥n+1} |Sk − Sn| > 1/m) = 0, which is precisely (1.17). Thus, by our
previous deliberations, we may now conclude that Sn is almost surely convergent. It remains
to prove convergence in L2. Let S∞ be the almost sure limit of Sn; we will show that Sn
also converges in L2 to S∞. By an application of Fatou's lemma, we get

E(Sn − S∞)² = E lim inf_{k→∞} (Sn − Sk)²
≤ lim inf_{k→∞} E(Sn − Sk)² = lim inf_{k→∞} E(Σ_{i=n+1}^k Xi)².  (1.19)
Recalling that the sequence (Xn) consists of independent variables with mean zero, we obtain

E(Σ_{i=n+1}^k Xi)² = E Σ_{i=n+1}^k Σ_{j=n+1}^k Xi Xj = Σ_{i=n+1}^k E Xi² = Σ_{i=n+1}^k V Xi.  (1.20)

Combining (1.19) and (1.20), we get E(Sn − S∞)² ≤ Σ_{i=n+1}^∞ V Xi. As the series is convergent,
the tail sums converge to zero, so we conclude lim_{n→∞} E(Sn − S∞)² = 0. This proves
convergence in L2 and so completes the proof.
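
Theorem 1.4.2 applies, for example, to Xn = ξn/n with independent fair signs ξn = ±1: here V Xn = 1/n² sums to π²/6, so Σ Xn converges almost surely and in L2 even though Σ 1/n diverges. The sketch below (our own illustration; sample sizes are arbitrary) estimates the L2 distance between two distant partial sums, which the proof bounds by a tail sum of variances.

```python
import numpy as np

rng = np.random.default_rng(4)

# X_n = xi_n / n with independent fair signs xi_n = +-1: E X_n = 0 and
# V X_n = 1/n^2, so sum_n V X_n < infinity and Theorem 1.4.2 applies.
n_max, paths = 5000, 2000
signs = rng.choice([-1.0, 1.0], size=(paths, n_max))
X = signs / np.arange(1, n_max + 1)
S = np.cumsum(X, axis=1)                       # partial sums along each path

# E(S_5000 - S_1000)^2 should match the variance tail sum (roughly 0.0008):
tail_l2 = np.mean((S[:, -1] - S[:, 999]) ** 2)
exact = np.sum(1.0 / np.arange(1001, 5001) ** 2)
print(tail_l2, exact)
```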

Theorem 1.4.3 (Kolmogorov's three-series theorem). Let (Xn) be a sequence of independent
variables. Let ε > 0. Then Σ_{n=1}^∞ Xn converges almost surely if the following three series are
convergent:

Σ_{n=1}^∞ P(|Xn| > ε),   Σ_{n=1}^∞ E Xn 1_{(|Xn|≤ε)}   and   Σ_{n=1}^∞ V Xn 1_{(|Xn|≤ε)}.

Proof. First note that as Σ_{n=1}^∞ P(|Xn| > ε) is finite, we have P(|Xn| > ε i.o.) = 0 by Lemma
1.2.11, which allows us to conclude P(|Xn| ≤ ε evt.) = P((|Xn| > ε i.o.)^c) = 1. Thus, almost
surely, the sequences (Xn) and (Xn 1_{(|Xn|≤ε)}) are equal from a point onwards. Therefore,
Σ_{k=1}^n Xk converges almost surely if and only if Σ_{k=1}^n Xk 1_{(|Xk|≤ε)} converges almost surely,
so in order to prove the theorem, it suffices to show that Σ_{k=1}^n Xk 1_{(|Xk|≤ε)} converges almost
surely. To this end, define Yn = Xn 1_{(|Xn|≤ε)} − E(Xn 1_{(|Xn|≤ε)}). As the sequence (Xn) is
independent, so is the sequence (Yn). Also, Yn has mean zero and finite variance, and by
our assumptions, Σ_{n=1}^∞ V Yn is finite. Therefore, by Theorem 1.4.2, it holds that Σ_{k=1}^n Yk
converges almost surely as n tends to infinity. Thus, Σ_{k=1}^n (Xk 1_{(|Xk|≤ε)} − E Xk 1_{(|Xk|≤ε)}) and
Σ_{k=1}^n E Xk 1_{(|Xk|≤ε)} both converge almost surely, allowing us to conclude that Σ_{k=1}^n Xk 1_{(|Xk|≤ε)}
converges almost surely. This completes the proof.

1.5 The strong law of large numbers

In this section, we prove the strong law of large numbers, a key result in modern probability
theory. Let (Xn) be a sequence of independent, identically distributed integrable variables
with mean μ. Intuitively speaking, we would expect that (1/n) Σ_{k=1}^n Xk in some sense converges
to μ. The strong law of large numbers shows that this is indeed the case, and that the
convergence is almost sure. In order to demonstrate the result, we first show two lemmas
which will help us to prove the general statement by proving a simpler statement. Both
lemmas consider the case of nonnegative variables. Lemma 1.5.1 establishes that in order to
prove (1/n) Σ_{k=1}^n Xk → μ almost surely, it suffices to prove (1/n) Σ_{k=1}^n Xk 1_{(Xk≤k)} → μ almost surely,
reducing to the case of bounded variables. Lemma 1.5.2 establishes that in order to prove
(1/n) Σ_{k=1}^n Xk 1_{(Xk≤k)} → μ almost surely, it suffices to prove
lim_{k→∞} (1/nk) Σ_{i=1}^{nk} (Xi 1_{(Xi≤i)} − E Xi 1_{(Xi≤i)}) = 0 for particular subsequences
(nk)_{k≥1}, reducing to a subsequence, and allowing us to focus our attention on bounded
variables with mean zero.

Lemma 1.5.1. Let (Xn) be a sequence of independent, identically distributed variables with
common mean μ. Assume that Xn ≥ 0 for all n ≥ 1. Then (1/n) Σ_{k=1}^n Xk converges almost
surely if and only if (1/n) Σ_{k=1}^n Xk 1_{(Xk≤k)} converges almost surely, and in the affirmative, the
limits are the same.

Proof. Let ν denote the common distribution of the Xn. Applying Tonelli's theorem, we find

Σ_{n=1}^∞ P(Xn ≠ Xn 1_{(Xn≤n)}) = Σ_{n=1}^∞ P(Xn > n) = Σ_{n=1}^∞ ∫ 1_{(x>n)} dν(x)
= Σ_{n=1}^∞ Σ_{k=n}^∞ ∫ 1_{(k<x≤k+1)} dν(x) = Σ_{k=1}^∞ Σ_{n=1}^k ∫ 1_{(k<x≤k+1)} dν(x)
= Σ_{k=1}^∞ ∫ k 1_{(k<x≤k+1)} dν(x) ≤ Σ_{k=1}^∞ ∫ x 1_{(k<x≤k+1)} dν(x)
≤ ∫_0^∞ x dν(x) = μ.  (1.21)

Thus, Σ_{n=1}^∞ P(Xn ≠ Xn 1_{(Xn≤n)}) is finite, and so Lemma 1.2.11 allows us to conclude that
P(Xn ≠ Xn 1_{(Xn≤n)} i.o.) = 0, which then implies that P(Xn = Xn 1_{(Xn≤n)} evt.) = 1. Hence,
almost surely, Xn and Xn 1_{(Xn≤n)} are equal from a point N onwards, where N is stochastic.
For n ≥ N, we therefore have

(1/n) Σ_{k=1}^n (Xk − Xk 1_{(Xk≤k)}) = (1/n) Σ_{k=1}^N (Xk − Xk 1_{(Xk≤k)}),

and by rearrangement, this yields that almost surely, for n ≥ N,

(1/n) Σ_{k=1}^n Xk = (1/n) Σ_{k=1}^n Xk 1_{(Xk≤k)} + (1/n) Σ_{k=1}^N (Xk − Xk 1_{(Xk≤k)}).

As the last term on the right-hand side tends almost surely to zero, the conclusion of the
lemma follows.

Lemma 1.5.2. Let (Xn) be a sequence of independent, identically distributed variables with
common mean μ. Assume that for all n ≥ 1, Xn ≥ 0. For α > 1, define nk = [α^k], with [α^k]
denoting the largest integer which is less than or equal to α^k. If it holds for all α > 1 that

lim_{k→∞} (1/nk) Σ_{i=1}^{nk} (Xi 1_{(Xi≤i)} − E Xi 1_{(Xi≤i)}) = 0

almost surely, then (1/n) Σ_{k=1}^n Xk 1_{(Xk≤k)} converges to μ almost surely.

Proof. First note that as α > 1, we have nk = [α^k] ≤ [α^{k+1}] = nk+1. Therefore, (nk) is
increasing. Also, as [α^k] > α^k − 1, nk tends to infinity as k tends to infinity. Define a
sequence (Yn) by putting Yn = Xn 1_{(Xn≤n)}. Our assumption is then that for all α > 1, it
holds that lim_{k→∞} (1/nk) Σ_{i=1}^{nk} (Yi − E Yi) = 0 almost surely, and our objective is to demonstrate
that lim_{n→∞} (1/n) Σ_{k=1}^n Yk = μ almost surely. Let ν be the common distribution of the Xn.
Note that by the dominated convergence theorem,

lim_{n→∞} E Yn = lim_{n→∞} E Xn 1_{(Xn≤n)} = lim_{n→∞} ∫_0^∞ 1_{(x≤n)} x dν(x) = ∫_0^∞ x dν(x) = μ.

As convergence of a sequence implies convergence of the averages, this allows us to conclude
that lim_{n→∞} (1/n) Σ_{k=1}^n E Yk = μ as well. And as convergence of a sequence implies convergence
of any subsequence, we obtain lim_{k→∞} (1/nk) Σ_{i=1}^{nk} E Yi = μ from this. Therefore, we have

lim_{k→∞} (1/nk) Σ_{i=1}^{nk} Yi = lim_{k→∞} (1/nk) Σ_{i=1}^{nk} (Yi − E Yi) + lim_{k→∞} (1/nk) Σ_{i=1}^{nk} E Yi = μ,

1
Pn
almost surely. We will use this to prove that n k=1 Yk converges to µ. To do so, first note
that since αn − 1 < [αn ] ≤ αn , it holds that

αn+1 − 1 nk+1 αn+1 α


α − α−n = ≤ ≤ = ,
αn nk αn − 1 1 − α−n

from which it follows that limk→∞ nnk+1


k
= α. Now fix m ≥ 1 and define a sequence (k(m))m≥1
by putting k(m) = sup{i ≥ 1 | ni ≤ m}. As the sequence ({i ≥ 1 | ni ≤ m})m≥1 is increasing
in m, k(m) is increasing as well. And as nk(m)+1 > m, we find that k(m) tends to infinity as
m tends to infinity. Finally, nk(m) ≤ m ≤ nk(m)+1 , by the properties of the supremum. As
Yi ≥ 0, we thus find that

nk(m) m nk(m)+1
1 X 1 X 1 X
Yi ≤ Yi ≤ Yi .
nk(m)+1 i=1
m i=1 nk(m) i=1
1.5 The strong law of large numbers 27

We therefore obtain, using that k(m) tends to infinity as m tends to infinity,


nk(m) nk(m)
1 nk(m) 1 X 1 X
µ = lim inf Yi = lim inf Yi
α m→∞ nk(m)+1 nk(m) i=1 m→∞ nk(m)+1 i=1
m m nk(m)+1
1 X 1 X 1 X
≤ lim inf Yi ≤ lim sup Yi ≤ lim sup Yi
m→∞ m m→∞ m m→∞ n k(m)
i=1 i=1 i=1
nk(m)+1
nk(m)+1 1 X
≤ lim sup Yi = αµ.
m→∞ nk(m) nk(m)+1 i=1

In conclusion, we have now shown for all α > 1 that


m m
1 1 X 1 X
µ ≤ lim inf Yi ≤ lim sup Yi ≤ αµ.
α m→∞ m i=1 m→∞ m
i=1

Letting α tend to one strictly from above, we obtain


m m
1 X 1 X
lim inf Yi = lim sup Yi = µ,
m→∞ m i=1 m→∞ m
i=1

1
Pm
almost surely, proving that m i=1 Yi is almost surely convergent with limit µ. This con-
cludes the proof of the lemma.
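The geometric block sequence $n_k = [\alpha^k]$ driving the proof can be inspected directly. The small sketch below (the values of $\alpha$ are arbitrary illustrative choices of ours) checks that $(n_k)$ is increasing and that $n_{k+1}/n_k$ approaches $\alpha$, the fact used in the sandwich argument above.

```python
import math

def blocks(alpha, kmax):
    # n_k = [alpha^k], the largest integer less than or equal to alpha^k.
    return [math.floor(alpha ** k) for k in range(1, kmax + 1)]

for alpha, kmax in ((1.1, 200), (2.0, 50)):
    n = blocks(alpha, kmax)
    assert all(a <= b for a, b in zip(n, n[1:]))  # (n_k) is increasing
    ratio = n[-1] / n[-2]                         # n_{k+1}/n_k for large k
    print(alpha, ratio)
```

The ratio printed for each $\alpha$ is already very close to $\alpha$ itself, in line with $\lim_{k\to\infty} n_{k+1}/n_k = \alpha$.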

Theorem 1.5.3 (The strong law of large numbers). Let $(X_n)$ be a sequence of independent, identically distributed variables with common mean $\mu$. It then holds that $\frac{1}{n}\sum_{k=1}^{n} X_k$ converges almost surely to $\mu$.

Proof. We first consider the case where $X_n \ge 0$ for all $n \ge 1$. Combining Lemma 1.5.1 and Lemma 1.5.2, we find that in order to prove the result, it suffices to show that

$$\lim_{k\to\infty}\frac{1}{n_k}\sum_{i=1}^{n_k}(Y_i - EY_i) = 0 \qquad (1.22)$$

almost surely, where $Y_n = X_n 1_{(X_n\le n)}$, $n_k = [\alpha^k]$ and $\alpha > 1$. In order to do so, by Lemma 1.2.12, it suffices to show that for any $\varepsilon > 0$, $\sum_{k=1}^{\infty} P(|\frac{1}{n_k}\sum_{i=1}^{n_k}(Y_i - EY_i)| \ge \varepsilon)$ is finite. Using Lemma 1.2.7, we obtain

$$\sum_{k=1}^{\infty} P\left(\left|\frac{1}{n_k}\sum_{i=1}^{n_k}(Y_i - EY_i)\right| \ge \varepsilon\right) \le \frac{1}{\varepsilon^2}\sum_{k=1}^{\infty} E\left(\frac{1}{n_k}\sum_{i=1}^{n_k}(Y_i - EY_i)\right)^{2} = \frac{1}{\varepsilon^2}\sum_{k=1}^{\infty}\frac{1}{n_k^2}\sum_{i=1}^{n_k} V Y_i. \qquad (1.23)$$

Now, as all terms in the above are nonnegative, we may apply Tonelli's theorem to obtain

$$\sum_{k=1}^{\infty}\frac{1}{n_k^2}\sum_{i=1}^{n_k} V Y_i = \sum_{k=1}^{\infty}\sum_{i=1}^{\infty} 1_{(i\le n_k)}\frac{1}{n_k^2} V Y_i = \sum_{i=1}^{\infty}\sum_{k=1}^{\infty} 1_{(i\le n_k)}\frac{1}{n_k^2} V Y_i = \sum_{i=1}^{\infty} V Y_i \sum_{k:n_k\ge i}\frac{1}{n_k^2}. \qquad (1.24)$$

We wish to identify a bound for the inner sum as a function of $i$. To this end, note that for $x \ge 2$, we have $[x] \ge x - 1 \ge \frac{x}{2}$, and for $1 \le x < 2$, $[x] = 1 \ge \frac{x}{2}$ as well. Thus, for all $x \ge 1$, we have $[x] \ge \frac{x}{2}$. Letting $m_i = \inf\{k \ge 1 \mid n_k \ge i\}$, we then have

$$\sum_{k:n_k\ge i}\frac{1}{n_k^2} = \sum_{k=m_i}^{\infty}\frac{1}{[\alpha^k]^2} \le \sum_{k=m_i}^{\infty}\frac{1}{(\alpha^k/2)^2} = 4\sum_{k=m_i}^{\infty}\alpha^{-2k} = \frac{4\alpha^{-2m_i}}{1-\alpha^{-2}},$$

where we have applied the formula for summing a geometric series. Noting that $m_i$ satisfies $\alpha^{m_i} \ge [\alpha^{m_i}] = n_{m_i} \ge i$, we obtain $\alpha^{-2m_i} = (\alpha^{-m_i})^2 \le i^{-2}$, resulting in the estimate $\sum_{k:n_k\ge i}\frac{1}{n_k^2} \le 4(1-\alpha^{-2})^{-1} i^{-2}$. Combining this with (1.23) and (1.24), we find that in order to show almost sure convergence of $\frac{1}{n_k}\sum_{i=1}^{n_k}(Y_i - EY_i)$ to zero, it suffices to show that $\sum_{i=1}^{\infty}\frac{1}{i^2} V Y_i$ is finite. To this end, let $\nu$ denote the common distribution of the $X_n$. We then apply Tonelli's theorem to obtain

$$\sum_{i=1}^{\infty}\frac{1}{i^2} V Y_i \le \sum_{i=1}^{\infty}\frac{1}{i^2} E X_i^2 1_{(X_i\le i)} = \sum_{i=1}^{\infty}\frac{1}{i^2}\sum_{j=1}^{i} E X_i^2 1_{(j-1<X_i\le j)}$$
$$= \sum_{i=1}^{\infty}\sum_{j=1}^{i}\frac{1}{i^2}\int_{j-1}^{j} x^2 \,\mathrm{d}\nu(x) = \sum_{j=1}^{\infty}\int_{j-1}^{j} x^2 \,\mathrm{d}\nu(x)\sum_{i=j}^{\infty}\frac{1}{i^2}. \qquad (1.25)$$

Now, for $j \ge 2$ it holds that $j + 2 \le 2j$, leading to $j \le 2(j-1)$ and therefore

$$\sum_{i=j}^{\infty}\frac{1}{i^2} \le \sum_{i=j}^{\infty}\frac{1}{i(i-1)} = \sum_{i=j}^{\infty}\left(\frac{1}{i-1}-\frac{1}{i}\right) = \frac{1}{j-1} \le \frac{2}{j}, \qquad (1.26)$$

and the corresponding inequality for $j = 1$ follows as $\sum_{i=1}^{\infty}\frac{1}{i^2} = 1 + \sum_{i=2}^{\infty}\frac{1}{i^2} \le 1 + 1 = 2$. Combining (1.25) and (1.26), we obtain

$$\sum_{i=1}^{\infty}\frac{1}{i^2} V Y_i \le \sum_{j=1}^{\infty}\frac{2}{j}\int_{j-1}^{j} x^2 \,\mathrm{d}\nu(x) \le 2\sum_{j=1}^{\infty}\int_{j-1}^{j} x \,\mathrm{d}\nu(x) = 2\int_0^{\infty} x \,\mathrm{d}\nu(x),$$

which is finite, since the $X_i$ have finite mean. We have now shown that $\sum_{i=1}^{\infty}\frac{1}{i^2} V Y_i$ is convergent, and therefore we may now conclude that $\frac{1}{n_k}\sum_{i=1}^{n_k}(Y_i - EY_i)$ converges to zero almost surely, proving (1.22). Lemma 1.5.1 and Lemma 1.5.2 now yield that $\frac{1}{n}\sum_{k=1}^{n} X_k$ converges almost surely to $\mu$.

It remains to extend the result to the case where $X_n$ is not assumed to be nonnegative. Therefore, we now let $(X_n)$ be any sequence of independent, identically distributed variables with mean $\mu$. With $x^+ = \max\{0, x\}$ and $x^- = \max\{0, -x\}$, Lemma 1.3.7 shows that the sequences $(X_n^+)$ and $(X_n^-)$ each are independent and identically distributed with finite means $\int x^+ \,\mathrm{d}\nu(x)$ and $\int x^- \,\mathrm{d}\nu(x)$, and both sequences consist only of nonnegative variables. Therefore, from what we have already shown, we obtain

$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} X_k = \lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} X_k^+ - \lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} X_k^- = \int x^+ \,\mathrm{d}\nu(x) - \int x^- \,\mathrm{d}\nu(x) = \mu,$$

as desired.
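Theorem 1.5.3 is easy to illustrate by simulation. In the sketch below (ours; the exponential distribution, its mean, and the sample sizes are arbitrary illustrative choices), the running averages of i.i.d. draws settle near the common mean.

```python
import random

random.seed(1)

mu = 2.0           # true mean of the exponential distribution used here
n = 200_000
total = 0.0
running = []
for k in range(1, n + 1):
    total += random.expovariate(1.0 / mu)  # i.i.d. draws with mean mu
    if k % 50_000 == 0:
        running.append(total / k)          # empirical mean after k draws

print(running)
```

The successive entries of `running` hover around 2, consistent with almost sure convergence of the empirical means to $\mu$.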

In fact, the convergence in Theorem 1.5.3 holds not only almost surely, but also in L1 . In
Chapter 2, we will obtain this as a consequence of a more general convergence theorem. Before
concluding the chapter, we give an example of a simple statistical application of Theorem
1.5.3.

Example 1.5.4. Consider a measurable space $(\Omega, \mathcal F)$ endowed with a sequence of random variables $(X_n)$. Assume given a parameter set $\Theta$ and a set of probability measures $(P_\theta)_{\theta\in\Theta}$ such that for the probability space $(\Omega, \mathcal F, P_\theta)$, $(X_n)$ consists of independent and identically distributed variables with second moment. Assume further that the mean is $\xi_\theta$ and that the variance is $\sigma_\theta^2$. Natural estimators of the mean and variance parameter functions based on $n$ samples are then

$$\hat\xi_n = \frac{1}{n}\sum_{k=1}^{n} X_k \qquad\text{and}\qquad \hat\sigma_n^2 = \frac{1}{n}\sum_{k=1}^{n}\left(X_k - \frac{1}{n}\sum_{i=1}^{n} X_i\right)^{2}.$$

Theorem 1.5.3 allows us to conclude that under $P_\theta$, it holds that $\hat\xi_n \xrightarrow{a.s.} \xi_\theta$, and furthermore, by two further applications of Theorem 1.5.3, as well as Lemma 1.2.6 and Lemma 1.2.10, we obtain

$$\hat\sigma_n^2 = \frac{1}{n}\sum_{k=1}^{n} X_k^2 - \frac{2}{n}\sum_{k=1}^{n} X_k\,\frac{1}{n}\sum_{i=1}^{n} X_i + \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^{2} = \frac{1}{n}\sum_{k=1}^{n} X_k^2 - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^{2} \xrightarrow{a.s.} \sigma_\theta^2.$$

The strong law of large numbers thus allows us to conclude that the natural estimators $\hat\xi_n$ and $\hat\sigma_n^2$ converge almost surely to the true mean and variance. ◦
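The example can be mimicked in a short simulation. The sketch below uses a normal model with hypothetical parameter values of our choosing; it computes $\hat\xi_n$ and $\hat\sigma_n^2$ and checks that they are close to the true mean and variance.

```python
import random

random.seed(2)

xi, sigma2 = 1.5, 4.0   # hypothetical true mean and variance
n = 100_000
xs = [random.gauss(xi, sigma2 ** 0.5) for _ in range(n)]

xi_hat = sum(xs) / n
# The variance estimator from Example 1.5.4: the average squared
# deviation from the empirical mean (no n/(n-1) correction).
sigma2_hat = sum((x - xi_hat) ** 2 for x in xs) / n

print(xi_hat, sigma2_hat)
```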

1.6 Exercises

Exercise 1.1. Let X be a random variable, and let (an ) be a sequence of real numbers
converging to zero. Define Xn = an X. Show that Xn converges almost surely to zero. ◦

Exercise 1.2. Give an example of a sequence of random variables (Xn ) such that (Xn )
converges in probability but does not converge almost surely to any variable. Give an example
of a sequence of random variables (Xn ) such that (Xn ) converges in probability but does not
converge in L1 to any variable. ◦

Exercise 1.3. Let (Xn ) be a sequence of random variables such that Xn is Poisson dis-
tributed with parameter 1/n. Show that Xn converges in L1 to zero. ◦

Exercise 1.4. Let (Xn ) be a sequence of random variables such that Xn is Gamma dis-
tributed with shape parameter n2 and scale parameter 1/n. Show that Xn does not converge
in L1 to any integrable variable. ◦

Exercise 1.5. Consider a probability space $(\Omega, \mathcal F, P)$ such that $\Omega$ is countable and such that $\mathcal F$ is the power set of $\Omega$. Let $(X_n)$ be a sequence of random variables on $(\Omega, \mathcal F, P)$, and let $X$ be another variable. Show that if $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{a.s.} X$. ◦

Exercise 1.6. Let $(X_n)$ be a sequence of random variables and let $X$ be another variable. Let $(F_k)$ be a sequence of sets in $\mathcal F$. Assume that for all $k$, $X_n 1_{F_k} \xrightarrow{P} X 1_{F_k}$ as $n$ tends to infinity, and assume that $\lim_{k\to\infty} P(F_k^c) = 0$. Show that $X_n \xrightarrow{P} X$. ◦

Exercise 1.7. Let (Xn ) be a sequence of random variables and let X be another variable.
Let (εk )k≥1 be a sequence of nonnegative real numbers converging to zero. Show that Xn
converges in probability to X if and only if limn→∞ P (|Xn − X| ≥ εk ) = 0 for all k ≥ 1. ◦

Exercise 1.8. Let $(X_n)$ be a sequence of random variables, and let $X$ be some other variable. Show that $X_n \xrightarrow{a.s.} X$ if and only if $\sup_{k\ge n} |X_k - X| \xrightarrow{P} 0$. ◦

Exercise 1.9. Let $X$ and $Y$ be two variables, and define

$$d(X, Y) = E\,\frac{|X - Y|}{1 + |X - Y|}.$$

Show that $d$ is a pseudometric on the space of real stochastic variables, in the sense that $d(X,Y) \le d(X,Z) + d(Z,Y)$, $d(X,Y) = d(Y,X)$ and $d(X,X) = 0$ for all $X$, $Y$ and $Z$. Show that $d(X,Y) = 0$ if and only if $X$ and $Y$ are almost surely equal. Let $(X_n)$ be a sequence of random variables and let $X$ be some other variable. Show that $X_n \xrightarrow{P} X$ if and only if $\lim_{n\to\infty} d(X_n, X) = 0$. ◦

Exercise 1.10. Let $(X_n)$ be a sequence of random variables. Show that there exists a sequence of positive constants $(c_n)$ such that $c_n X_n \xrightarrow{a.s.} 0$. ◦

Exercise 1.11. Let $(X_n)$ be a sequence of i.i.d. variables with mean zero. Assume that $X_n$ has fourth moment. Show that for all $\varepsilon > 0$,

$$\sum_{n=1}^{\infty} P\left(\left|\frac{1}{n}\sum_{k=1}^{n} X_k\right| \ge \varepsilon\right) \le \frac{4 E X_1^4}{\varepsilon^4}\sum_{n=1}^{\infty}\frac{1}{n^2}.$$

Use this to prove the following result: For a sequence $(X_n)$ of i.i.d. variables with fourth moment and mean $\mu$, it holds that $\frac{1}{n}\sum_{k=1}^{n} X_k \xrightarrow{a.s.} \mu$. ◦

Exercise 1.12. Let $(X_n)$ be a sequence of random variables and let $X$ be some other variable. Assume that there is $p > 1$ such that $\sup_{n\ge1} E|X_n|^p$ is finite. Show that if $X_n \xrightarrow{P} X$, then $E|X|^p$ is finite and $X_n \xrightarrow{\mathcal L^q} X$ for $1 \le q < p$. ◦

Exercise 1.13. Let $(X_n)$ be a sequence of random variables, and let $X$ be some other variable. Assume that $X_n \xrightarrow{a.s.} X$. Show that for all $\varepsilon > 0$, there exists $F \in \mathcal F$ with $P(F^c) \le \varepsilon$ such that

$$\lim_{n\to\infty}\sup_{\omega\in F} |X_n(\omega) - X(\omega)| = 0,$$

corresponding to $X_n$ converging uniformly to $X$ on $F$. ◦

Exercise 1.14. Let $(X_n)$ be a sequence of random variables. Let $X$ be some other variable. Let $p > 0$. Show that if $\sum_{n=1}^{\infty} E|X_n - X|^p$ is finite, then $X_n \xrightarrow{a.s.} X$. ◦

Exercise 1.15. Let $(X_n)$ be a sequence of random variables, and let $X$ be some other variable. Assume that almost surely, the sequence $(X_n)$ is increasing. Show that if $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{a.s.} X$. ◦

Exercise 1.16. Let $(X_n)$ be a sequence of random variables and let $(\varepsilon_n)$ be a sequence of nonnegative constants. Show that if $\sum_{n=1}^{\infty} P(|X_{n+1} - X_n| \ge \varepsilon_n)$ and $\sum_{n=1}^{\infty} \varepsilon_n$ are finite, then $(X_n)$ converges almost surely to some random variable. ◦

Exercise 1.17. Let $(U_n)$ be a sequence of i.i.d. variables with common distribution being the uniform distribution on the unit interval. Define $X_n = \max\{U_1, \ldots, U_n\}$. Show that $X_n$ converges to 1 almost surely and in $\mathcal L^p$ for $p \ge 1$. ◦
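For intuition on Exercise 1.17 (an illustration of ours, not a solution): since $P(X_n \le 1 - \varepsilon) = (1-\varepsilon)^n$ decays geometrically, the running maximum approaches 1 quickly.

```python
import random

random.seed(3)

x = 0.0
snapshots = {}
for n in range(1, 10_001):
    x = max(x, random.random())        # X_n = max(U_1, ..., U_n)
    if n in (10, 100, 1000, 10_000):
        snapshots[n] = x               # record the running maximum

print(snapshots)
```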

Exercise 1.18. Let $(X_n)$ be a sequence of random variables. Show that if there exists $c > 0$ such that $\sum_{n=1}^{\infty} P(X_n > c)$ is finite, then $\sup_{n\ge1} X_n$ is almost surely finite. ◦

Exercise 1.19. Let $(X_n)$ be a sequence of independent random variables. Show that if $\sup_{n\ge1} X_n$ is almost surely finite, there exists $c > 0$ such that $\sum_{n=1}^{\infty} P(X_n > c)$ is finite. ◦

Exercise 1.20. Let (Xn ) be a sequence of i.i.d. random variables with common distribution
being the standard exponential distribution. Calculate P (Xn / log n > c i.o.) for all c > 0
and use the result to show that lim supn→∞ Xn / log n = 1 almost surely. ◦
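A numerical companion to Exercise 1.20 (illustration only, not a solution; the thresholds and the sample size are arbitrary choices of ours): since $P(X_n > c\log n) = n^{-c}$ for standard exponentials, a Borel–Cantelli argument separates $c < 1$, where exceedances keep occurring, from $c > 1$, where only finitely many occur.

```python
import math
import random

random.seed(4)

n_max = 100_000
counts = {0.5: 0, 2.0: 0}
for n in range(2, n_max + 1):
    x = random.expovariate(1.0)        # standard exponential draw
    for c in counts:
        if x > c * math.log(n):        # the event (X_n / log n > c)
            counts[c] += 1

print(counts)
```

The count for $c = 0.5$ grows roughly like $\sum n^{-1/2}$, while the count for $c = 2$ stays tiny, in line with $\sum n^{-2} < \infty$.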

Exercise 1.21. Let (Xn ) be a sequence of random variables, and let J be the corresponding
tail-σ-algebra. Let B ∈ B. Show that (Xn ∈ B i.o.) and (Xn ∈ B evt.) are in J . ◦

Exercise 1.22. Let $(X_n)$ be a sequence of random variables, and let $\mathcal J$ be the corresponding tail-σ-algebra. Let $B \in \mathcal B$ and let $(a_n)$ be a sequence of real numbers. Show that if $\lim_{n\to\infty} a_n = 0$, then $(\lim_{n\to\infty}\sum_{k=1}^{n} a_{n-k+1} X_k \in B)$ is in $\mathcal J$. ◦

Exercise 1.23. Let $(X_n)$ be a sequence of independent random variables concentrated on $\{0, 1\}$ with $P(X_n = 1) = p_n$. Show that $X_n \xrightarrow{P} 0$ if and only if $\lim_{n\to\infty} p_n = 0$, and show that $X_n \xrightarrow{a.s.} 0$ if and only if $\sum_{n=1}^{\infty} p_n$ is finite. ◦

Exercise 1.24. Let $(X_n)$ be a sequence of nonnegative random variables. Show that if $\sum_{n=1}^{\infty} E X_n$ is finite, then $\sum_{k=1}^{n} X_k$ is almost surely convergent. ◦

Exercise 1.25. Let $(X_n)$ be a sequence of i.i.d. random variables such that $P(X_n = 1)$ and $P(X_n = -1)$ both are equal to $\frac12$. Let $(a_n)$ be a sequence of real numbers. Show that the sequence $\sum_{k=1}^{n} a_k X_k$ either is almost surely divergent or almost surely convergent. Show that the sequence is almost surely convergent if $\sum_{n=1}^{\infty} a_n^2$ is finite. ◦

Exercise 1.26. Give an example of a sequence $(X_n)$ of independent variables with first moment such that $\sum_{k=1}^{n} X_k$ converges almost surely while $\sum_{k=1}^{n} E X_k$ diverges. ◦

Exercise 1.27. Let $(X_n)$ be a sequence of independent random variables with $E X_n = 0$. Assume that $\sum_{n=1}^{\infty} E(X_n^2 1_{(|X_n|\le1)} + |X_n| 1_{(|X_n|>1)})$ is finite. Show that $\sum_{k=1}^{n} X_k$ is almost surely convergent. ◦

Exercise 1.28. Let $(X_n)$ be a sequence of independent and identically distributed random variables. Show that $E|X_1|$ is finite if and only if $P(|X_n| > n\ \text{i.o.}) = 0$. ◦

Exercise 1.29. Let $(X_n)$ be a sequence of independent and identically distributed random variables. Assume that there is $c$ such that $\frac{1}{n}\sum_{k=1}^{n} X_k \xrightarrow{a.s.} c$. Show that $E|X_1|$ is finite and that $E X_1 = c$. ◦
Chapter 2

Ergodicity and stationarity

In Section 1.5, we proved the strong law of large numbers, which shows that for a sequence
(Xn ) of integrable, independent and identically distributed variables, the empirical means
converge almost surely to the true mean. A reasonable question is whether such a result may
be extended to more general cases. Consider a sequence (Xn ) where each Xn has the same
distribution ν with mean µ. If the dependence between the variables is sufficiently weak, we
may hope that the empirical means still converge to the true mean.

One fruitful case of sufficiently weak dependence turns out to be embedded in the notion
of a stationary stochastic process. The notion of stationarity is connected with the notion
of measure-preserving mappings. Our plan for this chapter is as follows. In Section 2.1, we
investigate measure-preserving mappings, in particular proving the ergodic theorem, which
is a type of law of large numbers. Section 2.2 investigates sufficient criteria for the ergodic
theorem to hold. Finally, in Section 2.3, we apply our results to stationary processes and
prove versions of the law of large numbers for such processes.

2.1 Measure preservation, invariance and ergodicity

As in the previous chapter, we work in the context of a probability space $(\Omega, \mathcal F, P)$. Our main interest of this section will be a particular type of measurable mapping $T : \Omega \to \Omega$. Recall that for such a mapping $T$, the image measure $T(P)$ is the measure on $\mathcal F$ defined by requiring that for any $F \in \mathcal F$, $T(P)(F) = P(T^{-1}(F))$.

Definition 2.1.1. Let $T : \Omega \to \Omega$ be measurable. We say that $T$ is $P$-measure preserving, or measure preserving for $P$, or simply measure preserving, if the image measure $T(P)$ is equal to $P$.

Another way to state Definition 2.1.1 is thus that T is measure preserving precisely when
P (T −1 (F )) = P (F ) for all F ∈ F.
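As a concrete example (ours, not from the text): the doubling map $T(x) = 2x \bmod 1$ on the unit interval preserves Lebesgue measure, since the preimage of an interval is a union of two intervals of half the length. A minimal Monte Carlo check of $P(T^{-1}(F)) = P(F)$:

```python
import random

random.seed(5)

def T(x):
    return (2.0 * x) % 1.0   # doubling map on [0, 1)

n = 200_000
us = [random.random() for _ in range(n)]   # draws from Lebesgue measure on [0, 1)

a, b = 0.2, 0.7                            # the event F = [0.2, 0.7), P(F) = 0.5
p_F = sum(a <= u < b for u in us) / n
p_preimage = sum(a <= T(u) < b for u in us) / n  # estimates P(T^{-1}(F))

print(p_F, p_preimage)
```

Both estimates come out near 0.5, as measure preservation predicts.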

Definition 2.1.2. Let $T : \Omega \to \Omega$ be measurable. The $T$-invariant σ-algebra, or simply the invariant σ-algebra, is defined by $\mathcal I_T = \{F \in \mathcal F \mid T^{-1}(F) = F\}$.

As the operation of taking the preimage T −1 (F ) is stable under complements and countable
unions, the set family IT in Definition 2.1.2 is in fact a σ-algebra.

Definition 2.1.3. Let $T : \Omega \to \Omega$ be measurable and measure preserving. The mapping $T$ is said to be $P$-ergodic, or to be ergodic for $P$, or simply ergodic, if $P(F)$ is either zero or one for all $F \in \mathcal I_T$.

We have now introduced three concepts: measure preservation of a mapping T , the invariant
σ-algebra for a mapping T and ergodicity for a mapping T . These will be the main objects
of study for this section. Before proceeding, we introduce a final auxiliary concept. Recall
that ◦ denotes function composition, in the sense that if T : Ω → Ω and X : Ω → R, X ◦ T
denotes the mapping from Ω to R defined by (X ◦ T )(ω) = X(T (ω)).

Definition 2.1.4. Let $T : \Omega \to \Omega$ be measurable, and let $X$ be a random variable. $X$ is said to be $T$-invariant, or simply invariant, if $X \circ T = X$.

We are now ready to begin preparations for the main result of this section, the ergodic theorem. Note that for $T : \Omega \to \Omega$, it is sensible to consider $T \circ T$, denoted $T^2$, which is defined by $(T \circ T)(\omega) = T(T(\omega))$, and more generally, $T^n$ for some $n \ge 1$. In the following, $T$ denotes some measurable mapping from $\Omega$ to $\Omega$. The ergodic theorem states that if $T$ is measure preserving and ergodic, it holds for any variable $X$ with $p$'th moment, $p \ge 1$, that the average $\frac{1}{n}\sum_{k=1}^{n} X \circ T^{k-1}$ converges almost surely and in $\mathcal L^p$ to the mean $EX$. In order to show the result, we first need a few lemmas.

Lemma 2.1.5. Let $X$ be a random variable. It holds that $X$ is invariant if and only if $X$ is $\mathcal I_T$ measurable.

Proof. First assume that $X$ is invariant, and consider $A \in \mathcal B$. We need to prove that $(X \in A)$ is in $\mathcal I_T$, which is equivalent to showing $T^{-1}(X \in A) = (X \in A)$. To obtain this, we simply note that as $X \circ T = X$,

$$T^{-1}(X \in A) = \{\omega \in \Omega \mid T(\omega) \in (X \in A)\} = \{\omega \in \Omega \mid X(T(\omega)) \in A\} = (X \circ T \in A) = (X \in A).$$

Thus, $X$ is $\mathcal I_T$ measurable. Next, assume that $X$ is $\mathcal I_T$ measurable; we wish to demonstrate that $X$ is invariant. Fix some $x \in \mathbb R$. As $\{x\} \in \mathcal B$, we have $(X = x) \in \mathcal I_T$, yielding $(X = x) = T^{-1}(X = x) = (X \circ T = x)$. Next, fix $\omega \in \Omega$. We wish to show that $X(\omega) = (X \circ T)(\omega)$. From what we just proved, it holds that $X(T(\omega)) = x$ if and only if $X(\omega) = x$. In particular, $X(T(\omega)) = X(\omega)$ if and only if $X(\omega) = X(\omega)$, and the latter is trivially true. Thus, $(X \circ T)(\omega) = X(\omega)$, so $X \circ T = X$. Hence, $X$ is invariant.

Lemma 2.1.6. Let $T$ be $P$-measure preserving. Let $X$ be an integrable random variable. Define $S_n = \sum_{k=1}^{n} X \circ T^{k-1}$. It then holds that $E X 1_{(\sup_{n\ge1}\frac{1}{n}S_n > 0)} \ge 0$.

Proof. Fix $n$ and define $M_n = \max\{0, S_1, \ldots, S_n\}$. Note that $\sup_{n\ge1}\frac{1}{n}S_n > 0$ if and only if there exists $n$ such that $\frac{1}{n}S_n > 0$, which is the case if and only if there exists $n$ such that $M_n > 0$. As the sequence of sets $((M_n > 0))_{n\ge1}$ is increasing, the dominated convergence theorem then shows that

$$E X 1_{(\sup_{n\ge1}\frac{1}{n}S_n > 0)} = E X 1_{\cup_{n=1}^{\infty}(M_n>0)} = E\lim_{n\to\infty} X 1_{(M_n>0)} = \lim_{n\to\infty} E X 1_{(M_n>0)},$$

and so it suffices to prove that $E X 1_{(M_n>0)} \ge 0$ for each $n$. To do so, fix $n$. Note that as $T$ is measure preserving, so is $T^n$ for all $n$. As $M_n$ is nonnegative, we then have

$$0 \le E M_n \le E\sum_{i=1}^{n} |S_i| \le E\sum_{i=1}^{n}\sum_{k=1}^{i} |X| \circ T^{k-1} = \sum_{i=1}^{n}\sum_{k=1}^{i} E|X| = \frac{n(n+1)}{2} E|X|,$$

which shows that $M_n$ is integrable. As $E(M_n \circ T)1_{(M_n>0)} \le E(M_n \circ T) = E M_n$ by the measure preservation property of $T$, $(M_n \circ T)1_{(M_n>0)}$ is also integrable, and we have

$$E X 1_{(M_n>0)} = E(X + M_n \circ T)1_{(M_n>0)} - E(M_n \circ T)1_{(M_n>0)}$$
$$\ge E(X + M_n \circ T)1_{(M_n>0)} - E M_n = E(X + M_n \circ T)1_{(M_n>0)} - E M_n 1_{(M_n>0)}.$$

Therefore, it suffices to show that $(X + M_n \circ T)1_{(M_n>0)} \ge M_n 1_{(M_n>0)}$. To do so, note that for $1 \le k \le n-1$, it holds that $X + M_n \circ T \ge X + S_k \circ T = X + \sum_{i=1}^{k} X \circ T^{i} = S_{k+1}$, and also $X + M_n \circ T \ge X = S_1$. Therefore, $X + M_n \circ T \ge \max\{S_1, \ldots, S_n\}$. From this, it follows that $(X + M_n \circ T)1_{(M_n>0)} \ge \max\{S_1, \ldots, S_n\}1_{(M_n>0)} = M_n 1_{(M_n>0)}$, as desired.

Theorem 2.1.7 (Birkhoff-Khinchin ergodic theorem). Let $p \ge 1$, let $X$ be a variable with $p$'th moment and let $T$ be a mapping which is measure preserving and ergodic. It then holds that $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} X \circ T^{k-1} = EX$ almost surely and in $\mathcal L^p$.

Pn
Proof. We first consider the case where X has mean zero. Define Sn = k=1 X ◦ T k−1 . We
need to show that limn→∞ n1 Sn = 0 almost surely and in Lp . Put Y = lim supn→∞ n1 Sn , we
will show that almost surely, Y ≤ 0. If we can obtain this, a symmetry argument will then
allow us to obtain the desired conclusion. In order to prove that Y ≤ 0 almost surely, we
first take ε > 0 and show that P (Y > ε) = 0. To this end, we begin by noting that
n
1 1X 1
Y ◦ T = lim sup (Sn ◦ T ) = lim sup X ◦ T k = lim sup (Sn+1 − X) = Y,
n→∞ n n→∞ n n→∞ n
k=1

so Y is T -invariant. Therefore, by Lemma 2.1.5, (Y > ε) is in IT . As T is ergodic, it therefore


holds that P (Y > ε) either is zero or one, our objective is to show that the probability in fact
is zero. We will obtain this by applying Lemma 2.1.6 to a suitably chosen random variable.
Pn
Define X 0 : Ω → R by X 0 = (X − ε)1(Y >ε) and put Sn0 = k=1 X 0 ◦ T k−1 , we will eventually
apply Lemma 2.1.6 to X 0 . First note that
n
X n
X
Sn0 = X 0 ◦ T k−1 = (X − ε)1(Y >ε) ◦ T k−1
k=1 k=1
Xn n
X
k−1
= X1(Y >ε) ◦ T −ε 1(Y >ε) ◦ T k−1
k=1 k=1
n
!
X
k−1
= 1(Y >ε) X ◦T − nε = 1(Y >ε) (Sn − nε),
k=1

allowing us to conclude

(Y > ε) = (Y > ε) ∩ (supn≥1 n1 Sn > ε) = (Y > ε) ∩ ∪∞ 1


n=1 ( n Sn > ε)

= ∪∞ ∞ 0
n=1 (Y > ε) ∩ (Sn − nε > 0) = ∪n=1 (Sn > 0)

= ∪∞ 1 0 1 0
n=1 ( n Sn > 0) = (supn≥1 n Sn > 0). (2.1)

This relates the event (Y > ε) to the sequence (Sn0 ). Applying Lemma 2.1.6 and recalling
(2.1), we obtain E1(Y >ε) X 0 ≥ 0, which implies

εP (Y > ε) ≤ E1(Y >ε) X. (2.2)

Finally, recall that by ergodicity of T , P (Y > ε) is either zero or one. If P (Y > ε) is one,
(2.2) yields ε ≤ 0, a contradiction. Therefore, we must have that P (Y > ε) is zero. We now
use this to complete the proof of almost sure convergence. As P (Y > ε) is zero for all ε > 0,
2.1 Measure preservation, invariance and ergodicity 39

we conclude that P (Y > 0) = P (∪∞


n=1 (Y >
1
n )) = 0, so lim supn→∞ n1 Sn = Y ≤ 0 almost
surely. Next note that
n n
1 1X 1X
− lim inf Sn = − lim inf X ◦ T k−1 = lim sup (−X) ◦ T k−1 ,
n→∞ n n→∞ n n→∞ n
k=1 k=1

so applying the same result with −X instead of X, we also obtain − lim inf n→∞ n1 Sn ≤ 0
almost surely. All in all, this shows that 0 ≤ lim inf n→∞ n1 Sn ≤ lim supn→∞ n1 Sn ≤ 0
almost surely, so limn→∞ n1 Sn = 0 almost surely, as desired. Finally, considering the case
where EX is nonzero, we may use our previous result with the variable X − EX to obtain
Pn Pn
limn→∞ n1 k=1 X ◦ T k−1 = limn→∞ n1 k=1 (X − EX) ◦ T k−1 + EX = EX, completing the
proof of almost sure convergence in the general case.

It remains to show convergence in $\mathcal L^p$, meaning that we wish to prove the convergence of $E|EX - \frac{1}{n}\sum_{k=1}^{n} X \circ T^{k-1}|^p$ to zero as $n$ tends to infinity. With $\|\cdot\|_p$ denoting the seminorm on $\mathcal L^p(\Omega, \mathcal F, P)$ given by $\|X\|_p = (E|X|^p)^{1/p}$, we will show that $\|EX - \frac{1}{n}S_n\|_p$ tends to zero. To this end, we fix $m \ge 1$ and define $X' = X 1_{(|X|\le m)}$ and $S_n' = \sum_{k=1}^{n} X' \circ T^{k-1}$. By the triangle inequality, we obtain

$$\|EX - \tfrac{1}{n}S_n\|_p \le \|EX - EX'\|_p + \|EX' - \tfrac{1}{n}S_n'\|_p + \|\tfrac{1}{n}S_n' - \tfrac{1}{n}S_n\|_p.$$

We consider each of the three terms on the right-hand side. For the first term, it holds that $\|EX - EX'\|_p = |EX - EX'| = |E X 1_{(|X|>m)}| \le E|X|1_{(|X|>m)}$. As for the second term, the results already proven show that $\frac{1}{n}S_n'$ converges almost surely to $EX'$. As $|\frac{1}{n}S_n'| = |\frac{1}{n}\sum_{k=1}^{n} X' \circ T^{k-1}| \le m$, the dominated convergence theorem allows us to conclude $\lim_{n\to\infty} E|EX' - \frac{1}{n}S_n'|^p = E\lim_{n\to\infty} |EX' - \frac{1}{n}S_n'|^p = 0$, which implies that $\lim_{n\to\infty}\|EX' - \frac{1}{n}S_n'\|_p = 0$. Finally, we may apply the triangle inequality and the measure preservation property of $T$ to obtain

$$\left\|\tfrac{1}{n}S_n' - \tfrac{1}{n}S_n\right\|_p = \left\|\frac{1}{n}\sum_{k=1}^{n} X' \circ T^{k-1} - \frac{1}{n}\sum_{k=1}^{n} X \circ T^{k-1}\right\|_p \le \frac{1}{n}\sum_{k=1}^{n} \|X \circ T^{k-1} - X' \circ T^{k-1}\|_p$$
$$= \frac{1}{n}\sum_{k=1}^{n}\left(\int |(X - X') \circ T^{k-1}|^p \,\mathrm{d}P\right)^{1/p} = \frac{1}{n}\sum_{k=1}^{n}\left(\int |X - X'|^p \,\mathrm{d}P\right)^{1/p} = \|X - X'\|_p = \left(E|X|^p 1_{(|X|>m)}\right)^{1/p}.$$

Combining these observations, we obtain

$$\limsup_{n\to\infty}\|EX - \tfrac{1}{n}S_n\|_p \le \limsup_{n\to\infty}\left(\|EX - EX'\|_p + \|EX' - \tfrac{1}{n}S_n'\|_p + \|\tfrac{1}{n}S_n' - \tfrac{1}{n}S_n\|_p\right) \le E|X|1_{(|X|>m)} + \left(E|X|^p 1_{(|X|>m)}\right)^{1/p}. \qquad (2.3)$$

By the dominated convergence theorem, both of these terms tend to zero as $m$ tends to infinity. As the bound in (2.3) holds for all $m$, we conclude $\limsup_{n\to\infty}\|EX - \frac{1}{n}S_n\|_p = 0$, which yields convergence in $\mathcal L^p$.

Theorem 2.1.7 shows that for any variable $X$ with $p$'th moment and any measure preserving and ergodic transformation $T$, a version of the strong law of large numbers holds for the process $(X \circ T^{k-1})_{k\ge1}$ in the sense that $\frac{1}{n}\sum_{k=1}^{n} X \circ T^{k-1}$ converges almost surely and in $\mathcal L^p$ to $EX$. Note that in this case, the measure preservation property of $T$ shows that $X$ and $X \circ T^{k-1}$ have the same distribution for all $k \ge 1$. Therefore, Theorem 2.1.7 is a type of law of large numbers for processes of identical, but not necessarily independent, variables.
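A classical instance (stated here without proof; the angle and the starting point below are arbitrary choices of ours) is the rotation $T(x) = x + \vartheta \bmod 1$ of the unit interval with Lebesgue measure, which is measure preserving and, for irrational $\vartheta$, ergodic. The sketch checks the conclusion of Theorem 2.1.7 for $X(x) = x^2$, whose mean is $\int_0^1 x^2\,\mathrm{d}x = 1/3$, along a single orbit.

```python
import math

theta = math.sqrt(2.0) - 1.0   # an irrational rotation angle

def T(x):
    return (x + theta) % 1.0   # rotation of the unit interval

x = 0.1                        # arbitrary starting point
n = 200_000
total = 0.0
for _ in range(n):
    total += x * x             # accumulates X(T^{k-1}(x)) with X(x) = x^2
    x = T(x)

avg = total / n
print(avg)                     # should be close to E X = 1/3
```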

2.2 Criteria for measure preservation and ergodicity

To apply Theorem 2.1.7, we need to be able to show measure preservation and ergodicity. In
this section, we prove some sufficient criteria which will help make this possible in practical
cases. Throughout this section, T denotes a measurable mapping from Ω to Ω.

First, we consider a simple lemma showing that in order to prove that T is measure preserv-
ing, it suffices to check the claim only for a generating family which is stable under finite
intersections.

Lemma 2.2.1. Let H be a generating family for F which is stable under finite intersections.
If P (T −1 (F )) = P (F ) for all F ∈ H, then T is P -measure preserving.

Proof. As both P and T (P ) are probability measures, this follows from the uniqueness the-
orem for probability measures.

Next, we consider the somewhat more involved problem of showing that a measure preserving
mapping is ergodic. A simple first result is the following.

Theorem 2.2.2. Let T be measure preserving. Then T is ergodic if and only if every
invariant random variable is constant almost surely.

Proof. First assume that $T$ is ergodic. Let $X$ be an invariant random variable. By Lemma 2.1.5, $X$ is $\mathcal I_T$ measurable, so in particular $(X \le x) \in \mathcal I_T$ for all $x \in \mathbb R$. As $T$ is ergodic, all events in $\mathcal I_T$ have probability zero or one, so we find that $P(X \le x)$ is zero or one for all $x \in \mathbb R$.

We claim that this implies that X is constant almost surely. To this end, we define c by
putting c = sup{x ∈ R | P (X ≤ x) = 0}. As we cannot have P (X ≤ x) = 1 for all x ∈ R,
{x ∈ R | P (X ≤ x) = 0} is nonempty, so c is not minus infinity. And as we cannot have
P (X ≤ x) = 0 for all x ∈ R, {x ∈ R | P (X ≤ x) = 0} is not all of R. As x 7→ P (X ≤ x)
is increasing, this implies that {x ∈ R | P (X ≤ x) = 0} is bounded from above, so c is not
infinity. Thus, c is finite.

Now, by definition, $c$ is the least upper bound of the set $\{x \in \mathbb R \mid P(X \le x) = 0\}$. Therefore, any number strictly smaller than $c$ is not an upper bound. From this we conclude that for $n \ge 1$, there is $c_n$ with $c - \frac{1}{n} < c_n$ such that $P(X \le c_n) = 0$. Therefore, we must also have $P(X \le c - \frac{1}{n}) \le P(X \le c_n) = 0$, and so $P(X < c) = \lim_{n\to\infty} P(X \le c - \frac{1}{n}) = 0$. On the other hand, as $c$ is an upper bound for the set $\{x \in \mathbb R \mid P(X \le x) = 0\}$, it holds for any $\varepsilon > 0$ that $P(X \le c + \varepsilon) \neq 0$, yielding that for all $\varepsilon > 0$, $P(X \le c + \varepsilon) = 1$. Therefore, $P(X \le c) = \lim_{n\to\infty} P(X \le c + \frac{1}{n}) = 1$. All in all, we conclude $P(X = c) = 1$, so $X$ is constant almost surely. This proves the first implication of the theorem.

Next, assume that every invariant random variable is constant almost surely, we wish to
prove that T is ergodic. Let F ∈ IT , we have to show that P (F ) is either zero or one. Note
that 1F is IT measurable and so invariant by Lemma 2.1.5. Therefore, by our assumption,
1F is almost surely constant, and this implies that P (F ) is either zero or one. This proves
the other implication and so concludes the proof.

Theorem 2.2.2 is occasionally useful if the T -invariant random variables are easy to charac-
terize. The following theorem shows a different avenue for proving ergodicity based on a sort
of asymptotic independence criterion.

Theorem 2.2.3. Let $T$ be $P$-measure preserving. $T$ is ergodic if and only if it holds for all $F, G \in \mathcal F$ that $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(F \cap T^{-(k-1)}(G)) = P(F)P(G)$.

Proof. First assume that $T$ is ergodic. Fix $F, G \in \mathcal F$. Applying Theorem 2.1.7 with the integrable variable $1_G$, and noting that $1_G \circ T^{k-1} = 1_{T^{-(k-1)}(G)}$, we find that $\frac{1}{n}\sum_{k=1}^{n} 1_{T^{-(k-1)}(G)}$ converges almost surely to $P(G)$. Therefore, $\frac{1}{n}\sum_{k=1}^{n} 1_F 1_{T^{-(k-1)}(G)}$ converges almost surely to $1_F P(G)$. As this sequence of variables is bounded, the dominated convergence theorem yields

$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(F \cap T^{-(k-1)}(G)) = \lim_{n\to\infty} E\,\frac{1}{n}\sum_{k=1}^{n} 1_F 1_{T^{-(k-1)}(G)} = E\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} 1_F 1_{T^{-(k-1)}(G)} = E 1_F P(G) = P(F)P(G),$$

proving the first implication. Next, we consider the other implication. Assume that for all $F, G \in \mathcal F$, $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(F \cap T^{-(k-1)}(G)) = P(F)P(G)$. We wish to show that $T$ is ergodic. Let $F \in \mathcal I_T$. As $T^{-(k-1)}(F) = F$ for all $k$, we then obtain

$$P(F) = \lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(F \cap T^{-(k-1)}(F)) = P(F)^2,$$

so that $P(F)$ is either zero or one, and thus $T$ is ergodic.

Definition 2.2.4. If $\lim_{n\to\infty} P(F \cap T^{-n}(G)) = P(F)P(G)$ for all $F, G \in \mathcal F$, we say that $T$ is mixing. If $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} |P(F \cap T^{-(k-1)}(G)) - P(F)P(G)| = 0$ for all $F, G \in \mathcal F$, we say that $T$ is weakly mixing.

Theorem 2.2.5. Let $T$ be measure preserving. If $T$ is mixing, then $T$ is weakly mixing. If $T$ is weakly mixing, then $T$ is ergodic.

Proof. First assume that $T$ is mixing. Let $F, G \in \mathcal F$. As $T$ is mixing, we find that $\lim_{n\to\infty} P(F \cap T^{-n}(G)) = P(F)P(G)$, and so $\lim_{n\to\infty} |P(F \cap T^{-n}(G)) - P(F)P(G)| = 0$. As convergence of a sequence implies convergence of the averages, this implies that we have $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} |P(F \cap T^{-(k-1)}(G)) - P(F)P(G)| = 0$, so $T$ is weakly mixing. Next, assume that $T$ is weakly mixing; we wish to show that $T$ is ergodic. Let $F, G \in \mathcal F$. As $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} |P(F \cap T^{-(k-1)}(G)) - P(F)P(G)| = 0$, we also obtain

$$\limsup_{n\to\infty}\left|\frac{1}{n}\sum_{k=1}^{n} P(F \cap T^{-(k-1)}(G)) - P(F)P(G)\right| \le \limsup_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} |P(F \cap T^{-(k-1)}(G)) - P(F)P(G)|,$$

which is zero, so $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(F \cap T^{-(k-1)}(G)) = P(F)P(G)$, and Theorem 2.2.3 shows that $T$ is ergodic. This proves the theorem.

Lemma 2.2.6. Let $T$ be measure preserving, and let $\mathcal H$ be a generating family for $\mathcal F$ which is stable under finite intersections. Assume that one of the following holds:

(1). $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} P(F \cap T^{-(k-1)}(G)) = P(F)P(G)$ for all $F, G \in \mathcal H$.

(2). $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} |P(F \cap T^{-(k-1)}(G)) - P(F)P(G)| = 0$ for all $F, G \in \mathcal H$.

(3). $\lim_{n\to\infty} P(F \cap T^{-n}(G)) = P(F)P(G)$ for all $F, G \in \mathcal H$.

Then, the corresponding statement also holds for all $F, G \in \mathcal F$.

Proof. The proofs for the three cases are similar, so we only argue that the third claim holds.
Fix F ∈ H and define
n o
D = G ∈ F lim P (F ∩ T −n (G)) = P (F )P (G) .
n→∞

We wish to argue that D is a Dynkin class. To this end, note that since T −1 (Ω) = Ω, it holds
that Ω ∈ D. Take A, B ∈ D with A ⊆ B. We then also have T −n (A) ⊆ T −n (B), yielding

lim P (F ∩ T −n (B \ A)) = lim P (F ∩ (T −n (B) \ T −n (A)))


n→∞ n→∞

= lim P (F ∩ T −n (B)) − lim P (F ∩ T −n (A))


n→∞ n→∞

= P (F )P (B) − P (F )P (A) = P (F )P (B \ A),

and so B \ A ∈ D. Finally, let (An ) be an increasing sequence in D and let A = ∪∞n=1 An . As


limm→∞ P (Am ) = P (A), we obtain limm→∞ P (A \ Am ) = 0. Pick ε > 0 and let m be such
that for i ≥ m, P (A \ Ai ) ≤ ε. Note that T −n (Ai ) ⊆ T −n (A). As T is measure preserving,
we obtain for all n ≥ 1 and i ≥ m that

0 ≤ P (F ∩ T −n (A)) − P (F ∩ T −n (Ai ))
= P (F ∩ (T −n (A) \ T −n (Ai ))) = P (F ∩ (T −n (A \ Ai )))
≤ P (T −n (A \ Ai )) = P (A \ Ai ) ≤ ε.

From this we find that for all n ≥ 1 and i ≥ m,

P (F ∩ T −n (Ai )) − ε ≤ P (F ∩ T −n (A)) ≤ P (F ∩ T −n (Ai )) + ε,

and therefore, for i ≥ m,

P (F )P (Ai ) − ε = lim P (F ∩ T −n (Ai )) − ε ≤ lim inf P (F ∩ T −n (A))


n→∞ n→∞
−n
≤ lim sup P (F ∩ T (A)) ≤ lim P (F ∩ T −n (Ai )) + ε
n→∞ n→∞

= P (F )P (Ai ) + ε. (2.4)

As (2.4) holds for all i ≥ m, we in particular conclude that

P (F )P (A) − ε = lim_{i→∞} P (F )P (Ai ) − ε ≤ lim inf_{n→∞} P (F ∩ T^{-n}(A))
               ≤ lim sup_{n→∞} P (F ∩ T^{-n}(A)) ≤ lim_{i→∞} P (F )P (Ai ) + ε
               = P (F )P (A) + ε.                                                 (2.5)

And as ε > 0 is arbitrary in (2.5), we conclude lim_{n→∞} P (F ∩ T^{-n}(A)) = P (F )P (A), so that
A ∈ D. Thus, D is a Dynkin class. By assumption, D contains H, and therefore, by Dynkin’s
lemma, F = σ(H) ⊆ D. This proves lim_{n→∞} P (F ∩ T^{-n}(G)) = P (F )P (G) when F ∈ H and
G ∈ F. Next, we extend this to all F ∈ F. To do so, fix G ∈ F and define

E = {F ∈ F | lim_{n→∞} P (F ∩ T^{-n}(G)) = P (F )P (G)}.

Similarly to our earlier arguments, we find that E is a Dynkin class. As E contains H,
Dynkin’s lemma yields F = σ(H) ⊆ E. This shows lim_{n→∞} P (F ∩ T^{-n}(G)) = P (F )P (G)
when F, G ∈ F and so proves the third claim. By similar arguments, we obtain the two
remaining claims.

Combining Theorem 2.2.5 and Lemma 2.2.6, we find that in order to show ergodicity of T ,
it suffices to show that T is mixing or weakly mixing for events in a generating system for F
which is stable under finite intersections. This is in several cases a viable method for proving
ergodicity.
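As an illustrative aside, the criteria above lend themselves to numerical experimentation. The Python sketch below (not part of the formal development; the intervals, sample size and map are arbitrary choices) estimates P (F ∩ T^{-n}(G)) by Monte Carlo for the doubling map T (x) = 2x − [2x] on [0, 1) with the Lebesgue measure, a map which Exercise 2.10 asks the reader to show is mixing:

```python
import random

random.seed(0)

# The doubling map T(x) = 2x mod 1 on [0, 1) preserves the Lebesgue measure.
# We estimate P(F ∩ T^{-n}(G)) for F = [0.1, 0.4) and G = [0.5, 0.8), using
# that x lies in T^{-n}(G) if and only if T^n(x) lies in G.
def T_n(x, n):
    for _ in range(n):
        x = (2.0 * x) % 1.0
    return x

xs = [random.random() for _ in range(100_000)]  # samples from the Lebesgue measure

def joint_prob(n):
    hits = sum(1 for x in xs if 0.1 <= x < 0.4 and 0.5 <= T_n(x, n) < 0.8)
    return hits / len(xs)

for n in (1, 5, 10):
    print(n, round(joint_prob(n), 3))
```

For n = 1 the exact value is 0.15, whereas for larger n the estimates settle near P (F )P (G) = 0.09, consistent with the mixing property.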

2.3 Stationary processes and the law of large numbers

We will now apply the results from Section 2.1 and Section 2.2 to obtain laws of large numbers
for the class of processes known as stationary processes. In order to do so, we first need to
investigate in what sense we can consider the simultaneous distribution of an entire process
(Xn ). Once we have done so, we will be able to obtain our main results by applying the
ergodic theorem to this simultaneous distribution.

The results require some formalism. By Rn for n ≥ 1, we denote the n-fold product of R, the
set of n-tuples with elements from R. Analogously, we define R∞ as the set of all sequences
of real numbers, in the sense that R∞ = {(xn )n≥1 | xn ∈ R for all n ≥ 1}. Recall that
the Borel σ-algebra on Rn , defined as the smallest σ-algebra containing all open sets, is
also given as the smallest σ-algebra making all coordinate projections measurable. In analogy

with this, we make the following definition of the Borel σ-algebra on R∞ . By X̂n : R∞ → R,
we denote the n’th coordinate projection of R∞ , X̂n (x) = xn , where x = (xn )n≥1 .

Definition 2.3.1. The infinite-dimensional Borel σ-algebra, B∞ , is the smallest σ-algebra


making X̂n measurable for all n ≥ 1.

In detail, Definition 2.3.1 states the following. Let A be the family of all σ-algebras G on R∞
such that for all n ≥ 1, X̂n is G-B measurable. B∞ is then the smallest σ-algebra in the set
A of σ-algebras, explicitly constructed as B∞ = ∩G∈A G.

In the following lemmas, we prove some basic results on the measure space (R∞ , B∞ ). In
Lemma 2.3.2, a generating family which is stable under finite intersections is identified, and
in Lemma 2.3.3, the mappings which are measurable with respect to B∞ are identified. In
Lemma 2.3.4, we show how we can apply B∞ to describe and work with stochastic processes.

Lemma 2.3.2. Let K be a generating family for B which is stable under finite intersec-
tions. Define H as the family of sets {x ∈ R∞ | x1 ∈ B1 , . . . , xn ∈ Bn }, where n ≥ 1 and
B1 ∈ K, . . . , Bn ∈ K. H is then a generating family for B∞ which is stable under finite
intersections.

Proof. It is immediate that H is stable under finite intersections. Note that if F is a set such
that F = {x ∈ R∞ | x1 ∈ B1 , . . . , xn ∈ Bn } for some n ≥ 1 and B1 ∈ K, . . . , Bn ∈ K, we also
have

F = {x ∈ R∞ | x1 ∈ B1 , . . . , xn ∈ Bn }
  = {x ∈ R∞ | X̂1 (x) ∈ B1 , . . . , X̂n (x) ∈ Bn }
  = {x ∈ R∞ | x ∈ X̂1^{-1}(B1 ), . . . , x ∈ X̂n^{-1}(Bn )}
  = ∩_{k=1}^n X̂k^{-1}(Bk ).

Therefore, H ⊆ B∞ , and so σ(H) ⊆ B∞ . It remains to argue that B∞ ⊆ σ(H). To this
end, fix n ≥ 1 and note that H contains X̂n^{-1}(B) for all B ∈ K. Therefore, σ(H) contains
X̂n^{-1}(B) for all B ∈ K. As {B ∈ B | X̂n^{-1}(B) ∈ σ(H)} is a σ-algebra which contains K, we
conclude that it also contains B. Thus, σ(H) contains X̂n^{-1}(B) for all B ∈ B, and so σ(H)
is a σ-algebra on R∞ making all coordinate projections measurable. As B∞ is the smallest
such σ-algebra, we conclude B∞ ⊆ σ(H). All in all, we obtain B∞ = σ(H), as desired.

Lemma 2.3.3. Let X : Ω → R∞ . X is F-B∞ measurable if and only if X̂n ◦ X is F-B


measurable for all n ≥ 1.

Proof. First assume that X is F-B∞ measurable. As X̂n is B∞ -B measurable by definition,
we find that X̂n ◦ X is F-B measurable. Conversely, assume that X̂n ◦ X is F-B measurable
for all n ≥ 1; we wish to show that X is F-B∞ measurable. To this end, it suffices to
show that X^{-1}(A) ∈ F for all A in a generating family for B∞ . Define H by putting
H = {X̂n^{-1}(B) | n ≥ 1, B ∈ B}; H is then a generating family for B∞ . For any n ≥ 1 and
B ∈ B, we have X^{-1}(X̂n^{-1}(B)) = (X̂n ◦ X)^{-1}(B) ∈ F by our assumptions. Thus, X is F-B∞
measurable, as was to be proven.

Lemma 2.3.4. Let (Xn ) be a stochastic process. Defining a mapping X : Ω → R∞ by putting


X(ω) = (Xn (ω))n≥1 , it holds that X is F-B∞ measurable.

Proof. As X̂n ◦ X = Xn and Xn is F-B measurable by assumption, the result follows from
Lemma 2.3.3.

Letting (Xn )n≥1 be a stochastic process, Lemma 2.3.4 shows that with X : Ω → R∞ defined
by X(ω) = (Xn (ω))n≥1 , X is F-B∞ measurable, and therefore, the image measure X(P )
is well-defined. This motivates the following definition of the distribution of a stochastic
process.

Definition 2.3.5. Let (Xn )n≥1 be a stochastic process. The distribution of (Xn )n≥1 is
the probability measure X(P ) on B∞ .

Utilizing the above definitions and results, we can now state our plan for the main results
to be shown later in this section. Recall that one of our goals for this section is to prove an
extension of the law of large numbers. The method we will apply is the following. Consider
a stochastic process (Xn ). The introduction of the infinite-dimensional Borel-σ-algebra and
the measurability result in Lemma 2.3.4 have allowed us in Definition 2.3.5 to introduce the
concept of the distribution of a process. In particular, we have at our disposal a probability
space (R∞ , B∞ , X(P )). If we can identify a suitable transformation T : R∞ → R∞ such that
T is measure preserving and ergodic for X(P ), we will be able to apply Theorem 2.1.7 to
obtain a type of law of large numbers with X(P ) almost sure convergence and convergence in
L^p (R∞ , B∞ , X(P )). If we afterwards succeed in transferring the results from the probability
space (R∞ , B∞ , X(P )) back to the probability space (Ω, F, P ), we will have achieved our
goal.

Lemma 2.3.6. Let (Xn ) be a stochastic process. Define X : Ω → R∞ by X(ω) = (Xn (ω))n≥1 .
The image measure X(P ) is the unique probability measure on B∞ such that for all n ≥ 1

and all B1 ∈ B, . . . , Bn ∈ B, it holds that

P (X1 ∈ B1 , . . . , Xn ∈ Bn ) = X(P )(∩_{k=1}^n X̂k^{-1}(Bk )).                     (2.6)

Proof. Uniqueness follows from Lemma 2.3.2 and the uniqueness theorem for probability
measures. It remains to show that X(P ) satisfies (2.6). To this end, we note that

X(P )(∩_{k=1}^n X̂k^{-1}(Bk )) = P (X^{-1}(∩_{k=1}^n X̂k^{-1}(Bk ))) = P (∩_{k=1}^n X^{-1}(X̂k^{-1}(Bk )))
                              = P (∩_{k=1}^n (X̂k ◦ X)^{-1}(Bk )) = P (∩_{k=1}^n (Xk ∈ Bk ))
                              = P (X1 ∈ B1 , . . . , Xn ∈ Bn ).

This completes the proof.

Lemma 2.3.6 may appear rather abstract at a first glance. A clearer statement might be
obtained by noting that ∩_{k=1}^n X̂k^{-1}(Bk ) = B1 × · · · × Bn × R∞ . The lemma then states
that the distribution X(P ) is the only probability measure on B∞ which assigns to a
“finite-dimensional rectangle” of the form B1 × · · · × Bn × R∞ the same measure as
P (X1 ∈ B1 , . . . , Xn ∈ Bn ), a property reminiscent of the characterizing feature of the
distribution of an ordinary finite-dimensional random variable.

Using the above, we may now formalize the notion of a stationary process. First, we define
θ : R∞ → R∞ by putting θ((xn )n≥1 ) = (xn+1 )n≥1 . We refer to θ as the shift operator.
Note that by Lemma 2.3.3, θ is B∞ -B∞ measurable. The mapping θ will play the role of the
measure preserving and ergodic transformation in our later use of Theorem 2.1.7.

Definition 2.3.7. Let (Xn ) be a stochastic process. We say that (Xn ) is a stationary process,
or simply stationary, if it holds that θ is measure preserving for the distribution of (Xn ). We
say that a stationary process is ergodic if θ is ergodic for the distribution of (Xn ).

According to Definition 2.3.7, the property of being stationary is related to the measure
preservation property of the mapping θ on B∞ in relation to the measure X(P ) on B∞ , and
the property of being ergodic is related to the invariant σ-algebra of θ, which is a sub-σ-
algebra of B∞ . It is these conceptions of stationarity and ergodicity we will be using when
formulating our laws of large numbers. However, for practical use, it is convenient to be able
to express stationarity and ergodicity in terms of the probability space (Ω, F, P ) instead of
(R∞ , B∞ , X(P )). The following results will allow us to do so.

Lemma 2.3.8. Let (Xn ) be a stochastic process. The following are equivalent.

(1). (Xn ) is stationary.

(2). (Xn )n≥1 and (Xn+1 )n≥1 have the same distribution.

(3). For all k ≥ 1, (Xn )n≥1 and (Xn+k )n≥1 have the same distribution.

Proof. We first prove that (1) implies (3). Assume that (Xn ) is stationary and fix k ≥ 1.
Define a process Y by setting Y = (Xn+k )n≥1 ; we then also have Y = θ^k ◦ X. As (Xn ) is
stationary, θ is X(P )-measure preserving. By an application of Theorem A.2.13, this yields
Y (P ) = (θ^k ◦ X)(P ) = θ^k (X(P )) = X(P ), showing that (Xn )n≥1 and (Xn+k )n≥1 have the
same distribution, and so proving that (1) implies (3).

As it is immediate that (3) implies (2), we find that in order to complete the proof, it suffices
to show that (2) implies (1). Therefore, assume that (Xn )n≥1 and (Xn+1 )n≥1 have the same
distribution, meaning that X(P ) and Y (P ) are equal, where Y = (Xn+1 )n≥1 . We then obtain
θ(X(P )) = (θ ◦ X)(P ) = Y (P ) = X(P ), so θ is X(P )-measure preserving. This proves that
(2) implies (1), as desired.

An important consequence of Lemma 2.3.8 is the following.

Lemma 2.3.9. Let (Xn ) be a stationary stochastic process. For all k ≥ 1 and n ≥ 1,
(X1 , . . . , Xn ) has the same distribution as (X1+k , . . . , Xn+k ).

Proof. Fix k ≥ 1 and n ≥ 1. Let Y = (Xn+k )n≥1 . By Lemma 2.3.8, it holds that (Xn )n≥1
and (Yn )n≥1 have the same distribution. Let ϕ : R∞ → Rn denote the projection onto the
first n coordinates of R∞ . Using Theorem A.2.13, we then obtain

(X1 , . . . , Xn )(P ) = ϕ(X)(P ) = ϕ(X(P )) = ϕ(Y (P ))
                      = ϕ(Y )(P ) = (Y1 , . . . , Yn )(P ) = (X1+k , . . . , Xn+k )(P ),

proving that (X1 , . . . , Xn ) has the same distribution as (X1+k , . . . , Xn+k ), as was to be
shown.

Next, we consider a more convenient formulation of ergodicity for a stationary process.

Definition 2.3.10. Let (Xn ) be a stationary process. The invariant σ-algebra I(X) for the
process is defined by I(X) = {X −1 (B) | B ∈ B∞ , B is invariant for θ}.

Lemma 2.3.11. Let (Xn ) be a stationary process. (Xn ) is ergodic if and only if it holds that
for all F ∈ I(X), P (F ) is either zero or one.

Proof. First assume that (Xn ) is ergodic, meaning that θ is ergodic for X(P ). This means
that with Iθ denoting the invariant σ-algebra for θ on B∞ , X(P )(B) is either zero or one for
all B ∈ Iθ . Now let F ∈ I(X), we then have F = (X ∈ B) for some B ∈ Iθ , so we obtain
P (F ) = P (X ∈ B) = X(P )(B), which is either zero or one. This proves the first implication.
Next, assume that for all F ∈ I(X), P (F ) is either zero or one. We wish to show that (Xn )
is ergodic. Let B ∈ Iθ . We then obtain X(P )(B) = P (X −1 (B)), which is either zero or one
as X −1 (B) ∈ I(X). Thus, (Xn ) is ergodic.

Lemma 2.3.8 and Lemma 2.3.11 show how to reformulate the definitions in Definition 2.3.7
more concretely in terms of the probability space (Ω, F, P ) and the process (Xn ). We are now
ready to use the ergodic theorem to obtain a law of large numbers for stationary processes.

Theorem 2.3.12 (Ergodic theorem for ergodic stationary processes). Let (Xn ) be an ergodic
stationary process, let p ≥ 1, and let f : R∞ → R be some B∞ -B measurable mapping. If
f ((Xn )n≥1 ) has p’th moment, then (1/n) Σ_{k=1}^n f ((Xi )i≥k ) converges almost surely and in
L^p to Ef ((Xi )i≥1 ).

Proof. We first investigate what may be obtained by using the ordinary ergodic theorem of
Theorem 2.1.7. Let P̂ = X(P ), the distribution of (Xn ). By our assumptions, θ is P̂ -measure
preserving and ergodic. Also, f is a random variable on the probability space (R∞ , B∞ , P̂ ),
and

∫ |f |^p dP̂ = ∫ |f |^p dX(P ) = ∫ |f ◦ X|^p dP = E|f (X)|^p = E|f ((Xn )n≥1 )|^p ,

which is finite by our assumptions. Thus, considered as a random variable on (R∞ , B∞ , P̂ ),
f has p’th moment. Letting µ = Ef ((Xn )n≥1 ), Theorem 2.1.7 yields

lim_{n→∞} (1/n) Σ_{k=1}^n f ◦ θ^{k-1} = ∫ f dP̂ = Ef ((Xn )n≥1 ) = µ,

in the sense of P̂ almost sure convergence and convergence in L^p (R∞ , B∞ , P̂ ). These are limit
results on the probability space (R∞ , B∞ , P̂ ). We would like to transfer these results to our
original probability space (Ω, F, P ). We first consider the case of almost sure convergence.
We wish to argue that (1/n) Σ_{k=1}^n f ((Xi )i≥k ) converges P -almost surely to µ. To do so, first

note that

((1/n) Σ_{k=1}^n f ((Xi )i≥k ) converges to µ) = {ω ∈ Ω | lim_{n→∞} (1/n) Σ_{k=1}^n f ((Xi )i≥k (ω)) = µ}
                                              = {ω ∈ Ω | lim_{n→∞} (1/n) Σ_{k=1}^n f ((Xi (ω))i≥k ) = µ}
                                              = {ω ∈ Ω | lim_{n→∞} (1/n) Σ_{k=1}^n f (θ^{k-1}(X(ω))) = µ},

and this final set is equal to X^{-1}(A), where

A = {x ∈ R∞ | lim_{n→∞} (1/n) Σ_{k=1}^n f (θ^{k-1}(x)) = µ},

or, with our usual probabilistic notation, A = (lim_{n→∞} (1/n) Σ_{k=1}^n f ◦ θ^{k-1} = µ). Therefore, we
obtain

P ((1/n) Σ_{k=1}^n f ((Xi )i≥k ) converges to µ) = P (X^{-1}(A)) = X(P )(A)
                                                = P̂ (A) = P̂ (lim_{n→∞} (1/n) Σ_{k=1}^n f ◦ θ^{k-1} = µ),

and the latter is equal to one by the P̂ -almost sure convergence of (1/n) Σ_{k=1}^n f ◦ θ^{k-1} to µ. This
proves P -almost sure convergence of (1/n) Σ_{k=1}^n f ((Xi )i≥k ) to µ. Next, we consider convergence
in L^p . Here, we need lim_{n→∞} E|µ − (1/n) Σ_{k=1}^n f ((Xi )i≥k )|^p = 0. To obtain this, we note
that for any ω ∈ Ω, it holds that

|µ − (1/n) Σ_{k=1}^n f ((Xi )i≥k )|^p (ω) = |µ − (1/n) Σ_{k=1}^n f ((Xi (ω))i≥k )|^p
                                          = |µ − (1/n) Σ_{k=1}^n f (θ^{k-1}(X(ω)))|^p
                                          = (|µ − (1/n) Σ_{k=1}^n f ◦ θ^{k-1}|^p)(X(ω)),

yielding

|µ − (1/n) Σ_{k=1}^n f ((Xi )i≥k )|^p = (|µ − (1/n) Σ_{k=1}^n f ◦ θ^{k-1}|^p) ◦ X,

and so the change-of-variables formula allows us to conclude that

lim_{n→∞} E|µ − (1/n) Σ_{k=1}^n f ((Xi )i≥k )|^p = lim_{n→∞} ∫ |µ − (1/n) Σ_{k=1}^n f ((Xi )i≥k )|^p dP
                                                 = lim_{n→∞} ∫ (|µ − (1/n) Σ_{k=1}^n f ((X̂i )i≥k )|^p) ◦ X dP
                                                 = lim_{n→∞} ∫ |µ − (1/n) Σ_{k=1}^n f ((X̂i )i≥k )|^p dX(P )
                                                 = lim_{n→∞} ∫ |µ − (1/n) Σ_{k=1}^n f ((X̂i )i≥k )|^p dP̂ = 0,

by the L^p (R∞ , B∞ , P̂ )-convergence of (1/n) Σ_{k=1}^n f ((X̂i )i≥k ) to µ. This demonstrates the desired
convergence in L^p and so concludes the proof of the theorem.

Theorem 2.3.12 is the main theorem of this section. As the following corollary shows, a
simpler version of the theorem is obtained by applying the theorem to a particular type of
function from R∞ to R.
Corollary 2.3.13. Let (Xn ) be an ergodic stationary process, let p ≥ 1, and let f : R → R be
some Borel measurable mapping. If f (X1 ) has p’th moment, then (1/n) Σ_{k=1}^n f (Xk ) converges
almost surely and in L^p to Ef (X1 ).

Proof. Define g : R∞ → R by putting g((xn )n≥1 ) = f (x1 ). Then g = f ◦ X̂1 , so as f is B-B
measurable and X̂1 is B∞ -B measurable, g is B∞ -B measurable. Also, g((Xi )i≥1 ) = f (X1 ),
which has p’th moment by assumption. Therefore, Theorem 2.3.12 allows us to conclude that
(1/n) Σ_{k=1}^n g((Xi )i≥k ) converges almost surely and in L^p to Ef (X1 ). As g((Xi )i≥k ) = f (Xk ),
this yields the desired conclusion.

Theorem 2.3.12 and Corollary 2.3.13 yield powerful convergence results for stationary and
ergodic processes. Next, we show that our results contain the strong law of large numbers
for independent and identically distributed variables as a special case. In addition, we also
obtain L^p convergence of the empirical means. To show this result, we need to prove that
sequences of independent and identically distributed variables are stationary and ergodic.
Corollary 2.3.14. Let (Xn ) be a sequence of independent, identically distributed variables.
Then (Xn ) is stationary and ergodic. Assume furthermore that Xn has p’th moment for some
p ≥ 1, and let µ be the common mean. Then (1/n) Σ_{k=1}^n Xk converges to µ almost surely and
in L^p .

Proof. We first show that (Xn ) is stationary. Let ν denote the common distribution of the
Xn . Let X = (Xn )n≥1 and Y = (Xn+1 )n≥1 . Fix n ≥ 1 and B1 , . . . , Bn ∈ B; we then obtain

Y (P )(∩_{k=1}^n X̂k^{-1}(Bk )) = P (X2 ∈ B1 , . . . , Xn+1 ∈ Bn ) = Π_{i=1}^n ν(Bi )
                               = P (X1 ∈ B1 , . . . , Xn ∈ Bn ) = X(P )(∩_{k=1}^n X̂k^{-1}(Bk )),

so by Lemma 2.3.2 and the uniqueness theorem for probability measures, we conclude that
Y (P ) = X(P ), and thus (Xn ) is stationary. Next, we show that (Xn ) is ergodic. Let I(X)
denote the invariant σ-algebra for (Xn ), and let J denote the tail-σ-algebra for (Xn ), see
Definition 1.3.9. Let F ∈ I(X); we then have F = (X ∈ B) for some B ∈ Iθ , where Iθ is
the invariant σ-algebra on R∞ for the shift operator. Therefore, for any n ≥ 1, we obtain

(X ∈ B) = (X ∈ θ^{-n}(B)) = (θ^n (X) ∈ B)
        = ((Xn+1 , Xn+2 , . . .) ∈ B) ∈ σ(Xn+1 , Xn+2 , . . .).                     (2.7)

As n is arbitrary in (2.7), we conclude (X ∈ B) ∈ J , and as a consequence, I(X) ⊆ J . Now
recalling from Theorem 1.3.10 that P (F ) is either zero or one for all F ∈ J , we obtain that
whenever F ∈ I(X), P (F ) is either zero or one as well. By Lemma 2.3.11, this shows that
(Xn ) is ergodic. Corollary 2.3.13 yields the remaining claims of the corollary.
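As a numerical aside (not from the text; the exponential distribution and the sample sizes are arbitrary choices), the law of large numbers in Corollary 2.3.14 is easy to illustrate by simulation:

```python
import random

random.seed(1)

# Empirical means of iid Exponential(1) variables; by Corollary 2.3.14 they
# converge almost surely (and in L^p) to the common mean µ = 1.
xs = [random.expovariate(1.0) for _ in range(100_000)]

def empirical_mean(n):
    return sum(xs[:n]) / n

for n in (100, 10_000, 100_000):
    print(n, round(empirical_mean(n), 3))
```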

In order to apply Theorem 2.3.12 and Corollary 2.3.13 in general, we need results on how
to prove stationarity and ergodicity. As the final theme of this section, we show two such
results.

Lemma 2.3.15. Let (Xn ) be stationary. Assume that for all m, p ≥ 1, A1 , . . . , Am ∈ B and
B1 , . . . , Bp ∈ B, one of the following holds:

(1). With F = ∩_{i=1}^m (Xi ∈ Ai ) and Gk = ∩_{i=1}^p (Xi+k−1 ∈ Bi ) for k ≥ 1, it holds that
lim_{n→∞} (1/n) Σ_{k=1}^n P (F ∩ Gk ) = P (F )P (G1 ).

(2). With F = ∩_{i=1}^m (Xi ∈ Ai ) and Gk = ∩_{i=1}^p (Xi+k−1 ∈ Bi ) for k ≥ 1, it holds that
lim_{n→∞} (1/n) Σ_{k=1}^n |P (F ∩ Gk ) − P (F )P (G1 )| = 0.

(3). With F = ∩_{i=1}^m (Xi ∈ Ai ) and Gn = ∩_{i=1}^p (Xi+n ∈ Bi ) for n ≥ 1, it holds that
lim_{n→∞} P (F ∩ Gn ) = P (F )P (G1 ).

Then (Xn ) is ergodic.



Proof. We only prove the result in the case where the third convergence holds, as the other
two cases follow similarly. Therefore, assume that the third criterion holds, such that for all
m, p ≥ 1, A1 , . . . , Am ∈ B and B1 , . . . , Bp ∈ B, it holds that

lim_{n→∞} P (∩_{i=1}^m (Xi ∈ Ai ) ∩ ∩_{i=1}^p (Xi+n ∈ Bi )) = P (∩_{i=1}^m (Xi ∈ Ai ))P (∩_{i=1}^p (Xi ∈ Bi )).   (2.8)

We wish to show that (Xn ) is ergodic. Recall from Definition 2.3.7 that since (Xn ) is
stationary, θ is measure preserving for P̂ , where P̂ = X(P ). Also recall from Definition
2.3.7 that in order to show that (Xn ) is ergodic, we must show that θ is ergodic for P̂ . We
will apply Lemma 2.2.6 and Theorem 2.2.5 to the probability space (R∞ , B∞ , P̂ ) and the
transformation θ. Note that as θ is measure preserving for P̂ , Lemma 2.2.6 and Theorem
2.2.5 are applicable.

Define H as the family of sets {x ∈ R∞ | x1 ∈ B1 , . . . , xn ∈ Bn }, where n ≥ 1 and
B1 ∈ B, . . . , Bn ∈ B. By Lemma 2.3.2, H is then a generating family for B∞ which is
stable under finite intersections. By Lemma 2.2.6 and Theorem 2.2.5, θ is ergodic for P̂ if it
holds that for all F, G ∈ H,

lim_{n→∞} P̂ (F ∩ θ^{-n}(G)) = P̂ (F )P̂ (G).                                        (2.9)

However, for any F, G ∈ H, there are m, p ≥ 1 such that F = ∩_{i=1}^m X̂i^{-1}(Ai ) and
G = ∩_{i=1}^p X̂i^{-1}(Bi ), and so

P̂ (F ∩ θ^{-n}(G)) = X(P )(∩_{i=1}^m X̂i^{-1}(Ai ) ∩ θ^{-n}(∩_{i=1}^p X̂i^{-1}(Bi )))
                  = X(P )(∩_{i=1}^m X̂i^{-1}(Ai ) ∩ ∩_{i=1}^p X̂_{i+n}^{-1}(Bi ))
                  = P (∩_{i=1}^m Xi^{-1}(Ai ) ∩ ∩_{i=1}^p X_{i+n}^{-1}(Bi ))
                  = P (∩_{i=1}^m (Xi ∈ Ai ) ∩ ∩_{i=1}^p (Xi+n ∈ Bi )),

p
and similarly, we obtain P̂ (F ) = P (∩m i=1 (Xi ∈ Ai )) and P̂ (G) = P (∩i=1 (Xi ∈ Ai )). Thus,
−1 p −1
for F, G ∈ H with F = ∩m i=1 X̂i (Ai ) and G = ∩i=1 X̂i (Bi ), (2.9) is equivalent (2.8). As
we have assumed that (2.8) holds for all m, p ≥ 1, A1 , . . . , Am ∈ B and B1 , . . . , Bp ∈ Bm , we
conclude that (2.9) holds for all F, G ∈ H. Lemma 2.2.6 then allows us to conclude that (2.9)
holds for all F, G ∈ B∞ , and Theorem 2.2.5 then allows us to conclude that θ is ergodic for
P̂ , so that (Xn ) is ergodic, as desired.

Lemma 2.3.16. Let (Xn ) be a sequence of random variables. Let ϕ : R∞ → R be measurable,


and define a sequence of random variables (Yn ) by putting Yn = ϕ(Xn , Xn+1 , . . .). If (Xn ) is
stationary, then (Yn ) is stationary. And if (Xn ) is both stationary and ergodic, then (Yn ) is
both stationary and ergodic.

Proof. We first derive a formal expression for the sequence (Yn ) in terms of (Xn ). Define a
mapping Φ : R∞ → R∞ by putting, for k ≥ 1, Φ((xi )i≥1 )k = ϕ((xi )i≥k ). Equivalently, we
also have Φ((xi )i≥1 )k = (ϕ ◦ θ^{k-1})((xi )i≥1 ). As θ is B∞ -B∞ measurable by Lemma 2.3.3
and ϕ is B∞ -B measurable, Φ has B∞ -B measurable coordinates, and so is B∞ -B∞ measurable,
again by Lemma 2.3.3. And we have (Yn ) = Φ((Xn )n≥1 ).

Now assume that (Xn ) is stationary. Let P̂ be the distribution of (Xn ), and let Q̂ be the
distribution of (Yn ). By Definition 2.3.7, our assumption that (Xn ) is stationary means that
θ is measure preserving for P̂ , and in order to show that (Yn ) is stationary, we must show
that θ is measure preserving for Q̂. To do so, we note that for all k ≥ 1, it holds that

θ(Φ((xi )i≥1 ))k = Φ((xi )i≥1 )k+1 = ϕ(θ^k ((xi )i≥1 )) = ϕ(θ^{k-1}(θ((xi )i≥1 ))) = Φ(θ((xi )i≥1 ))k ,

which means that θ ◦ Φ = Φ ◦ θ, and so, since θ is measure preserving for P̂ ,

θ(Q̂) = θ(Φ(P̂ )) = (θ ◦ Φ)(P̂ ) = (Φ ◦ θ)(P̂ ) = Φ(P̂ ) = Q̂,

proving that θ also is measure preserving for Q̂, so (Yn ) is stationary. Next, assume that
(Xn ) is ergodic. By Definition 2.3.7, this means that all elements of the invariant σ-algebra
Iθ of θ has P̂ measure zero or one. We wish to show that (Yn ) is ergodic, which means
that we need to show that all elements of Iθ has Q̂ measure zero or one. Let A ∈ Iθ , such
that θ−1 (A) = A. We then have Q̂(A) = P̂ (Φ−1 (A)), so it suffices to show that Φ−1 (A) is
invariant for θ, and this follows as

θ−1 (Φ−1 (A)) = (Φ ◦ θ)−1 (A) = (θ ◦ Φ)−1 (A) = Φ−1 (θ−1 (A)) = Φ−1 (A).

Thus, Φ−1 (A) is invariant for θ. As θ is ergodic for P̂ , P̂ (Φ−1 (A)) is either zero or one, and
so Q̂(A) is either zero or one. Therefore, θ is ergodic for Q̂. This shows that (Yn ) is ergodic,
as desired.

We end the section with an example showing how to apply the ergodic theorem to obtain
limit results for empirical averages for a practical case of a process consisting of variables
which are not independent.

Example 2.3.17. Let (Xn ) be a sequence of independent and identically distributed
variables concentrated on {0, 1} with P (Xn = 1) = p. The elements of the sequence
(Xn Xn+1 )n≥1 then have the same distribution for each n ≥ 1, but they are not independent.
We will use the results of this section to examine the behaviour of (1/n) Σ_{k=1}^n Xk Xk+1 .
By Corollary 2.3.14, (Xn ) is stationary and ergodic. Define a mapping f : R∞ → R by
putting f ((xn )n≥1 ) = x1 x2 ; f is then B∞ -B measurable, and f ((Xi )i≥1 ) = X1 X2 . Noting
that EX1 X2 = p^2 and that X1 X2 has moments of all orders, Theorem 2.3.12 shows that
(1/n) Σ_{k=1}^n Xk Xk+1 converges to p^2 almost surely and in L^p for all p ≥ 1. ◦
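A small simulation (an aside, not part of the text; p = 0.3 and the sample size are arbitrary choices) agrees with this limit:

```python
import random

random.seed(2)

# X_k iid Bernoulli(p); by Example 2.3.17 the averages (1/n) sum_{k=1}^n X_k X_{k+1}
# converge to p^2, here p^2 = 0.09 for p = 0.3.
p = 0.3
n = 100_000
xs = [1 if random.random() < p else 0 for _ in range(n + 1)]

def avg_products(n):
    return sum(xs[k] * xs[k + 1] for k in range(n)) / n

print(round(avg_products(n), 3))
```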

2.4 Exercises

Exercise 2.1. Consider the probability space ([0, 1), B[0,1) , P ) where P is the Lebesgue
measure. Define T (x) = 2x − [2x] and S(x) = x + λ − [x + λ], λ ∈ R. Here, [x] is the unique
integer satisfying [x] ≤ x < [x] + 1. Show that T and S are P -measure preserving. ◦
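As an aside, the measure preservation asserted in Exercise 2.1 can be probed empirically (an illustration, of course, not a proof): push uniform samples through T and S and compare the resulting empirical distribution function with that of the uniform distribution. The value of λ below is an arbitrary choice.

```python
import random

random.seed(3)

# T(x) = 2x mod 1 and S(x) = x + lam mod 1 on [0, 1); if X is uniform, then
# T(X) and S(X) should again be uniform, so P(T(X) <= a) and P(S(X) <= a)
# should both be close to a.
lam = 0.3
xs = [random.random() for _ in range(100_000)]
Tx = [(2.0 * x) % 1.0 for x in xs]
Sx = [(x + lam) % 1.0 for x in xs]

def prob_below(ys, a):
    return sum(1 for y in ys if y <= a) / len(ys)

a = 0.37
print(round(prob_below(Tx, a), 3), round(prob_below(Sx, a), 3))  # both close to a
```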

Exercise 2.2. Define T : [0, 1) → [0, 1) by letting T (x) = 1/x − [1/x] for x > 0 and zero
otherwise. Show that T is Borel measurable. Define P as the nonnegative measure on
([0, 1), B[0,1) ) with density t 7→ 1/((log 2)(1 + t)) with respect to the Lebesgue measure.
Show that P is a probability measure, and show that T is measure preserving for P . ◦
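As a numerical aside (an illustration, not a solution to the exercise), one can sample from P via the inverse distribution function u ↦ 2^u − 1 and check empirically that T (X) again has distribution P :

```python
import math
import random

random.seed(4)

# The map T(x) = 1/x - [1/x] and the measure P with density 1/((log 2)(1 + t)).
# The distribution function of P is F(t) = log(1 + t)/log 2, so samples from P
# are obtained as 2^U - 1 with U uniform.  We compare P(T(X) <= a) with
# F(a) = log(1 + a)/log 2.
def T(x):
    return (1.0 / x) % 1.0 if x > 0 else 0.0

xs = [2.0 ** random.random() - 1.0 for _ in range(100_000)]  # samples from P

def frac_below(a):
    return sum(1 for x in xs if T(x) <= a) / len(xs)

a = 0.5
print(round(frac_below(a), 3), round(math.log(1 + a) / math.log(2), 3))
```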

Exercise 2.3. Define T : [0, 1] → [0, 1] by putting T (x) = x/2 for x > 0 and one otherwise.
Show that there is no probability measure P on ([0, 1], B[0,1] ) such that T is measure
preserving for P . ◦

Exercise 2.4. Consider the probability space ([0, 1), B[0,1) , P ) where P is the Lebesgue
measure. Define T : [0, 1) → [0, 1) by T (x) = x + λ − [x + λ]. T is then P -measure preserving.
Show that if λ is rational, T is not ergodic. ◦

Exercise 2.5. Let (Ω, F, P ) be a probability space and let T be measure preserving. Let
X be an integrable random variable and assume that X ◦ T ≤ X almost surely. Show that
X = X ◦ T almost surely. ◦

Exercise 2.6. Let (Ω, F, P ) be a probability space and let T : Ω → Ω be measurable.


Assume that T is measure preserving. Show that if T 2 is ergodic, T is ergodic as well. ◦

Exercise 2.7. Give an example of a probability space (Ω, F, P ) and a measurable mapping
T : Ω → Ω such that T 2 is measure preserving but T is not measure preserving. ◦

Exercise 2.8. Let (Ω, F, P ) be a probability space and let T be measurable and measure
preserving. We may then think of T as a random variable with values in (Ω, F). Let F ∈ F.

(1). Show that (T^n ∈ F i.o.) is invariant.

(2). Show that for n ≥ 1, (∪_{k=0}^∞ T^{-k}(F )) \ (∪_{k=n}^∞ T^{-k}(F )) is a null set.

(3). Show that F ∩ (T^k ∈ F^c evt.) is a null set.

(4). Assume that P (F ) > 0. Show that if T is ergodic, P (T^n ∈ F i.o.) = 1.


Exercise 2.9. Let (Ω, F, P ) be a probability space and let T : Ω → Ω be measurable.
Assume that T is measure preserving. Show that the mapping T is ergodic if and only if it
holds for all random variables X and Y such that X is integrable and Y is bounded that
lim_{n→∞} (1/n) Σ_{k=1}^n EY (X ◦ T^{k-1}) = (EY )(EX). ◦

Exercise 2.10. Consider the probability space ([0, 1), B[0,1) , P ) where P is the Lebesgue
measure. Define T : [0, 1) → [0, 1) by T (x) = 2x − [2x]. T is then P -measure preserving.
Show that T is mixing. ◦

Exercise 2.11. Let (Ω1 , F1 , P1 ) and (Ω2 , F2 , P2 ) be two probability spaces. Consider two
measurable mappings T1 : Ω1 → Ω1 and T2 : Ω2 → Ω2 . Assume that T1 is P1 -measure
preserving and that T2 is P2 -measure preserving. Define a probability space (Ω, F, P ) by
putting (Ω, F, P ) = (Ω1 × Ω2 , F1 ⊗ F2 , P1 ⊗ P2 ). Define a mapping T : Ω → Ω by putting
T (ω1 , ω2 ) = (T1 (ω1 ), T2 (ω2 )).

(1). Show that T is P -measure preserving.

(2). Let IT1 , IT2 and IT be the invariant σ-algebras for T1 , T2 and T . Show that the
inclusion IT1 ⊗ IT2 ⊆ IT holds.

(3). Argue that if T is ergodic, both T1 and T2 are ergodic.

(4). Argue that T is mixing if and only if both T1 and T2 are mixing.

Exercise 2.12. Let (Xn ) be a stationary process. Fix B ∈ B. Show that (Xn ∈ B i.o.) is
in I(X). ◦

Exercise 2.13. Let (Xn ) and (Yn ) be two stationary processes. Let U be a random variable
concentrated on {0, 1} with P (U = 1) = p, and assume that U is independent of X and
independent of Y . Define Zn = Xn 1(U =0) + Yn 1(U =1) . Show that (Zn ) is stationary. ◦

Exercise 2.14. We say that a process (Xn ) is weakly stationary if it holds that Xn has
second moment for all n ≥ 1, EXn = EXk for all n, k ≥ 1 and Cov(Xn , Xk ) = γ(|n − k|) for

some γ : N0 → R. Now assume that (Xn ) is some process such that Xn has second moment
for all n ≥ 1. Show that if (Xn ) is stationary, (Xn ) is weakly stationary. ◦

Exercise 2.15. We say that a process (Xn ) is Gaussian if all of its finite-dimensional
distributions are Gaussian. Let (Xn ) be some Gaussian process. Show that (Xn ) is stationary
if and only if it is weakly stationary in the sense of Exercise 2.14. ◦
Chapter 3

Weak convergence

In Chapter 1, in Definition 1.2.2, we introduced four modes of convergence for a sequence


of random variables: Convergence in probability, almost sure convergence, convergence in
Lp and convergence in distribution. Throughout most of the chapter, we concerned our-
selves solely with the first three modes of convergence. In this chapter, we instead focus
on convergence in distribution and the related notion of weak convergence of probability
distributions.

While our main results in Chapter 1 and Chapter 2 were centered around almost sure and L^p
convergence of (1/n) Σ_{k=1}^n Xk for various classes of processes (Xn ), the theory of weak
convergence covered in this chapter will instead allow us to understand the asymptotic
distribution of (1/n) Σ_{k=1}^n Xk , particularly through the combined results of Section 3.5 and
Section 3.6.

The chapter is structured as follows. In Section 3.1, we introduce weak convergence of


probability measures, and establish that convergence in distribution of random variables and
weak convergence of probability measures essentially are the same. In Section 3.2, Section 3.3
and Section 3.4, we investigate the fundamental properties of weak convergence, in the first
two sections outlining connections with cumulative distribution functions and convergence
in probability, and in the third section introducing the characteristic function and proving
the major result that weak convergence of probability measures is equivalent to pointwise
convergence of characteristic functions.

After this, in Section 3.5, we prove several versions of the central limit theorem, which in its
simplest form states that under certain regularity conditions, the empirical mean (1/n) Σ_{k=1}^n Xk
of independent and identically distributed random variables can for large n be approximated
by a normal distribution with the same mean and variance as (1/n) Σ_{k=1}^n Xk . This is arguably
the main result of the chapter, and is a result which is of great significance in practical
statistics. In Section 3.6, we introduce the notion of asymptotic normality, which provides a
convenient framework for understanding and working with the results of Section 3.5. Finally,
in Section 3.7, we state without proof some multidimensional analogues of the results of the
previous sections.

3.1 Weak convergence and convergence of measures

Recall from Definition 1.2.2 that for a sequence of random variables (Xn ) and another random
variable X, we say that Xn converges in distribution to X and write Xn →^D X when
lim_{n→∞} Ef (Xn ) = Ef (X) for all bounded, continuous mappings f : R → R. Our first
results of this section will show that convergence in distribution of random variables in a
certain sense is equivalent to a related mode of convergence for probability measures.

Definition 3.1.1. Let (µn ) be a sequence of probability measures on (R, B), and let µ be
another probability measure. We say that µn converges weakly to µ and write µn →^{wk} µ if it
holds for all bounded, continuous mappings f : R → R that lim_{n→∞} ∫ f dµn = ∫ f dµ.
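As an illustrative aside (not part of the formal development), weak convergence is easy to observe numerically. Let µn be the uniform distribution on {1/n, 2/n, . . . , n/n} and let µ be the uniform distribution on [0, 1]; then ∫ f dµn is a Riemann sum converging to ∫ f dµ for every bounded continuous f , so µn converges weakly to µ. The sketch below checks this for f = cos:

```python
import math

# The integral of f with respect to mu_n, for mu_n uniform on {1/n, ..., n/n},
# is the Riemann sum below; it converges to the integral of cos over [0, 1],
# which equals sin(1), illustrating weak convergence.
def int_f_mu_n(n):
    return sum(math.cos(k / n) for k in range(1, n + 1)) / n

exact = math.sin(1.0)
for n in (10, 100, 10_000):
    print(n, round(abs(int_f_mu_n(n) - exact), 6))
```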

Lemma 3.1.2. Let (Xn ) be a sequence of random variables and let X be another random
variable. Let µ denote the distribution of X, and for n ≥ 1, let µn denote the distribution of
Xn . Then Xn →^D X if and only if µn →^{wk} µ.

Proof. We have Ef (Xn ) = ∫ f ◦ Xn dP = ∫ f dXn (P ) = ∫ f dµn , and by similar arguments,
Ef (X) = ∫ f dµ. From these observations, the result follows.

Lemma 3.1.2 clarifies that convergence in distribution of random variables is a mode of con-
vergence depending only on the marginal distributions of the variables involved. In particular,
we may investigate the properties of convergence in distribution of random variables by inves-
tigating the properties of weak convergence of probability measures on (R, B). Lemma 3.1.2
also allows us to make sense of convergence of random variables to a probability measure
in the following manner: We say that Xn converges in distribution to µ for a sequence of
random variables (Xn) and a probability measure µ, and write Xn →D µ, if it holds that
µn →wk µ, where µn is the distribution of Xn.

The topic of weak convergence of probability measures in itself provides ample opportunities
for a rich mathematical theory. However, there is good reason for considering both conver-
gence in distribution of random variables and weak convergence of probability measures, in
spite of the apparent equivalence of the two concepts: Many results are formulated most
naturally in terms of random variables, particularly when transformations of the variables
are involved, and furthermore, expressing results in terms of convergence in distribution for
random variables often fits better with applications.

In the remainder of this section, we will prove some basic properties of weak convergence of
probability measures. Our first interest is to prove that weak limits of probability measures
are unique. By Cb (R), we denote the set of bounded, continuous mappings f : R → R, and
by Cbu (R), we denote the set of bounded, uniformly continuous mappings f : R → R. Note
that Cbu (R) ⊆ Cb (R).

Lemma 3.1.3. Assume given two intervals [a, b] ⊆ (c, d). There exists a function f ∈ Cbu (R)
such that 1[a,b] (x) ≤ f (x) ≤ 1(c,d) (x) for all x ∈ R.

Proof. As [a, b] ⊆ (c, d), we have that a ≤ x ≤ b implies c < x < d. In particular, c < a and
b < d. Then, to obtain the desired mapping, we simply define

 0

 for x ≤ c

 x−c
 a−c for c < x < a

f (x) = 1 for a ≤ x ≤ b ,
d−x

for b < x < d




 d−b
 0 for d ≤ x

and find that f possesses the required properties.
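The function from the proof is concrete enough to check mechanically. The following sketch (illustrative only; the particular intervals [0, 1] ⊆ (−1, 2) and the grid are arbitrary choices) verifies the sandwich property 1[a,b] ≤ f ≤ 1(c,d):

```python
def urysohn(x, a, b, c, d):
    # Piecewise linear function from the proof of Lemma 3.1.3:
    # 0 outside (c, d), 1 on [a, b], linear interpolation in between.
    if x <= c or x >= d:
        return 0.0
    if x < a:
        return (x - c) / (a - c)
    if x <= b:
        return 1.0
    return (d - x) / (d - b)

# Check 1_[a,b](x) <= f(x) <= 1_(c,d)(x) on a grid, for [0, 1] inside (-1, 2).
a, b, c, d = 0.0, 1.0, -1.0, 2.0
for k in range(-300, 301):
    x = k / 100.0
    ind_ab = 1.0 if a <= x <= b else 0.0
    ind_cd = 1.0 if c < x < d else 0.0
    assert ind_ab <= urysohn(x, a, b, c, d) <= ind_cd
print("sandwich property holds on the grid")
```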

The mappings whose existence are proved in Lemma 3.1.3 are known as Urysohn functions,
and are also occasionally referred to as bump functions, although this latter name in general
is reserved for functions which have continuous derivatives of all orders. Existence results of
this type often serve to show that continuous functions can be used to approximate other
types of functions. Note that if [a, b] ⊆ (c, d) and 1[a,b] (x) ≤ f (x) ≤ 1(c,d) (x) for all x ∈ R, it
then holds for x ∈ [a, b] that 1 = 1[a,b] (x) ≤ f (x) ≤ 1(c,d) (x) = 1, so that f (x) = 1. Likewise,
for x ∈/ (c, d), we have 0 = 1[a,b] (x) ≤ f (x) ≤ 1(c,d) (x) = 0, so f (x) = 0.

In the following lemma, we apply Lemma 3.1.3 to show a useful criterion for two probability
measures to be equal, from which we will obtain as an immediate corollary the uniqueness
of limits for weak convergence of probability measures.

Lemma 3.1.4. Let µ and ν be two probability measures on (R, B). If ∫ f dµ = ∫ f dν for
all f ∈ Cbu(R), then µ = ν.

Proof. Let µ and ν be probability measures on (R, B) and assume that ∫ f dµ = ∫ f dν for
all f ∈ Cbu(R). By the uniqueness theorem for probability measures, we find that in order to
prove that µ = ν, it suffices to show that µ((a, b)) = ν((a, b)) for all a < b. To this end, let
a < b be given. Now pick n ∈ N so large that a + 1/n < b − 1/n. By Lemma 3.1.3, there then
exists a mapping fn ∈ Cbu(R) such that 1[a+1/n,b−1/n] ≤ fn ≤ 1(a,b). By our assumptions, we
then have ∫ fn dµ = ∫ fn dν.

Now, for x ∉ (a, b), we have x ∉ [a + 1/n, b − 1/n] as well, so fn(x) = 0. And for x ∈ (a, b),
it holds that x ∈ [a + 1/n, b − 1/n] for n large enough, yielding fn(x) = 1 for n large enough.
Thus, lim_{n→∞} fn(x) = 1(a,b)(x) for all x ∈ R. By the dominated convergence theorem, we
then obtain

  µ((a, b)) = ∫ 1(a,b) dµ = lim_{n→∞} ∫ fn dµ = lim_{n→∞} ∫ fn dν = ∫ 1(a,b) dν = ν((a, b)),

and the result follows.

Lemma 3.1.5. Let (µn) be a sequence of probability measures on (R, B), and let µ and ν be
two other such probability measures. If µn →wk µ and µn →wk ν, then µ = ν.

Proof. For all f ∈ Cb(R), we obtain ∫ f dν = lim_{n→∞} ∫ f dµn = ∫ f dµ. In particular, this
holds for f ∈ Cbu(R). Therefore, by Lemma 3.1.4, it holds that ν = µ.

Lemma 3.1.5 shows that limits for weak convergence of probability measures are uniquely
determined. Note that this is not the case for convergence in distribution of variables. To
understand the issue, note that combining Lemma 3.1.2 and Lemma 3.1.5, we find that if
Xn →D X, then we also have Xn →D Y if and only if X and Y have the same distribution.
Thus, for example, if Xn →D X, where X is normally distributed with mean zero, then
Xn →D −X as well, since X and −X have the same distribution, even though it holds that
P(X = −X) = P(X = 0) = 0.

In order to show weak convergence of µn to µ, we need to prove lim_{n→∞} ∫ f dµn = ∫ f dµ for
all f ∈ Cb(R). A natural question is whether it suffices to prove this limit result for a smaller
class of mappings than Cb(R). We now show that it in fact suffices to consider elements of
Cbu(R). For f : R → R bounded, we denote by ‖f‖∞ the uniform norm of f, meaning that

‖f‖∞ = sup_{x∈R} |f(x)|. Before obtaining the result, we prove the following useful lemma.
Sequences of probability measures satisfying the property (3.1) referred to in the lemma are
said to be tight.

Lemma 3.1.6. Let (µn) be a sequence of probability measures on (R, B), and let µ be some
other probability measure. If lim_{n→∞} ∫ f dµn = ∫ f dµ for all f ∈ Cbu(R), it holds that

  lim_{M→∞} sup_{n≥1} µn([−M, M]c) = 0.    (3.1)

In particular, (3.1) holds if (µn) is weakly convergent.

Proof. Fix ε > 0. We will argue that there is M > 0 such that µn ([−M, M ]c ) ≤ ε for n ≥ 1.
To this end, let M ∗ > 0 be so large that µ([−M ∗ /2, M ∗ /2]c ) < ε. By Lemma 3.1.3, we find
that there exists a mapping g ∈ Cbu (R) with 1[−M ∗ /2,M ∗ /2] (x) ≤ g(x) ≤ 1(−M ∗ ,M ∗ ) (x) for
x ∈ R. With f = 1 − g, we then also obtain 1(−M ∗ ,M ∗ )c (x) ≤ f (x) ≤ 1[−M ∗ /2,M ∗ /2]c (x). As
f ∈ Cbu(R) as well, this yields

  lim sup_{n→∞} µn([−M∗, M∗]c) ≤ lim sup_{n→∞} ∫ 1(−M∗,M∗)c dµn ≤ lim sup_{n→∞} ∫ f dµn
    = ∫ f dµ ≤ ∫ 1[−M∗/2,M∗/2]c dµ = µ([−M∗/2, M∗/2]c) < ε,

and thus µn ([−M ∗ , M ∗ ]c ) < ε for n large enough, say n ≥ m. Now fix M1 , . . . , Mm > 0 such
that µn ([−Mn , Mn ]c ) < ε for n ≤ m. Putting M = max{M ∗ , M1 , . . . , Mm }, we obtain that
µn ([−M, M ]c ) ≤ ε for all n ≥ 1. This proves (3.1).

Theorem 3.1.7. Let (µn) be a sequence of probability measures on (R, B), and let µ be
some other probability measure. Then µn →wk µ if and only if lim_{n→∞} ∫ f dµn = ∫ f dµ for
f ∈ Cbu(R).

Proof. As Cbu(R) ⊆ Cb(R), it is immediate that if µn →wk µ, then lim_{n→∞} ∫ f dµn = ∫ f dµ
for f ∈ Cbu(R). We need to show the converse. Therefore, assume that for f ∈ Cbu(R),
we have lim_{n→∞} ∫ f dµn = ∫ f dµ. We wish to show that lim_{n→∞} ∫ f dµn = ∫ f dµ for all
f ∈ Cb(R).

Fix ε > 0. Using Lemma 3.1.6, take M > 0 such that µn([−M, M]c) ≤ ε for all n ≥ 1 and
such that µ([−M, M]c) ≤ ε as well. Note that for any h ∈ Cb(R) such that ‖h‖∞ ≤ ‖f‖∞ and

f(x) = h(x) for x ∈ [−M, M], we then have

  |∫ f dµ − ∫ h dµ| = |∫ f 1[−M,M]c dµ − ∫ h 1[−M,M]c dµ|
    ≤ |∫ f 1[−M,M]c dµ| + |∫ h 1[−M,M]c dµ|
    ≤ ‖f‖∞ µ([−M, M]c) + ‖h‖∞ µ([−M, M]c) ≤ 2ε‖f‖∞,    (3.2)

and similarly for µn instead of µ. To complete the proof, we now take f ∈ Cb(R); we wish
to show lim_{n→∞} ∫ f dµn = ∫ f dµ. To this end, we locate h ∈ Cbu(R) agreeing with f on
[−M, M] with ‖h‖∞ ≤ ‖f‖∞ and apply (3.2). Define h : R → R by putting

h(x) = f(−M) exp(M + x) for x < −M,
h(x) = f(x) for −M ≤ x ≤ M,
h(x) = f(M) exp(M − x) for x > M.

Then ‖h‖∞ ≤ ‖f‖∞. We wish to argue that h is uniformly continuous. Note that as
continuous functions are uniformly continuous on compact sets, f is uniformly continuous on
[−M, M], and thus h also is uniformly continuous on [−M, M]. Furthermore, for x, y > M,
the mean value theorem allows us to obtain

|h(x) − h(y)| ≤ |f (M )|| exp(M − x) − exp(M − y)| ≤ |f (M )||x − y|,

and similarly, |h(x) − h(y)| ≤ |f (−M )||x − y| for x, y < −M . We conclude that h is a con-
tinuous function which is uniformly continuous on (−∞, −M ), on [−M, M ] and on (M, ∞).
Hence, h is uniformly continuous on R. Furthermore, h agrees with f on [−M, M ]. Collecting
our conclusions, we now obtain by (3.2) that
  |∫ f dµn − ∫ f dµ| ≤ |∫ f dµn − ∫ h dµn| + |∫ h dµn − ∫ h dµ| + |∫ h dµ − ∫ f dµ|
    ≤ 4ε‖f‖∞ + |∫ h dµn − ∫ h dµ|,

leading to lim sup_{n→∞} |∫ f dµn − ∫ f dµ| ≤ 4ε‖f‖∞. As ε > 0 was arbitrary, this shows
lim_{n→∞} ∫ f dµn = ∫ f dµ, proving µn →wk µ.

Before turning to a few examples, we prove some additional basic results on weak convergence.
Lemma 3.1.8 and Lemma 3.1.9 give results which occasionally are useful for proving weak
convergence.

Lemma 3.1.8. Let (µn) be a sequence of probability measures on (R, B), and let µ be some
other probability measure on (R, B). Let h : R → R be a continuous mapping. If µn →wk µ, it
then also holds that h(µn) →wk h(µ).

Proof. Let f ∈ Cb(R). Then f ∘ h ∈ Cb(R) as well, and we obtain

  lim_{n→∞} ∫ f(x) dh(µn)(x) = lim_{n→∞} ∫ f(h(x)) dµn(x) = ∫ f(h(x)) dµ(x) = ∫ f(x) dh(µ)(x),

proving that h(µn) →wk h(µ), as desired.

Lemma 3.1.9 (Scheffé's lemma). Let (µn) be a sequence of probability measures on (R, B),
and let µ be another probability measure on (R, B). Assume that there is a measure ν such
that µn = gn · ν for n ≥ 1 and µ = g · ν. If lim_{n→∞} gn(x) = g(x) for ν-almost all x, then
µn →wk µ.

Proof. To prove the result, we first argue that lim_{n→∞} ∫ |gn − g| dν = 0. To this end, with
x+ = max{0, x} and x− = max{0, −x}, we first note that since both µn and µ are probability
measures, we have

  0 = ∫ gn dν − ∫ g dν = ∫ gn − g dν = ∫ (gn − g)+ dν − ∫ (gn − g)− dν,

which implies ∫ (gn − g)+ dν = ∫ (gn − g)− dν and therefore

  ∫ |gn − g| dν = ∫ (gn − g)+ dν + ∫ (gn − g)− dν = 2 ∫ (gn − g)− dν.

It therefore suffices to show that this latter tends to zero. To do so, note that

(gn − g)− (x) = max{0, −(gn (x) − g(x))} = max{0, g(x) − gn (x)} ≤ g(x), (3.3)

and furthermore, since x ↦ x− is continuous, (gn − g)− converges ν-almost everywhere to 0. As
0 ≤ (gn − g)− ≤ g by (3.3), and g is integrable with respect to ν, the dominated convergence
theorem yields lim_{n→∞} ∫ (gn − g)− dν = 0. Thus, we obtain lim_{n→∞} ∫ |gn − g| dν = 0. In
order to obtain the desired weak convergence from this, let f ∈ Cb(R); we then have

  lim sup_{n→∞} |∫ f(x) dµn(x) − ∫ f(x) dµ(x)| ≤ lim sup_{n→∞} ∫ |f(x)||gn(x) − g(x)| dν(x)
    ≤ ‖f‖∞ lim sup_{n→∞} ∫ |gn(x) − g(x)| dν(x) = 0,

proving lim_{n→∞} ∫ f dµn = ∫ f dµ and hence µn →wk µ.

Lemma 3.1.8 shows that weak convergence is preserved under continuous transformations,
a result similar in spirit to Lemma 1.2.6. Lemma 3.1.9 shows that for probability measures
which have densities with respect to the same common measure, almost sure convergence of

the densities is sufficient to obtain weak convergence. This is in several cases a very useful
observation.

This concludes our preliminary investigation of weak convergence of probability measures.


We end the section with some examples where weak convergence occurs naturally.

Example 3.1.10. Let (xn ) be a sequence of real numbers and consider the corresponding
Dirac measures (εxn ), that is, εxn is the probability measure which accords probability one
to the set {xn } and zero to all Borel subsets in the complement of {xn }. We claim that if xn
converges to x for some x ∈ R, then εxn converges weakly to εx . To see this, take f ∈ Cb (R).
By continuity, we then have

Z Z
lim f dεxn = lim f (xn ) = f (x) = f dεx ,
n→∞ n→∞

yielding weak convergence of εxn to εx . ◦

Example 3.1.11. Let µn be the uniform distribution on {0, 1/n, 2/n, . . . , (n − 1)/n}. We claim
that µn converges weakly to the uniform distribution on [0, 1]. To show this, let f ∈ Cb(R);
we then have ∫ f dµn = (1/n)∑_{k=1}^n f((k − 1)/n). Now define a mapping fn : [0, 1] → R by
putting fn(x) = ∑_{k=1}^n f((k − 1)/n) 1[(k−1)/n,k/n)(x); we then obtain ∫ f dµn = ∫_0^1 fn(x) dx. As
f is continuous, we have lim_{n→∞} fn(x) = f(x) for all 0 ≤ x < 1. As f is bounded, the
dominated convergence theorem then allows us to conclude that lim_{n→∞} ∫ f dµn = ∫_0^1 f(x) dx,
which shows that µn converges weakly to the uniform distribution on [0, 1]. ◦
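The claim of Example 3.1.11 can be illustrated numerically: the integral with respect to µn is a finite average over the grid points, and it should approach the Lebesgue integral over [0, 1]. A sketch, with the arbitrary test function f = cos taken from Cb(R):

```python
import math

def integral_mu_n(f, n):
    # Integral of f with respect to the uniform distribution on
    # {0, 1/n, ..., (n-1)/n}: an average of n point masses.
    return sum(f((k - 1) / n) for k in range(1, n + 1)) / n

# The integral of cos over [0, 1] with respect to Lebesgue measure is sin(1).
exact = math.sin(1.0)
for n in (10, 100, 1000):
    print(n, abs(integral_mu_n(math.cos, n) - exact))  # errors shrink with n
```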

Note that the measures (µn ) in Example 3.1.11 are discrete in nature, while the limit measure
is continuous in nature. This shows that qualities such as being discrete or continuous in
nature are not preserved by weak convergence.

Example 3.1.12. Let (ξn) and (σn) be two real sequences with limits ξ and σ, respectively,
where we assume that σ > 0. Let µn be the normal distribution with mean ξn and variance
σn². We claim that µn converges weakly to µ, where µ denotes the normal distribution with
mean ξ and variance σ². To demonstrate this result, define mappings gn for n ≥ 1 and g by
putting gn(x) = (σn√(2π))⁻¹ exp(−(x − ξn)²/(2σn²)) and g(x) = (σ√(2π))⁻¹ exp(−(x − ξ)²/(2σ²)).
Then µn has density gn with respect to the Lebesgue measure, and µ has density g with
respect to the Lebesgue measure. As gn converges pointwise to g, Lemma 3.1.9 shows that
µn converges weakly to µ, as desired. ◦
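Scheffé's lemma in fact gives convergence of the densities in L1, which is stronger than weak convergence. The sketch below (illustrative only; the choices ξn = 1/n and σn = 1 + 1/n and the midpoint-rule integration are assumptions made for the demonstration) shows the L1 distance to the standard normal density shrinking:

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def l1_distance(mean_n, sd_n, steps=20000, lo=-10.0, hi=10.0):
    # Midpoint-rule approximation of the integral of |g_n - g| over [lo, hi],
    # where g is the standard normal density.
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += abs(normal_pdf(x, mean_n, sd_n) - normal_pdf(x, 0.0, 1.0))
    return total * h

# With the illustrative choices mean 1/n and standard deviation 1 + 1/n,
# the L1 distance decreases towards zero as n grows.
for n in (1, 5, 50):
    print(n, l1_distance(1.0 / n, 1.0 + 1.0 / n))
```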

3.2 Weak convergence and distribution functions

In this section, we investigate the connection between weak convergence of probability mea-
sures and convergence of the corresponding cumulative distribution functions. We will show
that weak convergence is not in general equivalent to pointwise convergence of cumulative
distribution functions, but is equivalent to pointwise convergence on a dense subset of R.

Lemma 3.2.1. Let (µn) be a sequence of probability measures on (R, B), and let µ be some
other probability measure. Assume that µn has cumulative distribution function Fn for n ≥ 1,
and assume that µ has cumulative distribution function F. If µn →wk µ, then it holds that
lim_{n→∞} Fn(x) = F(x) whenever F is continuous at x.

Proof. Assume that µn →wk µ and let x be such that F is continuous at x. Let ε > 0. By
Lemma 3.1.3, there exists h ∈ Cb(R) such that 1[x−2ε,x−ε](y) ≤ h(y) ≤ 1(x−3ε,x)(y) for y ∈ R.
Putting f(y) = h(y) for y ≥ x − ε and f(y) = 1 for y < x − ε, we find that 0 ≤ f ≤ 1,
f(y) = 1 for y ≤ x − ε and f(y) = 0 for y > x. Thus, 1(−∞,x−ε](y) ≤ f(y) ≤ 1(−∞,x](y) for
y ∈ R. This implies F(x − ε) ≤ ∫ f dµ and ∫ f dµn ≤ Fn(x), from which we conclude

  F(x − ε) ≤ ∫ f dµ = lim_{n→∞} ∫ f dµn ≤ lim inf_{n→∞} Fn(x).    (3.4)

Similarly, there exists g ∈ Cb(R) such that 0 ≤ g ≤ 1, g(y) = 1 for y ≤ x and g(y) = 0 for
y > x + ε, implying Fn(x) ≤ ∫ g dµn and ∫ g dµ ≤ F(x + ε) and allowing us to obtain

  lim sup_{n→∞} Fn(x) ≤ lim_{n→∞} ∫ g dµn = ∫ g dµ ≤ F(x + ε).    (3.5)

Combining (3.4) and (3.5), we conclude that for all ε > 0, it holds that

  F(x − ε) ≤ lim inf_{n→∞} Fn(x) ≤ lim sup_{n→∞} Fn(x) ≤ F(x + ε).

Since F is continuous at x, we may now let ε tend to zero and obtain that lim inf_{n→∞} Fn(x)
and lim sup_{n→∞} Fn(x) are equal, and the common value is F(x). Therefore, Fn(x) converges
and lim_{n→∞} Fn(x) = F(x). This completes the proof.

The following example shows that in general, weak convergence does not imply convergence
of the cumulative distribution functions in all points. After the example, we prove the gen-
eral result on the correspondence between weak convergence and convergence of cumulative
distribution functions.

Example 3.2.2. For each n ≥ 1, let µn be the Dirac measure at 1/n, and let µ be the Dirac
measure at 0. According to Example 3.1.10, µn →wk µ. But with Fn being the cumulative
distribution function for µn and with F being the cumulative distribution function for µ, we
have Fn(0) = 0 for all n ≥ 1, while F(0) = 1, so that lim_{n→∞} Fn(0) ≠ F(0). ◦
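The failure of pointwise convergence at the discontinuity point can be made concrete in a few lines (an illustrative sketch of Example 3.2.2, not part of the formal development):

```python
def F_n(x, n):
    # CDF of the Dirac measure at 1/n.
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    # CDF of the Dirac measure at 0.
    return 1.0 if x >= 0.0 else 0.0

# At the discontinuity point x = 0 of F, pointwise convergence fails:
print([F_n(0.0, n) for n in (1, 10, 100)], "but F(0) =", F(0.0))
# At a continuity point such as x = 0.05, F_n(x) eventually equals F(x) = 1:
print([F_n(0.05, n) for n in (1, 10, 100)])
```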

Theorem 3.2.3. Let (µn) be a sequence of probability measures on (R, B), and let µ be some
other probability measure. Assume that µn has cumulative distribution function Fn for n ≥ 1,
and assume that µ has cumulative distribution function F. Then µn →wk µ if and only if
there exists a dense subset A of R such that lim_{n→∞} Fn(x) = F(x) for x ∈ A.

Proof. First assume that µn →wk µ; we wish to identify a dense subset of R such that we
have pointwise convergence of the cumulative distribution functions on this set. Let B be
the set of discontinuity points of F. Then B is countable, so Bc is dense in R. By Lemma
3.2.1, lim_{n→∞} Fn(x) = F(x) whenever x ∈ Bc, and so Bc satisfies the requirements.

Next, assume that there exists a dense subset A of R such that lim_{n→∞} Fn(x) = F(x) for
x ∈ A. We wish to show that µn →wk µ. To this end, let f ∈ Cb(R); we need to prove
lim_{n→∞} ∫ f dµn = ∫ f dµ. Fix ε > 0. Recall that F(x) tends to zero and one as x tends
to minus infinity and infinity, respectively. Therefore, we may find a, b ∈ A with a < b such
that lim_{n→∞} Fn(a) = F(a), lim_{n→∞} Fn(b) = F(b), F(a) < ε and F(b) > 1 − ε. For n large
enough, we then also obtain Fn(a) < ε and Fn(b) > 1 − ε. Applying these properties, we
obtain
  lim sup_{n→∞} |∫ f 1(−∞,a] dµn − ∫ f 1(−∞,a] dµ| ≤ lim sup_{n→∞} ‖f‖∞ (µn((−∞, a]) + µ((−∞, a]))
    = lim sup_{n→∞} ‖f‖∞ (Fn(a) + F(a)) ≤ 2ε‖f‖∞

and similarly,

  lim sup_{n→∞} |∫ f 1(b,∞) dµn − ∫ f 1(b,∞) dµ| ≤ lim sup_{n→∞} ‖f‖∞ (µn((b, ∞)) + µ((b, ∞)))
    = lim sup_{n→∞} ‖f‖∞ (1 − Fn(b) + (1 − F(b))) ≤ 2ε‖f‖∞.

As a consequence, we obtain the bound

  lim sup_{n→∞} |∫ f dµn − ∫ f dµ| ≤ 4ε‖f‖∞ + lim sup_{n→∞} |∫ f 1(a,b] dµn − ∫ f 1(a,b] dµ|.    (3.6)

Now, f is uniformly continuous on [a, b]. Pick δ > 0 parrying ε for this uniform continuity.
Using that A is dense in R, pick a partition a = x0 < x1 < · · · < xm = b of elements in

A such that |xi − xi−1 | ≤ δ for all i ≤ m. We then have |f (x) − f (xi−1 )| ≤ ε whenever
xi−1 < x ≤ xi , and so
  |∫ f 1(a,b] dµn − ∑_{i=1}^m f(xi−1) ∫ 1(xi−1,xi] dµn| = |∑_{i=1}^m ∫ (f(x) − f(xi−1)) 1(xi−1,xi](x) dµn(x)|
    ≤ ∑_{i=1}^m ∫ |f(x) − f(xi−1)| 1(xi−1,xi](x) dµn(x)
    ≤ ε µn((a, b]) ≤ ε,

leading to |∫ f 1(a,b] dµn − ∑_{i=1}^m f(xi−1)(Fn(xi) − Fn(xi−1))| ≤ ε. By a similar argument, we
obtain the same bound with µ instead of µn. Combining these conclusions, we obtain
  |∫ f 1(a,b] dµn − ∫ f 1(a,b] dµ| ≤ 2ε + |∑_{i=1}^m f(xi−1)(Fn(xi) − Fn(xi−1) − (F(xi) − F(xi−1)))|
    ≤ 2ε + ‖f‖∞ ∑_{i=1}^m (|Fn(xi) − F(xi)| + |Fn(xi−1) − F(xi−1)|).

As each xi is in A, this yields

  lim sup_{n→∞} |∫ f 1(a,b] dµn − ∫ f 1(a,b] dµ| ≤ 2ε.    (3.7)

Combining (3.6) and (3.7), we obtain lim sup_{n→∞} |∫ f dµn − ∫ f dµ| ≤ 4ε‖f‖∞ + 2ε. As
ε > 0 was arbitrary, this yields lim_{n→∞} ∫ f dµn = ∫ f dµ. As a consequence, µn →wk µ, as
was to be shown.

3.3 Weak convergence and convergence in probability

In this section, we will investigate the connections between convergence in distribution of


random variables and convergence in probability. In general, these two modes of convergence
do not work well together, but in the case where we have convergence in distribution of one
sequence and convergence in probability towards a constant of another, we may obtain useful
results. As we in this section work with random variables instead of measures, we assume
given some background probability space (Ω, F, P ). In general, given two sequences (Xn ) and
(Yn ), our results in this section will only involve the distributions of (Xn , Yn ), and in principle
each (Xn , Yn ) could be defined on separate probability spaces (Ωn , Fn , Pn ). This, however, is
mostly a theoretical distinction and is of little practical importance. When formulating our
results, we will for clarity in general not mention the background probability space explicitly.

We begin with a simple equivalence. Note that for x ∈ R, statements such as Xn →P x and
Xn →D x are equivalent to the statements that lim_{n→∞} P(|Xn − x| > ε) = 0 for ε > 0 and
lim_{n→∞} Ef(Xn) = f(x) for f ∈ Cb(R), respectively, and so in terms of stochasticity depend
only on the distributions of Xn for each n ≥ 1. In the case of convergence in probability, this
is not the typical situation, as we in general have that Xn →P X is a statement depending
on the multivariate distributions of (Xn, X).

Lemma 3.3.1. Let (Xn) be a sequence of random variables, and let x ∈ R. Then Xn →P x
if and only if Xn →D x.

Proof. By Theorem 1.2.8, we know that if Xn →P x, then Xn →D x as well. In order to
prove the converse, assume that Xn →D x; we wish to show that Xn →P x. Take ε > 0. By
Lemma 3.1.3, there exists g ∈ Cb(R) such that 1[x−ε/2,x+ε/2](y) ≤ g(y) ≤ 1(x−ε,x+ε)(y) for
y ∈ R. With f = 1 − g, we then also obtain 1(x−ε,x+ε)c(y) ≤ f(y) ≤ 1[x−ε/2,x+ε/2]c(y), and
in particular, f(x) = 0. By weak convergence, we may then conclude

  lim sup_{n→∞} P(|Xn − x| ≥ ε) = lim sup_{n→∞} E 1(x−ε,x+ε)c(Xn) ≤ lim sup_{n→∞} Ef(Xn) = f(x) = 0,

so Xn →P x, as desired.

Lemma 3.3.2 (Slutsky's lemma). Let (Xn, Yn) be a sequence of random variables, and let
X be some other variable. If Xn →D X and Yn →P 0, then Xn + Yn →D X.

Proof. Applying Theorem 3.1.7, we find that in order to obtain the result, it suffices to prove
that lim_{n→∞} Ef(Xn + Yn) = Ef(X) for f ∈ Cbu(R). Fix f ∈ Cbu(R). Note that

  lim sup_{n→∞} |Ef(Xn + Yn) − Ef(X)|
    ≤ lim sup_{n→∞} |Ef(Xn + Yn) − Ef(Xn)| + lim sup_{n→∞} |Ef(Xn) − Ef(X)|
    = lim sup_{n→∞} |Ef(Xn + Yn) − Ef(Xn)|,    (3.8)

so it suffices to show that the latter is zero. To this end, take ε > 0, and pick δ > 0 parrying
ε for the uniform continuity of f. We then obtain in particular that for x ∈ R and |y| ≤ δ,
|f(x + y) − f(x)| ≤ ε. We then obtain

  |Ef(Xn + Yn) − Ef(Xn)| ≤ E|f(Xn + Yn) − f(Xn)| ≤ ε + 2‖f‖∞ P(|Yn| > δ),

and as Yn →P 0, this implies lim sup_{n→∞} |Ef(Xn + Yn) − Ef(Xn)| ≤ ε. As ε > 0 was
arbitrary, this yields lim sup_{n→∞} |Ef(Xn + Yn) − Ef(Xn)| = 0. Combining this with (3.8),
we obtain lim_{n→∞} Ef(Xn + Yn) = Ef(X), as desired.
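A Monte Carlo sketch of Slutsky's lemma (illustrative only; the choices of Xn as standard normal, Yn as uniform noise of width 2/n, and the sample sizes are arbitrary):

```python
import math
import random

random.seed(0)
n, m = 200, 100000  # sequence index and Monte Carlo sample size
# X_n standard normal; Y_n uniform on [-1/n, 1/n], so Y_n -> 0 in probability.
sums = [random.gauss(0.0, 1.0) + random.uniform(-1.0 / n, 1.0 / n)
        for _ in range(m)]

def empirical_cdf(samples, x):
    return sum(s <= x for s in samples) / len(samples)

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The empirical CDF of X_n + Y_n should be close to the standard normal CDF.
for x in (-1.0, 0.0, 1.0):
    print(x, abs(empirical_cdf(sums, x) - Phi(x)))
```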

Theorem 3.3.3. Let (Xn, Yn) be a sequence of random variables, let X be some other variable
and let y ∈ R. Let h : R² → R be a continuous mapping. If Xn →D X and Yn →P y,
then h(Xn, Yn) →D h(X, y).

Proof. First note that h(Xn, Yn) = h(Xn, Yn) − h(Xn, y) + h(Xn, y). Define hy : R → R
by hy(x) = h(x, y). The distribution of h(Xn, y) is then hy(Xn(P)) and the distribution
of h(X, y) is hy(X(P)). As we have assumed that Xn →D X, Lemma 3.1.2 yields that
Xn(P) →wk X(P). Therefore, as hy is continuous, hy(Xn(P)) →wk hy(X(P)) by Lemma
3.1.8, which by Lemma 3.1.2 implies that h(Xn, y) →D h(X, y). Therefore, by Lemma 3.3.2,
it suffices to prove that h(Xn, Yn) − h(Xn, y) converges in probability to zero.

To this end, let ε > 0; we have to show lim_{n→∞} P(|h(Xn, Yn) − h(Xn, y)| > ε) = 0. We have
assumed that h is continuous. Equipping R² with the metric d : R² × R² → [0, ∞) given by
d((a1, a2), (b1, b2)) = |a1 − b1| + |a2 − b2|, h is in particular continuous with respect to this
metric on R². Now let M, η > 0 and note that h is uniformly continuous on the compact set
[−M, M] × [y − η, y + η]. Therefore, we may pick δ > 0 parrying ε for this uniform continuity,
and we may assume that δ ≤ η. We then have

  (|Xn| ≤ M) ∩ (|Yn − y| ≤ δ) ⊆ (|Xn| ≤ M) ∩ (|Yn − y| ≤ δ) ∩ (d((Xn, Yn), (Xn, y)) ≤ δ)
    ⊆ (|h(Xn, Yn) − h(Xn, y)| ≤ ε),

which yields (|h(Xn, Yn) − h(Xn, y)| > ε) ⊆ (|Xn| > M) ∪ (|Yn − y| > δ) and thus

  lim sup_{n→∞} P(|h(Xn, Yn) − h(Xn, y)| > ε) ≤ lim sup_{n→∞} P(|Xn| > M) + P(|Yn − y| > δ)
    ≤ sup_{n≥1} P(|Xn| > M).

By Lemma 3.1.6, the latter tends to zero as M tends to infinity. We therefore conclude
that lim supn→∞ P (|h(Xn , Yn ) − h(Xn , y)| > ε) = 0, so h(Xn , Yn ) − h(Xn , y) converges in
probability to zero and the result follows.
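Theorem 3.3.3 can likewise be illustrated by simulation, here with the arbitrary choices h(x, y) = xy, Xn standard normal and Yn → 2 in probability, so that h(Xn, Yn) should approximately follow h(X, 2) = 2X, that is, N(0, 4) (an illustrative Monte Carlo sketch, not part of the formal development):

```python
import random

random.seed(1)
n, m = 500, 100000
# X_n standard normal; Y_n = 2 plus uniform noise of width 2/n, so that
# Y_n -> 2 in probability; h(x, y) = x * y is continuous.
prods = [random.gauss(0.0, 1.0) * (2.0 + random.uniform(-1.0 / n, 1.0 / n))
         for _ in range(m)]

mean = sum(prods) / m
var = sum((p - mean) ** 2 for p in prods) / m
print(mean, var)  # close to 0 and 4, the mean and variance of 2X
```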

3.4 Weak convergence and characteristic functions

Let C denote the complex numbers. In this section, we will associate to each probability
measure µ on (R, B) a mapping ϕ : R → C, called the characteristic function of µ. We will see
that the characteristic function determines the probability measure uniquely, in the sense that
two probability measures with equal characteristic functions in fact are equal. Furthermore,
we will show, and this will be the main result of the section, that weak convergence of
probability measures is equivalent to pointwise convergence of characteristic functions. As
characteristic functions in general are pleasant to work with, both from theoretical and
practical viewpoints, this result is of considerable use.

Before we introduce the characteristic function, we recall some results from complex analysis.
For z ∈ C, we let Re(z) and Im(z) denote the real and imaginary parts of z, and with i denoting
the imaginary unit, we always have z = Re(z) + i Im(z). Re and Im are then mappings from C
to R. Also, for z ∈ C with z = Re(z) + i Im(z), we define z̄ = Re(z) − i Im(z) and refer to z̄ as
the complex conjugate of z.

Also recall that we may define the complex exponential by its Taylor series, putting

  e^z = ∑_{n=0}^∞ z^n/n!

for any z ∈ C, where the series is absolutely convergent. We then also obtain the relationship
e^{iz} = cos z + i sin z, where the complex cosine and the complex sine functions are defined by
their Taylor series,

  cos z = ∑_{n=0}^∞ (−1)^n z^{2n}/(2n)!   and   sin z = ∑_{n=0}^∞ (−1)^n z^{2n+1}/(2n + 1)!.

In particular, for t ∈ R, we obtain e^{it} = cos t + i sin t, where cos t and sin t here are the
ordinary real cosine and sine functions. This shows that the complex exponential of a purely
imaginary argument yields a point on the unit circle corresponding to an angle of t measured
in radians.

Let (E, E, µ) be a measure space and let f : E → C be a complex valued function defined on
E. Then f(z) = Re(f(z)) + i Im(f(z)). We refer to the mappings z ↦ Re(f(z)) and z ↦ Im(f(z))
as the real and imaginary parts of f, and denote them by Re f and Im f, respectively. Endowing
C with the σ-algebra BC generated by the open sets, it also holds that BC is the smallest
σ-algebra making Re and Im measurable. We then obtain that for any f : E → C, f is E-BC
measurable if and only if both the real and imaginary parts of f are E-B measurable.

Definition 3.4.1. Let (E, E, µ) be a measure space. A measurable function f : E → C is
said to be integrable if both Re f and Im f are integrable, and in the affirmative, the integral of
f is defined by ∫ f dµ = ∫ Re f dµ + i ∫ Im f dµ.

The space of integrable complex functions is denoted LC(E, E, µ) or simply LC. Note that we
have the inequalities |Re f| ≤ |f|, |Im f| ≤ |f| and |f| ≤ |Re f| + |Im f|. Therefore, f is integrable
if and only if |f| is integrable.

Example 3.4.2. Let γ ≠ 0 be a real number. Since |e^{iγt}| = 1 for all t ∈ R, t ↦ e^{iγt} is
integrable with respect to the Lebesgue measure on all compact intervals [a, b]. As it holds
that e^{iγt} = cos γt + i sin γt, we obtain

  ∫_a^b e^{iγt} dt = ∫_a^b cos γt dt + i ∫_a^b sin γt dt
    = (sin γb − sin γa)/γ + i (−cos γb + cos γa)/γ
    = (−i/γ)(cos γb + i sin γb − cos γa − i sin γa) = (e^{iγb} − e^{iγa})/(iγ),

extending the results for the real exponential function to the complex case. ◦
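The closed form derived in Example 3.4.2 can be sanity-checked against a direct numerical approximation of the integral (an illustrative sketch; the values of γ, a and b are arbitrary):

```python
import cmath

def numeric_integral(gamma, a, b, steps=100000):
    # Midpoint-rule approximation of the integral of e^{i*gamma*t} over [a, b].
    h = (b - a) / steps
    return sum(cmath.exp(1j * gamma * (a + (j + 0.5) * h))
               for j in range(steps)) * h

def closed_form(gamma, a, b):
    # The formula derived in Example 3.4.2.
    return (cmath.exp(1j * gamma * b) - cmath.exp(1j * gamma * a)) / (1j * gamma)

gamma, a, b = 2.5, -1.0, 3.0
print(abs(numeric_integral(gamma, a, b) - closed_form(gamma, a, b)))  # near zero
```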

Lemma 3.4.3. Let (E, E, µ) be a measure space. If f, g ∈ LC and z, w ∈ C, it then holds
that zf + wg ∈ LC and ∫ zf + wg dµ = z ∫ f dµ + w ∫ g dµ.

Proof. We first show that for f integrable and z ∈ C, it holds that zf is integrable and
∫ zf dµ = z ∫ f dµ. First off, note that ∫ |zf| dµ = ∫ |z||f| dµ = |z| ∫ |f| dµ < ∞, so zf is
integrable. Furthermore,

  ∫ zf dµ = ∫ (Re(z) + i Im(z))(Re f + i Im f) dµ
    = ∫ Re(z) Re f − Im(z) Im f + i(Re(z) Im f + Im(z) Re f) dµ
    = ∫ Re(z) Re f − Im(z) Im f dµ + i ∫ Re(z) Im f + Im(z) Re f dµ
    = Re(z) ∫ Re f dµ − Im(z) ∫ Im f dµ + i (Re(z) ∫ Im f dµ + Im(z) ∫ Re f dµ)
    = (Re(z) + i Im(z)) (∫ Re f dµ + i ∫ Im f dµ) = z ∫ f dµ,

as desired. Next, we show that for f, g ∈ LC, f + g ∈ LC and ∫ f + g dµ = ∫ f dµ + ∫ g dµ.
First, as |f + g| ≤ |f| + |g|, it follows that f + g ∈ LC. In order to obtain the desired identity
First, as |f + g| ≤ |f | + |g|, it follows that f + g ∈ LC . In order to obtain the desired identity
74 Weak convergence

for the integrals, we note that


Z Z Z
f + g dµ = <(f + g) dµ + i =(f + g) dµ
Z Z
= <f + <g dµ + i =f + =g dµ
Z Z Z Z Z Z
= <f + i =f dµ + <g dµ + i =g dµ = f dµ + g dµ,

as desired. Collecting our conclusions, we obtain the desired result.


Lemma 3.4.4. Let (E, E, µ) be a measure space. If f ∈ LC, then |∫ f dµ| ≤ ∫ |f| dµ.

Proof. Recall that for z ∈ C, there exists θ ∈ R such that z = |z|e^{iθ}. Applying this to the
integral ∫ f dµ, we obtain |∫ f dµ| = e^{−iθ} ∫ f dµ = ∫ e^{−iθ} f dµ by Lemma 3.4.3. As the left
hand side is real, the right hand side must be real as well. Hence, we obtain

  |∫ f dµ| = Re(∫ e^{−iθ} f dµ) = ∫ Re(e^{−iθ} f) dµ
    ≤ ∫ |Re(e^{−iθ} f)| dµ ≤ ∫ |e^{−iθ} f| dµ = ∫ |f| dµ,

as desired.

Next, we state versions of the dominated convergence theorem and Fubini’s theorem for
complex mappings.
Theorem 3.4.5. Let (E, E, µ) be a measure space, and let (fn ) be a sequence of measurable
mappings from E to C. Assume that the sequence (fn ) converges µ-almost everywhere to
some mapping f . Assume that there exists a measurable, integrable mapping g : E → [0, ∞)
such that |fn | ≤ g µ-almost everywhere for all n. Then fn is integrable for all n ≥ 1, f is
measurable and integrable, and
  lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.

Proof. As fn converges µ-almost everywhere to f, we find that Re fn converges µ-almost
everywhere to Re f and Im fn converges µ-almost everywhere to Im f. Furthermore, we have
|Re fn| ≤ g and |Im fn| ≤ g µ-almost everywhere. Therefore, the dominated convergence
theorem for real-valued mappings yields

  lim_{n→∞} ∫ fn dµ = lim_{n→∞} ∫ Re fn dµ + i lim_{n→∞} ∫ Im fn dµ
    = ∫ lim_{n→∞} Re fn dµ + i ∫ lim_{n→∞} Im fn dµ = ∫ lim_{n→∞} fn dµ,

as desired.

Theorem 3.4.6. Let (E, E, µ) and (F, F, ν) be two σ-finite measure spaces, and assume that
f : E × F → C is E ⊗ F measurable and µ ⊗ ν integrable. Then y 7→ f (x, y) is integrable with
respect to ν for µ-almost all x, the set where this is the case is measurable, and it holds that
  ∫ f(x, y) d(µ ⊗ ν)(x, y) = ∫∫ f(x, y) dν(y) dµ(x).

Proof. As f is µ ⊗ ν integrable, both Re f and Im f are µ ⊗ ν integrable as well. Therefore,
the Fubini theorem for real-valued mappings yields that y ↦ Re f(x, y) and y ↦ Im f(x, y)
are integrable for µ-almost all x, and the sets where this is the case are measurable. As a
consequence, the set where y ↦ f(x, y) is integrable is measurable and is a µ-almost sure
set. The Fubini theorem for real-valued mappings furthermore yields that

  ∫ f(x, y) d(µ ⊗ ν)(x, y) = ∫ Re f(x, y) d(µ ⊗ ν)(x, y) + i ∫ Im f(x, y) d(µ ⊗ ν)(x, y)
    = ∫∫ Re f(x, y) dν(y) dµ(x) + i ∫∫ Im f(x, y) dν(y) dµ(x)
    = ∫ (∫ Re f(x, y) dν(y) + i ∫ Im f(x, y) dν(y)) dµ(x)
    = ∫∫ f(x, y) dν(y) dµ(x),

as was to be proven.

We are now ready to introduce the characteristic function of a probability measure on (R, B).

Definition 3.4.7. Let µ be a probability measure on (R, B). The characteristic function for
µ is the function ϕ : R → C defined by ϕ(θ) = ∫ e^{iθx} dµ(x).

Since |e^{iθx}| = 1 for all values of θ and x, the integral ∫ e^{iθx} dµ(x) in Definition 3.4.7 is
always well-defined. For a random variable X with distribution µ, we also introduce the
characteristic function ϕ of X as the characteristic function of µ. The characteristic function
ϕ of X may then also be expressed as

  ϕ(θ) = ∫ e^{iθx} dµ(x) = ∫ e^{iθx} dX(P)(x) = ∫ e^{iθX(ω)} dP(ω) = E e^{iθX}.
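The expression ϕ(θ) = Ee^{iθX} suggests a simple numeric approximation: average e^{iθx} over simulated draws of X. For a standard normal X, the characteristic function is known to equal exp(−θ²/2), which the sketch below uses as a reference value (illustrative Monte Carlo; the seed and sample size are arbitrary choices):

```python
import cmath
import math
import random

random.seed(2)

def empirical_cf(samples, theta):
    # Empirical characteristic function: average of e^{i*theta*x} over samples.
    return sum(cmath.exp(1j * theta * x) for x in samples) / len(samples)

xs = [random.gauss(0.0, 1.0) for _ in range(50000)]
for theta in (0.0, 0.5, 1.0, 2.0):
    target = math.exp(-theta ** 2 / 2.0)  # characteristic function of N(0, 1)
    print(theta, abs(empirical_cf(xs, theta) - target))  # small errors
```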

The following lemmas demonstrate some basic properties of characteristic functions.



Lemma 3.4.8. Let µ be a probability measure on (R, B) and assume that ϕ is the
characteristic function of µ. The mapping ϕ has the following properties.

(1). ϕ(0) = 1.

(2). For all θ ∈ R, |ϕ(θ)| ≤ 1.

(3). For all θ ∈ R, ϕ(−θ) is the complex conjugate of ϕ(θ).

(4). ϕ is uniformly continuous.

Furthermore, for n ≥ 1, if ∫ |x|^n dµ(x) is finite, then ϕ is n times continuously differentiable,
and with ϕ^(n) denoting the n'th derivative, we have ϕ^(n)(θ) = i^n ∫ x^n e^{iθx} dµ(x). In particular,
in the affirmative, ϕ^(n)(0) = i^n ∫ x^n dµ(x).


R

Proof. The first claim follows as $\varphi(0) = \int_{\mathbb{R}} e^{i0x}\,d\mu(x) = \mu(\mathbb{R}) = 1$, and the second claim
follows as $|\varphi(\theta)| = |\int e^{i\theta x}\,d\mu(x)| \le \int |e^{i\theta x}|\,d\mu(x) = 1$. Also, the third claim follows since

$$\begin{aligned}
\varphi(-\theta) &= \int e^{i(-\theta)x}\,d\mu(x) = \int \cos(-\theta x) + i\sin(-\theta x)\,d\mu(x) \\
&= \int \cos(-\theta x)\,d\mu(x) + i\int \sin(-\theta x)\,d\mu(x) \\
&= \int \cos(\theta x)\,d\mu(x) - i\int \sin(\theta x)\,d\mu(x) = \overline{\varphi(\theta)}.
\end{aligned}$$

To obtain the fourth claim, let θ ∈ R and let h > 0. We then have

$$\begin{aligned}
|\varphi(\theta+h) - \varphi(\theta)| &= \left|\int e^{i(\theta+h)x} - e^{i\theta x}\,d\mu(x)\right| = \left|\int e^{i\theta x}(e^{ihx}-1)\,d\mu(x)\right| \\
&\le \int |e^{i\theta x}||e^{ihx}-1|\,d\mu(x) = \int |e^{ihx}-1|\,d\mu(x),
\end{aligned}$$

where $\lim_{h\to 0}\int |e^{ihx}-1|\,d\mu(x) = 0$ by the dominated convergence theorem. In order to
use this to obtain uniform continuity, let ε > 0. Choose δ > 0 so that for 0 ≤ h ≤ δ,
$\int |e^{ihx}-1|\,d\mu(x) \le \varepsilon$. We then find for any x, y ∈ R with x < y and |x − y| ≤ δ that
|ϕ(y) − ϕ(x)| = |ϕ(x + (y − x)) − ϕ(x)| ≤ ε, and as a consequence, we find that ϕ is uniformly
continuous.

Next, we prove the results on the derivative. We apply an induction argument, and wish
to show for n ≥ 0 that if $\int |x|^n\,d\mu(x)$ is finite, then ϕ is n times continuously differentiable
with $\varphi^{(n)}(\theta) = i^n \int x^n e^{i\theta x}\,d\mu(x)$. Noting that the induction start holds, it suffices to prove the
induction step. Assume that the result holds for n; we wish to prove it for n + 1. Assume
that $\int |x|^{n+1}\,d\mu(x)$ is finite. Fix θ ∈ R and h > 0. We then have

$$\begin{aligned}
\frac{\varphi^{(n)}(\theta+h) - \varphi^{(n)}(\theta)}{h}
&= \frac{1}{h}\left(i^n \int x^n e^{i(\theta+h)x}\,d\mu(x) - i^n \int x^n e^{i\theta x}\,d\mu(x)\right) \\
&= i^n \int x^n e^{i\theta x}\,\frac{e^{ihx}-1}{h}\,d\mu(x). \qquad (3.9)
\end{aligned}$$

We wish to apply the dominated convergence theorem to calculate the limit of the above as
h tends to zero. First note that by l'Hôpital's rule, we have

$$\lim_{h\to 0} \frac{e^{ihx}-1}{h} = \lim_{h\to 0} \frac{\cos hx - 1}{h} + i\,\frac{\sin hx}{h} = \lim_{h\to 0} -x\sin hx + ix\cos hx = ix,$$

so the integrand in the final expression of (3.9) converges pointwise to $i x^{n+1} e^{i\theta x}$. Note
furthermore that since $|\cos x - 1| = |\int_0^x \sin y\,dy| \le |x|$ and $|\sin x| = |\int_0^x \cos y\,dy| \le |x|$ for
all x ∈ R, we have

$$\left|\frac{e^{ihx}-1}{h}\right| = \left|\frac{\cos hx - 1}{h} + i\,\frac{\sin hx}{h}\right| \le 2|x|,$$

yielding that $|x^n e^{i\theta x}\frac{e^{ihx}-1}{h}| \le 2|x|^{n+1}$. As we have assumed that the latter is integrable with
respect to µ, the dominated convergence theorem applies and allows us to conclude that

$$\begin{aligned}
\lim_{h\to 0} \frac{\varphi^{(n)}(\theta+h) - \varphi^{(n)}(\theta)}{h}
&= \lim_{h\to 0} i^n \int x^n e^{i\theta x}\,\frac{e^{ihx}-1}{h}\,d\mu(x) \\
&= i^n \int x^n e^{i\theta x} \lim_{h\to 0} \frac{e^{ihx}-1}{h}\,d\mu(x) \\
&= i^{n+1} \int x^{n+1} e^{i\theta x}\,d\mu(x),
\end{aligned}$$

as desired. This proves that ϕ is n + 1 times differentiable, and yields the desired expression
for $\varphi^{(n+1)}$. By another application of the dominated convergence theorem, we also obtain
that $\varphi^{(n+1)}$ is continuous. This completes the induction proof. As a consequence of this
latter result, it also follows that when $\int |x|^n\,d\mu(x)$ is finite, $\varphi^{(n)}(0) = i^n \int x^n\,d\mu(x)$. This
completes the proof of the lemma.
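The defining formula $\varphi(\theta) = E e^{i\theta X}$ and properties (1)–(3) of Lemma 3.4.8 are easy to check numerically on a simulated sample. The following sketch is our own illustration, not part of the text; the helper name `ecf` (empirical characteristic function) and the choice of an exponential sample are arbitrary.

```python
import cmath
import random

def ecf(xs, theta):
    """Empirical characteristic function: the average of exp(i*theta*x) over the sample."""
    return sum(cmath.exp(1j * theta * x) for x in xs) / len(xs)

rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(10_000)]

one_at_zero = ecf(xs, 0.0)                          # property (1): equals 1
modulus = abs(ecf(xs, 2.3))                         # property (2): at most 1
pair = (ecf(xs, -1.7), ecf(xs, 1.7).conjugate())    # property (3): phi(-t) is the conjugate of phi(t)
```

By the law of large numbers, `ecf(xs, theta)` also approximates the true characteristic function 1/(1 − iθ) of the standard exponential distribution derived in Example 3.4.11.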

Lemma 3.4.9. Assume that X is a random variable with characteristic function ϕ, and let
α, β ∈ R. The variable α + βX has characteristic function φ given by φ(θ) = eiθα ϕ(βθ).

Proof. Noting that φ(θ) = Eeiθ(α+βX) = eiθα EeiβθX = eiθα ϕ(βθ), the result follows.

Next, we show by example how to calculate the characteristic functions of a few distributions.

Example 3.4.10. Let ϕ be the characteristic function of the standard normal distribution;
we wish to obtain a closed-form expression for ϕ. We will do this by proving that ϕ satisfies
a particular differential equation. To this end, let f be the density of the standard normal
distribution, $f(x) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2)$. Note that for any θ ∈ R, we have by Lemma 3.4.8 that

$$\varphi(\theta) = \int_{-\infty}^{\infty} e^{i\theta x} f(x)\,dx = \int_{-\infty}^{\infty} e^{-i\theta x} f(-x)\,dx = \int_{-\infty}^{\infty} e^{-i\theta x} f(x)\,dx = \varphi(-\theta) = \overline{\varphi(\theta)}.$$

As a consequence, $\Im\varphi(\theta) = 0$, so $\varphi(\theta) = \int_{-\infty}^{\infty} \cos(\theta x) f(x)\,dx$. Next, note that

$$\left|\frac{d}{d\theta} \cos(\theta x) f(x)\right| = |-x\sin(\theta x) f(x)| \le |x| f(x),$$

which is integrable with respect to the Lebesgue measure. Therefore, ϕ(θ) is differentiable
for all θ ∈ R, and the derivative may be calculated by an exchange of limits. Recalling that
f′(x) = −x f(x), we obtain

$$\varphi'(\theta) = \frac{d}{d\theta}\int_{-\infty}^{\infty} \cos(\theta x) f(x)\,dx = \int_{-\infty}^{\infty} \frac{d}{d\theta}\cos(\theta x) f(x)\,dx
= -\int_{-\infty}^{\infty} x\sin(\theta x) f(x)\,dx = \int_{-\infty}^{\infty} \sin(\theta x) f'(x)\,dx.$$

Partial integration then yields

$$\begin{aligned}
\varphi'(\theta) &= \lim_{M\to\infty} \int_{-M}^{M} \sin(\theta x) f'(x)\,dx \\
&= \lim_{M\to\infty} \sin(\theta M) f(M) - \sin(-\theta M) f(-M) - \theta \int_{-M}^{M} \cos(\theta x) f(x)\,dx \\
&= -\lim_{M\to\infty} \theta \int_{-M}^{M} \cos(\theta x) f(x)\,dx = -\theta\varphi(\theta),
\end{aligned}$$

since $\lim_{M\to\infty} f(M) = \lim_{M\to\infty} f(-M) = 0$. Thus, ϕ satisfies ϕ′(θ) = −θϕ(θ). All the
solutions to this differential equation are of the form $\theta \mapsto c\exp(-\frac{1}{2}\theta^2)$ for some c ∈ R, so we
conclude that there exists c ∈ R such that $\varphi(\theta) = c\exp(-\frac{1}{2}\theta^2)$ for all θ ∈ R. As ϕ(0) = 1,
this implies $\varphi(\theta) = \exp(-\frac{1}{2}\theta^2)$.

By Lemma 3.4.9, we then also obtain as an immediate corollary that the characteristic
function of the normal distribution with mean ξ and variance σ², where σ > 0, is given by
$\theta \mapsto \exp(i\xi\theta - \frac{1}{2}\sigma^2\theta^2)$. ◦
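The closed form just derived can be confirmed by direct numerical integration of $\int\cos(\theta x)f(x)\,dx$. This is an illustrative sketch of our own; the helper name `normal_cf`, the midpoint rule, and the truncation at ±10 are arbitrary choices.

```python
import math

def normal_cf(theta, n=4000, lim=10.0):
    """Midpoint-rule approximation of the integral of cos(theta*x) times the
    standard normal density over [-lim, lim]; the tail beyond lim = 10 is negligible."""
    h = 2.0 * lim / n
    total = 0.0
    for i in range(n):
        x = -lim + (i + 0.5) * h
        total += math.cos(theta * x) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)
```

For θ = 1, for instance, the returned value is very close to exp(−1/2) ≈ 0.6065.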

Example 3.4.11. In this example, we derive the characteristic function for the standard
exponential distribution. Let ϕ denote the characteristic function; we then have

$$\varphi(\theta) = \int_0^\infty \cos(\theta x)e^{-x}\,dx + i\int_0^\infty \sin(\theta x)e^{-x}\,dx,$$

and we need to evaluate both of these integrals. In order to do so, fix a, b ∈ R and note that

$$\frac{d}{dx}\left(a\cos(\theta x)e^{-x} + b\sin(\theta x)e^{-x}\right) = (b\theta - a)\cos(\theta x)e^{-x} - (a\theta + b)\sin(\theta x)e^{-x}.$$

Next, note that the pair of equations bθ − a = c and −(aθ + b) = d has the unique solutions
$a = (-c - d\theta)/(1+\theta^2)$ and $b = (c\theta - d)/(1+\theta^2)$, such that we obtain

$$\frac{d}{dx}\left(\frac{-c - d\theta}{1+\theta^2}\cos(\theta x)e^{-x} + \frac{c\theta - d}{1+\theta^2}\sin(\theta x)e^{-x}\right) = c\cos(\theta x)e^{-x} + d\sin(\theta x)e^{-x}. \qquad (3.10)$$

Using (3.10) with c = 1 and d = 0, we conclude that

$$\int_0^\infty \cos(\theta x)e^{-x}\,dx = \lim_{M\to\infty}\left[\frac{-\cos(\theta x)}{1+\theta^2}e^{-x} + \frac{\theta\sin(\theta x)}{1+\theta^2}e^{-x}\right]_0^M = \frac{1}{1+\theta^2}, \qquad (3.11)$$

and likewise, using (3.10) with c = 0 and d = 1, we find

$$\int_0^\infty \sin(\theta x)e^{-x}\,dx = \lim_{M\to\infty}\left[\frac{-\theta\cos(\theta x)}{1+\theta^2}e^{-x} + \frac{-\sin(\theta x)}{1+\theta^2}e^{-x}\right]_0^M = \frac{\theta}{1+\theta^2}. \qquad (3.12)$$

Combining (3.11) and (3.12), we conclude

$$\varphi(\theta) = \frac{1}{1+\theta^2} + i\,\frac{\theta}{1+\theta^2} = \frac{1+i\theta}{1+\theta^2} = \frac{1+i\theta}{(1+i\theta)(1-i\theta)} = \frac{1}{1-i\theta}.$$

By Lemma 3.4.9, we then also obtain that the exponential distribution with mean λ for λ > 0
has characteristic function $\theta \mapsto \frac{1}{1-i\lambda\theta}$. ◦
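The formula ϕ(θ) = 1/(1 − iθ) can likewise be checked by numerically integrating $e^{i\theta x}e^{-x}$ over a long truncated range. The sketch below is our own illustration; `exp_cf` and the truncation point are arbitrary choices.

```python
import cmath

def exp_cf(theta, n=100_000, upper=40.0):
    """Midpoint-rule approximation of the integral of exp(i*theta*x) * exp(-x)
    over [0, upper]; the tail beyond upper = 40 contributes only about e^{-40}."""
    h = upper / n
    return sum(cmath.exp((1j * theta - 1.0) * (i + 0.5) * h) for i in range(n)) * h
```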

Example 3.4.12. We wish to derive the characteristic function for the Laplace distribution,
which has density $\frac{1}{2}e^{-|x|}$ with respect to the Lebesgue measure. Denote by ϕ this
characteristic function. Using the relationships sin(−θx) = − sin(θx) and cos(−θx) = cos(θx)
and recalling (3.10), we obtain

$$\begin{aligned}
\varphi(\theta) &= \int_{-\infty}^{\infty} \cos(\theta x)\,\tfrac{1}{2}e^{-|x|}\,dx + i\int_{-\infty}^{\infty} \sin(\theta x)\,\tfrac{1}{2}e^{-|x|}\,dx \\
&= \int_{-\infty}^{\infty} \cos(\theta x)\,\tfrac{1}{2}e^{-|x|}\,dx = \int_0^\infty \cos(\theta x)e^{-x}\,dx \\
&= \lim_{M\to\infty}\left[\frac{-\cos(\theta x)}{1+\theta^2}e^{-x} + \frac{\theta\sin(\theta x)}{1+\theta^2}e^{-x}\right]_0^M = \frac{1}{1+\theta^2}.
\end{aligned}$$

◦

Next, we introduce the convolution of two probability measures and argue that characteristic
functions interact in a simple manner with convolutions.

Definition 3.4.13. Let µ and ν be two probability measures on (R, B). The convolution
µ ∗ ν of µ and ν is the probability measure h(µ ⊗ ν) on (R, B), where h : R2 → R is given by
h(x, y) = x + y.

The following lemma gives an important interpretation of the convolution of two probability
measures.

Lemma 3.4.14. Let X and Y be two independent random variables defined on the same
probability space. Assume that X has distribution µ and that Y has distribution ν. Then
X + Y has distribution µ ∗ ν.

Proof. As X and Y are independent, it holds that (X, Y)(P) = µ ⊗ ν. With h : R² → R
defined by h(x, y) = x + y, we then have, by the theorem on successive transformations, that

$$(X + Y)(P) = (h \circ (X, Y))(P) = h((X, Y)(P)) = h(\mu\otimes\nu) = \mu * \nu,$$

so µ ∗ ν is the distribution of X + Y.

Lemma 3.4.15. Let µ and ν be probability measures on (R, B) with characteristic functions
ϕ and φ. Then µ ∗ ν has characteristic function θ ↦ ϕ(θ)φ(θ).

Proof. Let ψ be the characteristic function of µ ∗ ν. Fix θ ∈ R; we need to demonstrate that
ψ(θ) = ϕ(θ)φ(θ). Let h : R² → R be given by h(x, y) = x + y. Using Fubini's theorem, we
obtain

$$\begin{aligned}
\psi(\theta) &= \int e^{i\theta z}\,d(\mu * \nu)(z) = \int e^{i\theta z}\,dh(\mu\otimes\nu)(z) \\
&= \int e^{i\theta h(x,y)}\,d(\mu\otimes\nu)(x,y) = \int e^{i\theta(x+y)}\,d(\mu\otimes\nu)(x,y) \\
&= \iint e^{i\theta x} e^{i\theta y}\,d\mu(x)\,d\nu(y) = \int e^{i\theta y}\int e^{i\theta x}\,d\mu(x)\,d\nu(y) \\
&= \int e^{i\theta x}\,d\mu(x)\int e^{i\theta y}\,d\nu(y) = \varphi(\theta)\phi(\theta),
\end{aligned}$$

proving the desired result.
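Lemma 3.4.14 and Lemma 3.4.15 together say that adding independent random variables multiplies their characteristic functions. A Monte Carlo sketch (our own illustration; the distributions, sample size and the loose tolerance for sampling error are arbitrary choices):

```python
import cmath
import random

def ecf(xs, theta):
    """Empirical characteristic function of a sample."""
    return sum(cmath.exp(1j * theta * x) for x in xs) / len(xs)

rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(100_000)]     # X ~ N(0, 1), distribution mu
ys = [rng.expovariate(1.0) for _ in range(100_000)]    # Y ~ Exp(1), distribution nu
sums = [x + y for x, y in zip(xs, ys)]                 # X + Y has distribution mu * nu

theta = 0.8
lhs = ecf(sums, theta)                  # empirical characteristic function of mu * nu
rhs = ecf(xs, theta) * ecf(ys, theta)   # product of the marginal empirical versions
```

Up to Monte Carlo error, `lhs` and `rhs` agree, as Lemma 3.4.15 predicts.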

As mentioned earlier, two of our main objectives in this section are to prove that probability
measures are uniquely determined by their characteristic functions, and to prove that weak
convergence is equivalent to pointwise convergence of characteristic functions. To show these
results, we will employ a method based on convolutions with normal distributions.

We will need three technical lemmas. Lemma 3.4.16 shows that convoluting a probability
measure with a normal distribution approximates the original probability measure when the
mean in the normal distribution is zero and the variance is small. Lemma 3.4.17 will show
that if we wish to prove weak convergence of some sequence (µn ) to some µ, it suffices to
prove weak convergence when both the sequence and the limit are convoluted with a normal
distribution with mean zero and small variance. Intuitively, this is not a surprising result. Its
usefulness is clarified by Lemma 3.4.18, which states that the convolution of any probability
measure µ with a particular normal distribution has density with respect to the Lebesgue
measure, and the density can be obtained in closed form in terms of the characteristic function
of the measure µ. This is a frequently seen feature of convolutions: The convolution of two
probability measures in general inherits the regularity properties of each of the convoluted
measures, in this particular case, the regularity property of having a density with respect
to the Lebesgue measure. Summing up, Lemma 3.4.16 shows that convolutions with small
normal distributions are close to the original probability measure, Lemma 3.4.17 shows that
in order to prove weak convergence, it suffices to consider probability measures convoluted
with normal distributions, and Lemma 3.4.18 shows that such convolutions possess good
regularity properties.

Lemma 3.4.16. Let µ be a probability measure on (R, B). Let $\xi_k$ be the normal distribution
with mean zero and variance 1/k. Then $\mu * \xi_k \xrightarrow{wk} \mu$.

Proof. Consider a probability space endowed with two independent random variables X
and Y, where X has distribution µ and Y follows a standard normal distribution. Define
$Y_k = \frac{1}{\sqrt{k}} Y$; then $Y_k$ is independent of X and has distribution $\xi_k$. As a consequence, we also
obtain $P(|Y_k| > \delta) \le \delta^{-2} E|Y_k|^2 = \delta^{-2}/k$ by Lemma 1.2.7, so $Y_k$ converges in probability to
zero. Therefore, Lemma 3.3.2 yields $X + Y_k \xrightarrow{D} X$. However, by Lemma 3.4.14, $X + Y_k$ has
distribution µ ∗ $\xi_k$. Thus, we conclude that $\mu * \xi_k \xrightarrow{wk} \mu$.

Lemma 3.4.17. Let $(\mu_n)$ be a sequence of probability measures on (R, B), and let µ be some
other probability measure. Let $\xi_k$ be the normal distribution with mean zero and variance 1/k.
If it holds for all k ≥ 1 that $\mu_n * \xi_k \xrightarrow{wk} \mu * \xi_k$, then $\mu_n \xrightarrow{wk} \mu$ as well.

Proof. According to Theorem 3.1.7, it suffices to show that $\lim_{n\to\infty}\int f\,d\mu_n = \int f\,d\mu$ for
$f \in C_{bu}(\mathbb{R})$. To do so, let $f \in C_{bu}(\mathbb{R})$. Fix n, k ≥ 1. For convenience, assume given
a probability space with independent random variables $X_n$, $Y_k$ and X, such that $X_n$ has
distribution $\mu_n$, $Y_k$ has distribution $\xi_k$ and X has distribution µ. We then have

$$\begin{aligned}
\left|\int f\,d\mu_n - \int f\,d\mu\right| &= |Ef(X_n) - Ef(X)| \\
&\le |Ef(X_n) - Ef(X_n + Y_k)| + |Ef(X_n + Y_k) - Ef(X + Y_k)| \\
&\quad + |Ef(X + Y_k) - Ef(X)|. \qquad (3.13)
\end{aligned}$$

We will prove that the limes superior of the left-hand side is zero by bounding the limes
superior of each of the three terms on the right-hand side. First note that by our assumptions,
$\lim_{n\to\infty}|Ef(X_n + Y_k) - Ef(X + Y_k)| = \lim_{n\to\infty}|\int f\,d(\mu_n * \xi_k) - \int f\,d(\mu * \xi_k)| = 0$. Now
consider some ε > 0. Pick δ > 0 corresponding to ε in the definition of uniform continuity of f.
We then obtain

$$\begin{aligned}
|Ef(X_n) - Ef(X_n + Y_k)| &\le E|f(X_n) - f(X_n + Y_k)| \\
&\le \varepsilon + E|f(X_n) - f(X_n + Y_k)| 1_{(|Y_k| > \delta)} \\
&\le \varepsilon + 2\|f\|_\infty P(|Y_k| > \delta),
\end{aligned}$$

and similarly $|Ef(X) - Ef(X + Y_k)| \le \varepsilon + 2\|f\|_\infty P(|Y_k| > \delta)$ as well. Combining these
observations with (3.13), we get $\limsup_{n\to\infty}|\int f\,d\mu_n - \int f\,d\mu| \le 2\varepsilon + 4\|f\|_\infty P(|Y_k| > \delta)$.
By Lemma 1.2.7, $P(|Y_k| > \delta) \le \delta^{-2} E|Y_k|^2 = \delta^{-2}/k$, so $\lim_{k\to\infty} P(|Y_k| > \delta) = 0$. All in
all, this yields $\limsup_{n\to\infty}|\int f\,d\mu_n - \int f\,d\mu| \le 2\varepsilon$. As ε > 0 was arbitrary, this proves
$\lim_{n\to\infty}\int f\,d\mu_n = \int f\,d\mu$, and thus $\mu_n \xrightarrow{wk} \mu$ by Theorem 3.1.7.

Lemma 3.4.18. Let µ be some probability measure, and let $\xi_k$ be the normal distribution
with mean zero and variance 1/k. Let ϕ be the characteristic function for µ. The probability
measure µ ∗ $\xi_k$ then has a density f with respect to the Lebesgue measure, and the density is
given by

$$f(u) = \frac{1}{2\pi} \int \varphi(x) \exp\left(-\frac{1}{2k} x^2\right) e^{-iux}\,dx.$$

Proof. Let x ∈ R. By Tonelli's theorem and the change of variable u = y + z, we obtain

$$\begin{aligned}
(\mu * \xi_k)((-\infty, x]) &= \int 1_{(y+z \le x)}\,d(\mu\otimes\xi_k)(y,z) = \iint 1_{(y+z\le x)}\,d\xi_k(z)\,d\mu(y) \\
&= \iint 1_{(y+z\le x)} \frac{\sqrt{k}}{\sqrt{2\pi}} \exp\left(-\frac{k}{2} z^2\right) dz\,d\mu(y) \\
&= \iint 1_{(u\le x)} \frac{\sqrt{k}}{\sqrt{2\pi}} \exp\left(-\frac{k}{2}(u-y)^2\right) du\,d\mu(y) \\
&= \int_{-\infty}^{x} \int \frac{\sqrt{k}}{\sqrt{2\pi}} \exp\left(-\frac{k}{2}(y-u)^2\right) d\mu(y)\,du.
\end{aligned}$$

This implies that µ ∗ $\xi_k$ has a density with respect to the Lebesgue measure, and the density f
is given by $f(u) = \int \frac{\sqrt{k}}{\sqrt{2\pi}} \exp(-\frac{k}{2}(y-u)^2)\,d\mu(y)$. By Example 3.4.10, $\exp(-\frac{k}{2}(y-u)^2)$ is the
characteristic function of the normal distribution with mean zero and variance k, evaluated
in y − u. Therefore, we have

$$\exp\left(-\frac{k}{2}(y-u)^2\right) = \int e^{i(y-u)x} \frac{1}{\sqrt{2\pi k}} \exp\left(-\frac{x^2}{2k}\right) dx.$$

Substituting this in our expression for the density and applying Fubini's theorem, we obtain

$$\begin{aligned}
f(u) &= \int \frac{\sqrt{k}}{\sqrt{2\pi}} \int e^{i(y-u)x} \frac{1}{\sqrt{2\pi k}} \exp\left(-\frac{x^2}{2k}\right) dx\,d\mu(y) \\
&= \frac{1}{2\pi} \iint e^{i(y-u)x} \exp\left(-\frac{x^2}{2k}\right) dx\,d\mu(y) \\
&= \frac{1}{2\pi} \int \left(\int e^{iyx}\,d\mu(y)\right) \exp\left(-\frac{1}{2k} x^2\right) e^{-iux}\,dx \\
&= \frac{1}{2\pi} \int \varphi(x) \exp\left(-\frac{1}{2k} x^2\right) e^{-iux}\,dx,
\end{aligned}$$

proving the lemma.
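As a sanity check of the density formula in Lemma 3.4.18, take µ to be the point mass at zero, so that ϕ ≡ 1 and µ ∗ $\xi_k$ is simply the N(0, 1/k) distribution. The formula should then reproduce the N(0, 1/k) density, $\sqrt{k/(2\pi)}\exp(-ku^2/2)$. The sketch below is our own (the helper name and the midpoint rule are arbitrary choices):

```python
import math

def density_from_cf(u, k, n=4000, lim=12.0):
    """Evaluate (1/(2*pi)) * integral of phi(x) exp(-x^2/(2k)) e^{-iux} dx for phi == 1.
    Only the cosine part survives, since the imaginary part integrates to zero."""
    h = 2.0 * lim / n
    total = 0.0
    for i in range(n):
        x = -lim + (i + 0.5) * h
        total += math.exp(-x * x / (2.0 * k)) * math.cos(u * x)
    return total * h / (2.0 * math.pi)
```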

With Lemma 3.4.17 and Lemma 3.4.18 in hand, our main results on characteristic functions
now follow without much difficulty.

Theorem 3.4.19. Let µ and ν be probability measures on (R, B) with characteristic functions
ϕ and φ, respectively. Then µ and ν are equal if and only if ϕ and φ are equal.

Proof. If µ and ν are equal, we obtain for θ ∈ R that

$$\varphi(\theta) = \int e^{i\theta x}\,d\mu(x) = \int e^{i\theta x}\,d\nu(x) = \phi(\theta),$$

so ϕ and φ are equal. Conversely, assume that ϕ and φ are equal. Let $\xi_k$ be the normal
distribution with mean zero and variance 1/k. By Lemma 3.4.18, µ ∗ $\xi_k$ and ν ∗ $\xi_k$ both have
densities with respect to the Lebesgue measure, and the densities $f_k$ and $g_k$ are given by

$$f_k(u) = \frac{1}{2\pi}\int \varphi(x)\exp\left(-\frac{1}{2k}x^2\right)e^{-iux}\,dx \quad\text{and}\quad
g_k(u) = \frac{1}{2\pi}\int \phi(x)\exp\left(-\frac{1}{2k}x^2\right)e^{-iux}\,dx,$$

respectively. As ϕ and φ are equal, $f_k$ and $g_k$ are equal, and so µ ∗ $\xi_k$ and ν ∗ $\xi_k$ are equal.
By Lemma 3.4.16, $\mu * \xi_k \xrightarrow{wk} \mu$ and $\nu * \xi_k \xrightarrow{wk} \nu$. As µ ∗ $\xi_k$ and ν ∗ $\xi_k$ are equal by our above
observations, we find that $\mu * \xi_k \xrightarrow{wk} \mu$ and $\mu * \xi_k \xrightarrow{wk} \nu$, yielding µ = ν by Lemma 3.1.5.

Theorem 3.4.20. Let $(\mu_n)$ be a sequence of probability measures on (R, B), and let µ be
some other probability measure. Assume that $\mu_n$ has characteristic function $\varphi_n$ and that µ
has characteristic function ϕ. Then $\mu_n \xrightarrow{wk} \mu$ if and only if $\lim_{n\to\infty}\varphi_n(\theta) = \varphi(\theta)$ for all
θ ∈ R.

Proof. First assume that $\mu_n \xrightarrow{wk} \mu$. Fix θ ∈ R. Since x ↦ cos(θx) and x ↦ sin(θx) are in
$C_b(\mathbb{R})$, we obtain

$$\lim_{n\to\infty}\varphi_n(\theta) = \lim_{n\to\infty}\int\cos(\theta x)\,d\mu_n(x) + i\int\sin(\theta x)\,d\mu_n(x)
= \int\cos(\theta x)\,d\mu(x) + i\int\sin(\theta x)\,d\mu(x) = \varphi(\theta),$$

as desired. This proves one implication. It remains to prove that if the characteristic functions
converge, the probability measures converge weakly.

In order to do so, assume that $\lim_{n\to\infty}\varphi_n(\theta) = \varphi(\theta)$ for all θ ∈ R. We will use Lemma 3.4.17
and Lemma 3.4.18 to prove the result. Let $\xi_k$ be the normal distribution with mean zero
and variance 1/k. By Lemma 3.4.18, $\mu_n * \xi_k$ and µ ∗ $\xi_k$ both have densities with respect to the
Lebesgue measure, and the densities $f_{nk}$ and $f_k$ are given by

$$f_{nk}(u) = \frac{1}{2\pi}\int \varphi_n(x)\exp\left(-\frac{1}{2k}x^2\right)e^{-iux}\,dx \quad\text{and}\quad
f_k(u) = \frac{1}{2\pi}\int \varphi(x)\exp\left(-\frac{1}{2k}x^2\right)e^{-iux}\,dx,$$

respectively. Since $|\varphi_n|$ and |ϕ| are bounded by one, the dominated convergence theorem
yields $\lim_{n\to\infty} f_{nk}(u) = f_k(u)$ for all k ≥ 1 and u ∈ R. By Lemma 3.1.9, we may then
conclude $\mu_n * \xi_k \xrightarrow{wk} \mu * \xi_k$ for all k ≥ 1, and Lemma 3.4.17 then shows that $\mu_n \xrightarrow{wk} \mu$, as
desired.

For the following corollary, we introduce $C_b^\infty(\mathbb{R})$ as the set of continuous, bounded functions
f : R → R which are differentiable infinitely often with bounded derivatives.

Corollary 3.4.21. Let $(\mu_n)$ be a sequence of probability measures on (R, B), and let µ be
some other probability measure. Then $\mu_n \xrightarrow{wk} \mu$ if and only if $\lim_{n\to\infty}\int f\,d\mu_n = \int f\,d\mu$ for
all $f \in C_b^\infty(\mathbb{R})$.

Proof. As $C_b^\infty(\mathbb{R}) \subseteq C_b(\mathbb{R})$, it is immediate that if $\mu_n \xrightarrow{wk} \mu$, then $\lim_{n\to\infty}\int f\,d\mu_n = \int f\,d\mu$
for $f \in C_b^\infty(\mathbb{R})$. To show the converse implication, assume that $\lim_{n\to\infty}\int f\,d\mu_n = \int f\,d\mu$ for
$f \in C_b^\infty(\mathbb{R})$. In particular, it holds for θ ∈ R that $\lim_{n\to\infty}\int\sin(\theta x)\,d\mu_n(x) = \int\sin(\theta x)\,d\mu(x)$
and $\lim_{n\to\infty}\int\cos(\theta x)\,d\mu_n(x) = \int\cos(\theta x)\,d\mu(x)$. Letting $\varphi_n$ and ϕ denote the characteristic
functions of $\mu_n$ and µ, respectively, we therefore obtain

$$\lim_{n\to\infty}\varphi_n(\theta) = \lim_{n\to\infty}\int\cos(\theta x)\,d\mu_n(x) + i\int\sin(\theta x)\,d\mu_n(x)
= \int\cos(\theta x)\,d\mu(x) + i\int\sin(\theta x)\,d\mu(x) = \varphi(\theta)$$

for all θ ∈ R, so that Theorem 3.4.20 yields $\mu_n \xrightarrow{wk} \mu$, as desired.

3.5 Central limit theorems

In this section, we use our results from Section 3.4 to prove Lindeberg’s central limit theorem,
which gives sufficient requirements for a normalized sum of independent variables to be
approximated by a normal distribution in a weak convergence sense. This is one of the main
classical results in the theory of weak convergence.

The proof relies on proving pointwise convergence of characteristic functions and applying
Theorem 3.4.20. In order to prove such pointwise convergence, we will be utilizing some
finer properties of the complex exponential, as well as a particular inequality for complex
numbers. We begin by proving these auxiliary results, after which we prove the central limit
theorem for the case of independent and identically distributed random variables. This result
is weaker than the Lindeberg central limit theorem to be proven later, but the arguments
applied illustrate well the techniques to be used in the more difficult proof of Lindeberg’s
central limit theorem, which is given afterwards.

Lemma 3.5.1. Let $z_1, \ldots, z_n$ and $w_1, \ldots, w_n$ be complex numbers with $|z_i| \le 1$ and $|w_i| \le 1$
for all i = 1, …, n. It then holds that $\left|\prod_{i=1}^n z_i - \prod_{i=1}^n w_i\right| \le \sum_{i=1}^n |z_i - w_i|$.

Proof. For n ≥ 2, we have

$$\begin{aligned}
\left|\prod_{i=1}^n z_i - \prod_{i=1}^n w_i\right|
&\le \left|\prod_{i=1}^n z_i - \left(\prod_{i=1}^{n-1} z_i\right) w_n\right| + \left|\left(\prod_{i=1}^{n-1} z_i\right) w_n - \prod_{i=1}^n w_i\right| \\
&= \left(\prod_{i=1}^{n-1} |z_i|\right) |z_n - w_n| + |w_n| \left|\prod_{i=1}^{n-1} z_i - \prod_{i=1}^{n-1} w_i\right| \\
&\le |z_n - w_n| + \left|\prod_{i=1}^{n-1} z_i - \prod_{i=1}^{n-1} w_i\right|,
\end{aligned}$$

and the desired result then follows by induction.
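The inequality of Lemma 3.5.1 is easy to stress-test numerically on random points of the closed unit disc. This sketch is our own illustration; rejection sampling from the square and the trial counts are arbitrary choices.

```python
import random

def prod_diff_and_bound(zs, ws):
    """Return (|prod z - prod w|, sum |z_i - w_i|), the two sides of Lemma 3.5.1."""
    pz = pw = complex(1.0)
    for z in zs:
        pz *= z
    for w in ws:
        pw *= w
    return abs(pz - pw), sum(abs(z - w) for z, w in zip(zs, ws))

rng = random.Random(3)

def unit_disc_point():
    """Draw a complex number of modulus at most one by rejection from the square."""
    while True:
        z = complex(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        if abs(z) <= 1.0:
            return z

checks = []
for _ in range(200):
    n = rng.randint(1, 8)
    zs = [unit_disc_point() for _ in range(n)]
    ws = [unit_disc_point() for _ in range(n)]
    lhs, rhs = prod_diff_and_bound(zs, ws)
    checks.append(lhs <= rhs + 1e-12)
```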

Lemma 3.5.2. It holds that

(1). $|e^x - (1+x)| \le \frac{1}{2}x^2$ for all x ≤ 0.

(2). $\left|e^{ix} - 1 - ix + \frac{x^2}{2}\right| \le \frac{3}{2}x^2$ for all x ∈ R.

(3). $\left|e^{ix} - 1 - ix + \frac{x^2}{2}\right| \le \frac{1}{3}|x|^3$ for all x ∈ R.

Proof. To prove the first inequality, we apply a first-order Taylor expansion of the exponential
mapping around zero. Fix x ∈ R; by Taylor's theorem, we then find that there exists ξ(x)
between zero and x such that $\exp(x) = 1 + x + \frac{1}{2}\exp(\xi(x))x^2$, which for x ≤ 0 yields
$|\exp(x) - (1+x)| \le \frac{1}{2}|\exp(\xi(x))x^2| \le \frac{1}{2}x^2$. This proves the first inequality.

Considering the second inequality, recall that $e^{ix} = \cos x + i\sin x$. We therefore obtain

$$|e^{ix} - 1 - ix + \tfrac{1}{2}x^2| \le |e^{ix} - 1 - ix| + \tfrac{1}{2}x^2 = |\cos x - 1 + i(\sin x - x)| + \tfrac{1}{2}x^2
\le |\cos x - 1| + |\sin x - x| + \tfrac{1}{2}x^2.$$

Recalling that cos′ = − sin, cos″ = − cos, sin′ = cos and sin″ = − sin, first-order Taylor
expansions around zero yield the existence of $\xi^*(x)$ and $\xi^{**}(x)$ between zero and x such that

$$\cos x = 1 - \tfrac{1}{2}\cos(\xi^*(x))x^2 \quad\text{and}\quad \sin x = x - \tfrac{1}{2}\sin(\xi^{**}(x))x^2,$$

which yields $|\cos x - 1| \le \frac{1}{2}|\cos(\xi^*(x))x^2| \le \frac{1}{2}x^2$ and $|\sin x - x| \le \frac{1}{2}|\sin(\xi^{**}(x))x^2| \le \frac{1}{2}x^2$.
Combining our three inequalities, we obtain $|e^{ix} - 1 - ix + \frac{1}{2}x^2| \le \frac{3}{2}x^2$, proving the second
inequality. Finally, we demonstrate the third inequality. Second-order Taylor expansions
around zero yield the existence of $\eta^*(x)$ and $\eta^{**}(x)$ between zero and x such that

$$\cos x = 1 - \tfrac{1}{2}x^2 + \tfrac{1}{6}\sin(\eta^*(x))x^3 \quad\text{and}\quad \sin x = x - \tfrac{1}{6}\cos(\eta^{**}(x))x^3,$$

allowing us to obtain

$$|e^{ix} - 1 - ix + \tfrac{1}{2}x^2| = |\cos x - 1 + \tfrac{1}{2}x^2 + i(\sin x - x)|
\le |\cos x - 1 + \tfrac{1}{2}x^2| + |\sin x - x|
\le \tfrac{1}{6}|\sin(\eta^*(x))x^3| + \tfrac{1}{6}|\cos(\eta^{**}(x))x^3| \le \tfrac{1}{3}|x|^3,$$

as desired. This proves the third inequality.
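All three bounds of Lemma 3.5.2 can be verified numerically on a grid of points. The sketch below is our own; the grid and the small floating-point tolerance are arbitrary choices.

```python
import cmath
import math

def bounds_hold(x):
    """Check the three inequalities of Lemma 3.5.2 at the point x."""
    rem = abs(cmath.exp(1j * x) - 1 - 1j * x + x * x / 2)
    ok2 = rem <= 1.5 * x * x + 1e-12                              # inequality (2)
    ok3 = rem <= abs(x) ** 3 / 3 + 1e-12                          # inequality (3)
    ok1 = x > 0 or abs(math.exp(x) - (1 + x)) <= x * x / 2 + 1e-12  # inequality (1), x <= 0 only
    return ok1 and ok2 and ok3

all_hold = all(bounds_hold(i / 10.0) for i in range(-100, 101))
```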

The combination of Lemma 3.5.1, Lemma 3.5.2 and Theorem 3.4.20 is sufficient to obtain
the following central limit theorem for independent and identically distributed variables.

Theorem 3.5.3 (Classical central limit theorem). Let $(X_n)$ be a sequence of independent
and identically distributed random variables with mean ξ and variance σ², where σ > 0. It
then holds that

$$\frac{1}{\sqrt{n}} \sum_{k=1}^{n} \frac{X_k - \xi}{\sigma} \xrightarrow{D} N(0,1),$$

where N(0, 1) denotes the standard normal distribution.

Proof. It suffices to consider the case where ξ = 0 and σ² = 1. In this case, we have to
argue that $\frac{1}{\sqrt{n}}\sum_{k=1}^n X_k \xrightarrow{D} N(0,1)$. Denote by ϕ the common characteristic function of
the $X_n$ for n ≥ 1, and denote by $\varphi_n$ the characteristic function of $\frac{1}{\sqrt{n}}\sum_{k=1}^n X_k$. Lemma 3.4.15
and Lemma 3.4.9 show that $\varphi_n(\theta) = \varphi(\theta/\sqrt{n})^n$. Recalling from Example 3.4.10 that the
standard normal distribution has characteristic function $\theta \mapsto \exp(-\frac{1}{2}\theta^2)$, Theorem 3.4.20
yields that in order to prove the result, it suffices to show for all θ ∈ R that

$$\lim_{n\to\infty} \varphi(\theta/\sqrt{n})^n = e^{-\theta^2/2}. \qquad (3.14)$$

To do so, first note that by Lemma 3.4.8 and Lemma 3.5.1 we obtain

$$|\varphi(\theta/\sqrt{n})^n - \exp(-\tfrac{1}{2}\theta^2)| = |\varphi(\theta/\sqrt{n})^n - \exp(-\tfrac{1}{2n}\theta^2)^n|
\le n|\varphi(\theta/\sqrt{n}) - \exp(-\tfrac{1}{2n}\theta^2)|. \qquad (3.15)$$

Now, as the variables $(X_n)$ have second moment, we have from Lemma 3.4.8 that ϕ is two
times continuously differentiable with ϕ(0) = 1, ϕ′(0) = 0 and ϕ″(0) = −1. Therefore, a
first-order Taylor expansion shows that for each θ ∈ R, there exists ξ(θ) between 0 and θ
such that $\varphi(\theta) = \varphi(0) + \varphi'(0)\theta + \frac{1}{2}\varphi''(\xi(\theta))\theta^2 = 1 + \frac{1}{2}\theta^2\varphi''(\xi(\theta))$. In particular, this yields

$$\varphi(\theta/\sqrt{n}) = 1 + \tfrac{1}{2n}\theta^2\varphi''(\xi(\theta/\sqrt{n}))
= 1 - \tfrac{1}{2n}\theta^2 + \tfrac{1}{2n}\theta^2(1 + \varphi''(\xi(\theta/\sqrt{n}))). \qquad (3.16)$$

Combining (3.15) and (3.16) and applying the first inequality of Lemma 3.5.2, we obtain

$$\begin{aligned}
|\varphi(\theta/\sqrt{n})^n - \exp(-\tfrac{1}{2}\theta^2)|
&\le n|1 - \tfrac{1}{2n}\theta^2 + \tfrac{1}{2n}\theta^2(1 + \varphi''(\xi(\theta/\sqrt{n}))) - \exp(-\tfrac{1}{2n}\theta^2)| \\
&\le n|1 - \tfrac{1}{2n}\theta^2 - \exp(-\tfrac{1}{2n}\theta^2)| + \tfrac{1}{2}\theta^2|1 + \varphi''(\xi(\theta/\sqrt{n}))| \\
&\le \tfrac{n}{2}(\tfrac{1}{2n}\theta^2)^2 + \tfrac{1}{2}\theta^2|1 + \varphi''(\xi(\theta/\sqrt{n}))| \\
&= \tfrac{1}{8n}\theta^4 + \tfrac{1}{2}\theta^2|1 + \varphi''(\xi(\theta/\sqrt{n}))|. \qquad (3.17)
\end{aligned}$$

Now, as n tends to infinity, $\theta/\sqrt{n}$ tends to zero, and so $\xi(\theta/\sqrt{n})$ tends to zero. As ϕ″
by Lemma 3.4.8 is continuous with ϕ″(0) = −1, this implies $\lim_{n\to\infty}\varphi''(\xi(\theta/\sqrt{n})) = -1$.
Therefore, we obtain from (3.17) that $\limsup_{n\to\infty}|\varphi(\theta/\sqrt{n})^n - \exp(-\frac{1}{2}\theta^2)| = 0$, proving
(3.14). As a consequence, Theorem 3.4.20 yields $\frac{1}{\sqrt{n}}\sum_{k=1}^n X_k \xrightarrow{D} N(0,1)$. This concludes
the proof.
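A quick simulation illustrates the statement: standardized sums of, say, uniform variables should be approximately standard normal. The sketch below is our own illustration (the sample sizes, evaluation points and tolerances are arbitrary choices); it compares two empirical tail frequencies with the standard normal values 0.5 and ≈ 0.975.

```python
import math
import random

def standardized_sum(n, rng):
    """(sum of n Uniform(0,1) draws - n * mean) / sqrt(n * variance), as in Theorem 3.5.3."""
    s = sum(rng.random() for _ in range(n))  # Uniform(0,1): mean 1/2, variance 1/12
    return (s - n * 0.5) / math.sqrt(n / 12.0)

rng = random.Random(0)
vals = [standardized_sum(50, rng) for _ in range(20_000)]
frac_below_zero = sum(v <= 0.0 for v in vals) / len(vals)
frac_below_196 = sum(v <= 1.96 for v in vals) / len(vals)
```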

Theorem 3.5.3 and its proof demonstrate that in spite of the apparently deep nature of
the central limit theorem, the essential ingredients in its proof are simply first-order Taylor
expansions, bounds on the exponential function and Theorem 3.4.20. Next, we will show how
to extend Theorem 3.5.3 to the case where the random variables are not necessarily identically
distributed. The suitable framework for the statement of such more general results is that of
triangular arrays.

Definition 3.5.4. A triangular array is a double sequence (Xnk )n≥k≥1 of random variables.

Let $(X_{nk})_{n\ge k\ge 1}$ be a triangular array. We think of $(X_{nk})_{n\ge k\ge 1}$ as ordered in the shape of a
triangle as follows:

$$\begin{array}{cccc}
X_{11} \\
X_{21} & X_{22} \\
X_{31} & X_{32} & X_{33} \\
\vdots & \vdots & \vdots & \ddots
\end{array}$$

We may then define the row sums by putting $S_n = \sum_{k=1}^n X_{nk}$, and we wish to establish
conditions under which $S_n$ converges in distribution to a normally distributed limit. In
general, we will consider the case where $(X_{nk})_{k\le n}$ is independent for each n ≥ 1, where
$EX_{nk} = 0$ for all n ≥ k ≥ 1 and where $\lim_{n\to\infty} V S_n = 1$. In this case, it is natural to hope
that under suitable regularity conditions, $S_n$ converges in distribution to a standard normal
distribution. The following example shows how the case considered in Theorem 3.5.3 can be
put in terms of a triangular array.

Example 3.5.5. Let $X_1, X_2, \ldots$ be independent and identically distributed random variables
with mean ξ and variance σ², where σ > 0. For 1 ≤ k ≤ n, we then define $X_{nk} = \frac{1}{\sqrt{n}}\frac{X_k - \xi}{\sigma}$.
Ordering the variables in the shape of a triangle, we have

$$\begin{array}{cccc}
\frac{1}{\sqrt{1}}\frac{X_1-\xi}{\sigma} \\
\frac{1}{\sqrt{2}}\frac{X_1-\xi}{\sigma} & \frac{1}{\sqrt{2}}\frac{X_2-\xi}{\sigma} \\
\frac{1}{\sqrt{3}}\frac{X_1-\xi}{\sigma} & \frac{1}{\sqrt{3}}\frac{X_2-\xi}{\sigma} & \frac{1}{\sqrt{3}}\frac{X_3-\xi}{\sigma} \\
\vdots & \vdots & \vdots & \ddots
\end{array}$$

The row sums of the triangular array are then $S_n = \sum_{k=1}^n X_{nk} = \frac{1}{\sqrt{n}}\sum_{k=1}^n \frac{X_k-\xi}{\sigma}$, which is
the same as the expression considered in Theorem 3.5.3. ◦
Theorem 3.5.6 (Lindeberg's central limit theorem). Let $(X_{nk})_{n\ge k\ge 1}$ be a triangular array
of variables with second moment. Assume that for each n ≥ 1, the family $(X_{nk})_{k\le n}$ is
independent, and assume that $EX_{nk} = 0$ for all n ≥ k ≥ 1. With $S_n = \sum_{k=1}^n X_{nk}$, assume
that $\lim_{n\to\infty} V S_n = 1$. Finally, assume that for all c > 0,

$$\lim_{n\to\infty} \sum_{k=1}^n E 1_{(|X_{nk}|>c)} X_{nk}^2 = 0. \qquad (3.18)$$

It then holds that $S_n \xrightarrow{D} N(0,1)$, where N(0, 1) denotes the standard normal distribution.

Proof. We define $\sigma_{nk}^2 = V X_{nk}$ and $\eta_n^2 = \sum_{k=1}^n \sigma_{nk}^2$. Our strategy for the proof will be similar
to that for the proof of Theorem 3.5.3. Let $\varphi_{nk}$ be the characteristic function of $X_{nk}$, and let
$\varphi_n$ be the characteristic function of $S_n$. As $(X_{nk})_{k\le n}$ is independent for each n ≥ 1, Lemma
3.4.15 shows that $\varphi_n(\theta) = \prod_{k=1}^n \varphi_{nk}(\theta)$. Recalling Example 3.4.10, we find that by Theorem
3.4.20, in order to prove the theorem, it suffices to show for all θ ∈ R that

$$\lim_{n\to\infty} \prod_{k=1}^n \varphi_{nk}(\theta) = \exp(-\tfrac{1}{2}\theta^2). \qquad (3.19)$$

First note that by the triangle inequality, Lemma 3.4.8 and Lemma 3.5.1, we obtain

$$\begin{aligned}
\left|\prod_{k=1}^n \varphi_{nk}(\theta) - \exp(-\tfrac{1}{2}\theta^2)\right|
&\le \left|\exp(-\tfrac{1}{2}\eta_n^2\theta^2) - \exp(-\tfrac{1}{2}\theta^2)\right| + \left|\prod_{k=1}^n \varphi_{nk}(\theta) - \exp(-\tfrac{1}{2}\eta_n^2\theta^2)\right| \\
&\le \left|\exp(-\tfrac{1}{2}\eta_n^2\theta^2) - \exp(-\tfrac{1}{2}\theta^2)\right| + \sum_{k=1}^n \left|\varphi_{nk}(\theta) - \exp(-\tfrac{1}{2}\sigma_{nk}^2\theta^2)\right|,
\end{aligned}$$

where the former term tends to zero, since $\lim_{n\to\infty}\eta_n = 1$ by our assumptions. We wish to
show that the latter term also tends to zero. By Lemma 3.5.2, we have

$$\begin{aligned}
|\varphi_{nk}(\theta) - \exp(-\tfrac{1}{2}\sigma_{nk}^2\theta^2)|
&\le |\varphi_{nk}(\theta) - (1 - \tfrac{1}{2}\sigma_{nk}^2\theta^2)| + |\exp(-\tfrac{1}{2}\sigma_{nk}^2\theta^2) - (1 - \tfrac{1}{2}\sigma_{nk}^2\theta^2)| \\
&\le |\varphi_{nk}(\theta) - (1 - \tfrac{1}{2}\sigma_{nk}^2\theta^2)| + \tfrac{1}{2}(\tfrac{1}{2}\sigma_{nk}^2\theta^2)^2,
\end{aligned}$$

such that

$$\sum_{k=1}^n |\varphi_{nk}(\theta) - \exp(-\tfrac{1}{2}\sigma_{nk}^2\theta^2)|
\le \sum_{k=1}^n |\varphi_{nk}(\theta) - (1 - \tfrac{1}{2}\sigma_{nk}^2\theta^2)| + \frac{\theta^4}{8}\sum_{k=1}^n \sigma_{nk}^4.$$

Combining our conclusions, we find that (3.19) follows if only we can show

$$\lim_{n\to\infty} \sum_{k=1}^n |\varphi_{nk}(\theta) - (1 - \tfrac{1}{2}\sigma_{nk}^2\theta^2)| = 0 \quad\text{and}\quad \lim_{n\to\infty} \sum_{k=1}^n \sigma_{nk}^4 = 0. \qquad (3.20)$$

Consider the first limit in (3.20). Fix c > 0. As $EX_{nk} = 0$ and $EX_{nk}^2 = \sigma_{nk}^2$, we may apply
the two final inequalities of Lemma 3.5.2 to obtain

$$\begin{aligned}
\sum_{k=1}^n |\varphi_{nk}(\theta) - (1 - \tfrac{1}{2}\sigma_{nk}^2\theta^2)|
&= \sum_{k=1}^n |E e^{i\theta X_{nk}} - 1 - i\theta E X_{nk} + \tfrac{1}{2}\theta^2 E X_{nk}^2| \\
&\le \sum_{k=1}^n E|e^{i\theta X_{nk}} - 1 - i\theta X_{nk} + \tfrac{1}{2}\theta^2 X_{nk}^2| \\
&\le \sum_{k=1}^n E 1_{(|X_{nk}|\le c)} \tfrac{1}{3}|\theta X_{nk}|^3 + E 1_{(|X_{nk}|>c)} \tfrac{3}{2}|\theta X_{nk}|^2 \\
&\le \frac{c|\theta|^3}{3}\eta_n^2 + \frac{3\theta^2}{2}\sum_{k=1}^n E 1_{(|X_{nk}|>c)} X_{nk}^2. \qquad (3.21)
\end{aligned}$$

Now, by our assumption (3.18), $\lim_{n\to\infty}\sum_{k=1}^n E 1_{(|X_{nk}|>c)} X_{nk}^2 = 0$, while we also have
$\lim_{n\to\infty} \frac{c|\theta|^3}{3}\eta_n^2 = \frac{c|\theta|^3}{3}$. Applying these results with the bound (3.21), we obtain

$$\limsup_{n\to\infty} \sum_{k=1}^n |\varphi_{nk}(\theta) - (1 - \tfrac{1}{2}\sigma_{nk}^2\theta^2)| \le \frac{c|\theta|^3}{3},$$

and as c > 0 was arbitrary, this yields $\lim_{n\to\infty}\sum_{k=1}^n |\varphi_{nk}(\theta) - (1 - \frac{1}{2}\sigma_{nk}^2\theta^2)| = 0$, as desired.
For the second limit in (3.20), we note that for all c > 0, it holds that

$$\begin{aligned}
\sum_{k=1}^n \sigma_{nk}^4 &\le \left(\max_{k\le n}\sigma_{nk}^2\right)\sum_{k=1}^n \sigma_{nk}^2 = \eta_n^2 \max_{k\le n} E X_{nk}^2 \\
&= \eta_n^2 \max_{k\le n}\left(E 1_{(|X_{nk}|\le c)} X_{nk}^2 + E 1_{(|X_{nk}|>c)} X_{nk}^2\right) \\
&\le \eta_n^2 c^2 + \eta_n^2 \sum_{k=1}^n E 1_{(|X_{nk}|>c)} X_{nk}^2,
\end{aligned}$$

so by (3.18), $\limsup_{n\to\infty}\sum_{k=1}^n \sigma_{nk}^4 \le c^2$ for all c > 0. Again, as c > 0 was arbitrary, this
yields $\lim_{n\to\infty}\sum_{k=1}^n \sigma_{nk}^4 = 0$. Thus, both of the limit results in (3.20) hold. Therefore, (3.19)
holds, and so Theorem 3.4.20 allows us to conclude the proof.

The conditions given in Theorem 3.5.6 are in many cases sufficient to obtain convergence
in distribution to the standard normal distribution. The key condition (3.18) is known as
Lindeberg's condition. The condition, however, is not always easy to check. The following
result yields a central limit theorem whose conditions are less difficult to verify. Here, the
condition (3.22) is known as Lyapounov's condition.
Theorem 3.5.7 (Lyapounov's central limit theorem). Let $(X_{nk})_{n\ge k\ge 1}$ be a triangular array
of variables with third moment. Assume that for each n ≥ 1, the family $(X_{nk})_{k\le n}$ is inde-
pendent, and assume that $EX_{nk} = 0$ for all n ≥ k ≥ 1. With $S_n = \sum_{k=1}^n X_{nk}$, assume that
$\lim_{n\to\infty} V S_n = 1$. Finally, assume that there is δ > 0 such that

$$\lim_{n\to\infty} \sum_{k=1}^n E|X_{nk}|^{2+\delta} = 0. \qquad (3.22)$$

It then holds that $S_n \xrightarrow{D} N(0,1)$, where N(0, 1) denotes the standard normal distribution.

Proof. We note that for c > 0, |$X_{nk}$| > c implies $1 \le |X_{nk}|^\delta / c^\delta$, and so

$$\sum_{k=1}^n E 1_{(|X_{nk}|>c)} X_{nk}^2 \le \sum_{k=1}^n E 1_{(|X_{nk}|>c)} \frac{1}{c^\delta}|X_{nk}|^{2+\delta} \le \frac{1}{c^\delta}\sum_{k=1}^n E|X_{nk}|^{2+\delta},$$

so Lyapounov's condition (3.22) implies Lindeberg's condition (3.18). Therefore, the result
follows from Theorem 3.5.6.

In order to apply Theorem 3.5.7, we require that the random variables in the triangular array
have third moments. In many cases, this requirement is satisfied, and so Lyapounov's central
limit theorem is frequently useful. However, the moment condition is too strong to obtain
the classical central limit theorem of Theorem 3.5.3 as a corollary. As the following example
shows, that theorem does in fact follow as a corollary of the stronger Lindeberg central limit
theorem.

Example 3.5.8. Let $(X_n)$ be a sequence of independent and identically distributed random
variables with mean ξ and variance σ², where σ > 0. As in Example 3.5.5, we define a
triangular array by putting $X_{nk} = \frac{1}{\sqrt{n}}\frac{X_k-\xi}{\sigma}$ for n ≥ k ≥ 1. The elements of each row are
then independent, with $EX_{nk} = 0$, and the row sums of the triangular array are given by
$S_n = \sum_{k=1}^n X_{nk} = \frac{1}{\sqrt{n}}\sum_{k=1}^n \frac{X_k-\xi}{\sigma}$ and satisfy $V S_n = 1$. We obtain for c > 0 that

$$\lim_{n\to\infty} \sum_{k=1}^n E 1_{(|X_{nk}|>c)} X_{nk}^2
= \lim_{n\to\infty} \sum_{k=1}^n E 1_{(|X_k-\xi|>c\sigma\sqrt{n})} \frac{(X_k-\xi)^2}{\sigma^2 n}
= \lim_{n\to\infty} \frac{1}{\sigma^2} E 1_{(|X_1-\xi|>c\sigma\sqrt{n})}(X_1-\xi)^2 = 0,$$

by the dominated convergence theorem, since $X_1$ has second moment. Thus, we conclude
that the triangular array satisfies Lindeberg's condition, and therefore, Theorem 3.5.6 applies
and yields $\frac{1}{\sqrt{n}}\sum_{k=1}^n \frac{X_k-\xi}{\sigma} = S_n \xrightarrow{D} N(0,1)$, as in Theorem 3.5.3. ◦
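For a concrete feel for Lindeberg's condition, the truncated second moment appearing in the computation above can be estimated by simulation. The sketch below is our own illustration, taking X ~ Exp(1) so that ξ = σ = 1; the helper name, seed and sample size are arbitrary. The quantity $E 1_{(|X-1|>c\sqrt{n})}(X-1)^2$ should shrink towards zero as n grows.

```python
import random

def lindeberg_term(n, c, samples=100_000):
    """Monte Carlo estimate of E[1_{|X - 1| > c*sqrt(n)} (X - 1)^2] for X ~ Exp(1)."""
    rng = random.Random(1)
    t = c * n ** 0.5
    acc = 0.0
    for _ in range(samples):
        u = rng.expovariate(1.0) - 1.0
        if abs(u) > t:
            acc += u * u
    return acc / samples
```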

3.6 Asymptotic normality

In Section 3.5, we saw examples of particular normalized sums of random variables converging
to a standard normal distribution. The intuitive interpretation of these results is that the
non-normalized sums approximate normal distributions with nonstandard parameters. In
order to easily work with this idea, we in this section introduce the notion of asymptotic
normality.
Definition 3.6.1. Let $(X_n)$ be a sequence of random variables, and let ξ and σ be real with
σ > 0. We say that $X_n$ is asymptotically normal with mean ξ and variance $\frac{1}{n}\sigma^2$ if it holds
that $\sqrt{n}(X_n - \xi) \xrightarrow{D} N(0, \sigma^2)$, where N(0, σ²) denotes the normal distribution with mean
zero and variance σ². If this is the case, we write

$$X_n \overset{as}{\sim} N\left(\xi, \frac{1}{n}\sigma^2\right). \qquad (3.23)$$
n

The results of Theorem 3.5.3 can be restated in terms of asymptotic normality as follows. Assume that $(X_n)$ is a sequence of independent and identically distributed random variables with mean $\xi$ and variance $\sigma^2$, where $\sigma > 0$. Theorem 3.5.3 then states that $\frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{X_k - \xi}{\sigma} \xrightarrow{D} N(0, 1)$. By Lemma 3.1.8, this implies $\frac{1}{\sqrt{n}} \sum_{k=1}^n (X_k - \xi) \xrightarrow{D} N(0, \sigma^2)$, and so
\[
\sqrt{n}\left(\frac{1}{n} \sum_{k=1}^n X_k - \xi\right) = \frac{1}{\sqrt{n}} \sum_{k=1}^n (X_k - \xi) \xrightarrow{D} N(0, \sigma^2),
\]
which by Definition 3.6.1 corresponds to $\frac{1}{n} \sum_{k=1}^n X_k \overset{as}{\sim} N(\xi, \frac{1}{n}\sigma^2)$. The intuitive content of this statement is that as $n$ tends to infinity, the average $\frac{1}{n} \sum_{k=1}^n X_k$ is approximated by a normal distribution with the same mean and variance as the empirical average, namely $\xi$ and $\frac{1}{n}\sigma^2$.
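This can be illustrated by simulation. In the sketch below (the Poisson distribution, seed and replication counts are arbitrary illustrative choices), many replications of $\sqrt{n}(\frac{1}{n}\sum_{k=1}^n X_k - \xi)$ are drawn and their empirical mean and variance compared with $0$ and $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

# R replications of sqrt(n) * (sample mean - xi) for X_k ~ Poisson(3),
# so xi = 3 and sigma^2 = 3; the scaled deviations should resemble N(0, 3).
n, R, lam = 400, 10_000, 3.0
xbar = rng.poisson(lam, size=(R, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - lam)

print("mean (should be near 0):", z.mean())
print("variance (should be near 3):", z.var())
```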

We next show two properties of asymptotic normality: for an asymptotically normal sequence $(X_n)$, $X_n$ converges in probability to the mean, and asymptotic normality is preserved under transformation by certain mappings. These results are of considerable practical importance when analyzing the asymptotic properties of estimators based on independent and identically distributed samples.

Lemma 3.6.2. Let $(X_n)$ be a sequence of random variables, and let $\xi$ and $\sigma$ be real with $\sigma > 0$. Assume that $X_n$ is asymptotically normal with mean $\xi$ and variance $\frac{1}{n}\sigma^2$. It then holds that $X_n \xrightarrow{P} \xi$.

Proof. Fix $\varepsilon > 0$. As $X_n$ is asymptotically normal, $\sqrt{n}(X_n - \xi) \xrightarrow{D} N(0, \sigma^2)$, so Lemma 3.1.6 yields $\lim_{M\to\infty} \sup_{n\ge 1} P(\sqrt{n}|X_n - \xi| \ge M) = 0$. Now let $M > 0$; we then have
\[
\limsup_{n\to\infty} P(|X_n - \xi| \ge \varepsilon)
= \limsup_{n\to\infty} P(\sqrt{n}|X_n - \xi| \ge \sqrt{n}\varepsilon)
\le \limsup_{n\to\infty} P(\sqrt{n}|X_n - \xi| \ge M)
\le \sup_{n\ge 1} P(\sqrt{n}|X_n - \xi| \ge M),
\]
and as $M > 0$ was arbitrary, this implies $\limsup_{n\to\infty} P(|X_n - \xi| \ge \varepsilon) = 0$. As $\varepsilon > 0$ was arbitrary, we obtain $X_n \xrightarrow{P} \xi$.

Theorem 3.6.3 (The delta method). Let $(X_n)$ be a sequence of random variables, and let $\xi$ and $\sigma$ be real with $\sigma > 0$. Assume that $X_n$ is asymptotically normal with mean $\xi$ and variance $\frac{1}{n}\sigma^2$. Let $f : \mathbb{R} \to \mathbb{R}$ be measurable and differentiable at $\xi$. Then $f(X_n)$ is asymptotically normal with mean $f(\xi)$ and variance $\frac{1}{n}\sigma^2 f'(\xi)^2$.

Proof. By our assumptions, $\sqrt{n}(X_n - \xi) \xrightarrow{D} N(0, \sigma^2)$. Our objective is to demonstrate that $\sqrt{n}(f(X_n) - f(\xi)) \xrightarrow{D} N(0, \sigma^2 f'(\xi)^2)$. Note that when defining $R : \mathbb{R} \to \mathbb{R}$ by putting $R(x) = f(x) - f(\xi) - f'(\xi)(x - \xi)$, we obtain $f(x) = f(\xi) + f'(\xi)(x - \xi) + R(x)$, and in particular
\[
\sqrt{n}(f(X_n) - f(\xi)) = \sqrt{n}(f'(\xi)(X_n - \xi) + R(X_n)) = f'(\xi)\sqrt{n}(X_n - \xi) + \sqrt{n}R(X_n). \tag{3.24}
\]
As $\sqrt{n}(X_n - \xi) \xrightarrow{D} N(0, \sigma^2)$, Lemma 3.1.8 shows that $f'(\xi)\sqrt{n}(X_n - \xi) \xrightarrow{D} N(0, \sigma^2 f'(\xi)^2)$. Therefore, by Lemma 3.3.2, the result will follow if we can prove $\sqrt{n}R(X_n) \xrightarrow{P} 0$. To this end, let $\varepsilon > 0$. Note that as $f$ is differentiable at $\xi$, we have
\[
\lim_{x\to\xi} \frac{R(x)}{x - \xi} = \lim_{x\to\xi} \frac{f(x) - f(\xi)}{x - \xi} - f'(\xi) = 0.
\]
Defining $r(x) = R(x)/(x - \xi)$ when $x \ne \xi$ and $r(\xi) = 0$, we then find that $r$ is measurable and continuous at $\xi$, and $R(x) = (x - \xi)r(x)$. In particular, there exists $\delta > 0$ such that whenever $|x - \xi| < \delta$, we have $|r(x)| < \varepsilon$. It then also holds that if $|r(x)| \ge \varepsilon$, we have $|x - \xi| \ge \delta$. From this and Lemma 3.6.2, we get $\limsup_{n\to\infty} P(|r(X_n)| \ge \varepsilon) \le \limsup_{n\to\infty} P(|X_n - \xi| \ge \delta) = 0$, so $r(X_n) \xrightarrow{P} 0$. As the multiplication mapping $(x, y) \mapsto xy$ is continuous, we obtain by Theorem 3.3.3 that $\sqrt{n}R(X_n) = \sqrt{n}(X_n - \xi)r(X_n) \xrightarrow{D} 0$, and so by Lemma 3.3.1, we get $\sqrt{n}R(X_n) \xrightarrow{P} 0$. Combining our conclusions with (3.24), Lemma 3.3.2 now shows that $\sqrt{n}(f(X_n) - f(\xi)) \xrightarrow{D} N(0, \sigma^2 f'(\xi)^2)$, completing the proof.
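The delta method lends itself to a quick simulation check. In the sketch below (the standard exponential distribution, the mapping $f(x) = x^2$, the seed and the sample sizes are all illustrative choices), $X_n$ is the average of $n$ standard exponentials, hence asymptotically $N(1, \frac{1}{n})$, and since $f'(1) = 2$, the theorem predicts that $\sqrt{n}(f(X_n) - f(1))$ is approximately $N(0, 4)$:

```python
import numpy as np

rng = np.random.default_rng(2)

# X_n = mean of n standard exponentials: asymptotically N(1, 1/n).
# With f(x) = x^2 we have f'(1) = 2, so sqrt(n)(f(X_n) - f(1))
# should be approximately N(0, f'(1)^2) = N(0, 4).
n, R = 500, 10_000
xbar = rng.exponential(1.0, size=(R, n)).mean(axis=1)
w = np.sqrt(n) * (xbar**2 - 1.0)

print("mean (should be near 0):", w.mean())
print("variance (should be near 4):", w.var())
```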

Using the preceding results, we may now give an example of a practical application of the central limit theorem and asymptotic normality.

Example 3.6.4. As in Example 1.5.4, consider a measurable space $(\Omega, \mathcal{F})$ endowed with a sequence of random variables $(X_n)$. Assume given for each $\xi \in \mathbb{R}$ a probability measure $P_\xi$ such that for the probability space $(\Omega, \mathcal{F}, P_\xi)$, $(X_n)$ consists of independent and identically distributed variables with mean $\xi$ and unit variance. We may then define an estimator of the mean by putting $\hat{\xi}_n = \frac{1}{n} \sum_{k=1}^n X_k$. As the variables have second moment, Theorem 3.5.3 shows that $\hat{\xi}_n$ is asymptotically normal with mean $\xi$ and variance $\frac{1}{n}$.

This intuitively gives us some information about the distribution of $\hat{\xi}_n$ for large $n \ge 1$. In order to make practical use of this, let $0 < \gamma < 1$. We consider the problem of obtaining a confidence interval for the parameter $\xi$ with confidence level approximating $\gamma$ as $n$ tends to infinity. The statement that $\hat{\xi}_n$ is asymptotically normal with the given parameters means that $\sqrt{n}(\hat{\xi}_n - \xi) \xrightarrow{D} N(0, 1)$. With $\Phi$ denoting the cumulative distribution function for the standard normal distribution, we obtain $\lim_{n\to\infty} P_\xi(\sqrt{n}(\hat{\xi}_n - \xi) \le x) = \Phi(x)$ for all $x \in \mathbb{R}$ by Lemma 3.2.1. Now let $z_\gamma$ be such that $\Phi(-z_\gamma) = (1 - \gamma)/2$, meaning that we have $z_\gamma = -\Phi^{-1}((1 - \gamma)/2)$. As $(1 - \gamma)/2 < 1/2$, $z_\gamma > 0$. Also, $\Phi(z_\gamma) = 1 - \Phi(-z_\gamma) = 1 - (1 - \gamma)/2$, and so we obtain
\[
\lim_{n\to\infty} P_\xi(-z_\gamma \le \sqrt{n}(\hat{\xi}_n - \xi) \le z_\gamma) = \Phi(z_\gamma) - \Phi(-z_\gamma) = \gamma.
\]
However, we also have
\[
P_\xi(-z_\gamma \le \sqrt{n}(\hat{\xi}_n - \xi) \le z_\gamma)
= P_\xi(-z_\gamma/\sqrt{n} \le \hat{\xi}_n - \xi \le z_\gamma/\sqrt{n})
= P_\xi(\hat{\xi}_n - z_\gamma/\sqrt{n} \le \xi \le \hat{\xi}_n + z_\gamma/\sqrt{n}),
\]
so if we define $I_\gamma = (\hat{\xi}_n - z_\gamma/\sqrt{n}, \hat{\xi}_n + z_\gamma/\sqrt{n})$, we have $\lim_{n\to\infty} P_\xi(\xi \in I_\gamma) = \gamma$ for all $\xi \in \mathbb{R}$. This means that, asymptotically speaking, there is probability $\gamma$ that $I_\gamma$ contains $\xi$. In particular, as $\Phi(-1.96) \approx 2.5\%$, we find that $(\hat{\xi}_n - 1.96/\sqrt{n}, \hat{\xi}_n + 1.96/\sqrt{n})$ is a confidence interval with a confidence level approaching a number close to 95% as $n$ tends to infinity. ◦
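The asymptotic confidence level can be checked by simulation. The sketch below (the Laplace distribution, the value $\xi = 2$, the seed and the sample sizes are arbitrary illustrative choices) repeatedly draws $n$ unit-variance observations and records how often the interval $(\hat{\xi}_n - 1.96/\sqrt{n}, \hat{\xi}_n + 1.96/\sqrt{n})$ contains $\xi$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Observations with mean xi = 2 and unit variance: Laplace(0, 1/sqrt(2))
# has variance 1.  The empirical coverage of the 1.96/sqrt(n) interval
# should be close to 95%.
xi, n, R = 2.0, 200, 20_000
x = xi + rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=(R, n))
xbar = x.mean(axis=1)
half = 1.96 / np.sqrt(n)
coverage = np.mean((xbar - half <= xi) & (xi <= xbar + half))
print("empirical coverage:", coverage)
```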

3.7 Higher dimensions

Throughout this chapter, we have worked with weak convergence of random variables with
values in R, as well as probability measures on (R, B). Among our most important results
are the results that weak convergence is equivalent to convergence of characteristic functions,
the interplay between convergence in distribution and convergence in probability, the central
limit theorems and our results on asymptotic normality. The theory of weak convergence
and all of its major results can be extended to the more general context of random variables
with values in Rd and probability measures on (Rd , Bd ) for d ≥ 1, and to a large degree, it
is these multidimensional results which are most useful in practice. In this section, we state
the main results from the multidimensional theory of weak convergence without proof.

In the following, Cb (Rd ) denotes the set of continuous, bounded mappings f : Rd → R.

Definition 3.7.1. Let $(\mu_n)$ be a sequence of probability measures on $(\mathbb{R}^d, \mathbb{B}_d)$, and let $\mu$ be another probability measure. We say that $\mu_n$ converges weakly to $\mu$ and write $\mu_n \xrightarrow{wk} \mu$ if it holds for all $f \in C_b(\mathbb{R}^d)$ that $\lim_{n\to\infty} \int f \, d\mu_n = \int f \, d\mu$.

As in the univariate case, the limit measure is determined uniquely. Also, we say that a
sequence of random variables (Xn ) with values in Rd converges in distribution to a random
variable X with values in Rd or a probability measure µ on (Rd , Bd ) if the distributions
converge weakly. The following analogue of Lemma 3.1.8 then holds.

Lemma 3.7.2. Let $(\mu_n)$ be a sequence of probability measures on $(\mathbb{R}^d, \mathbb{B}_d)$, and let $\mu$ be another probability measure. Let $h : \mathbb{R}^d \to \mathbb{R}^p$ be some continuous mapping. If it holds that $\mu_n \xrightarrow{wk} \mu$, then it also holds that $h(\mu_n) \xrightarrow{wk} h(\mu)$.

An important result relating multidimensional weak convergence to one-dimensional weak convergence is the following. In Theorem 3.7.3, the superscript $t$ denotes transposition, and the mapping $(x, y) \mapsto x^t y$ for $x, y \in \mathbb{R}^d$ thus corresponds to the ordinary inner product on $\mathbb{R}^d$.

Theorem 3.7.3 (Cramér-Wold's device). Let $(X_n)$ be a sequence of random variables with values in $\mathbb{R}^d$, and let $X$ be some other such variable. Then $X_n \xrightarrow{D} X$ if and only if it holds for all $\theta \in \mathbb{R}^d$ that $\theta^t X_n \xrightarrow{D} \theta^t X$.

Letting $(X_n)_{n\ge 1}$ be a sequence of random variables with values in $\mathbb{R}^d$ and letting $X$ be some other such variable, we may define a multidimensional analogue of convergence in probability by saying that $X_n$ converges in probability to $X$, writing $X_n \xrightarrow{P} X$, when $\lim_{n\to\infty} P(\|X_n - X\| \ge \varepsilon) = 0$ for all $\varepsilon > 0$, where $\|\cdot\|$ is some norm on $\mathbb{R}^d$. We then have that $X_n \xrightarrow{P} x$ if and only if $X_n \xrightarrow{D} x$, and the following multidimensional version of Theorem 3.3.3 holds.

Theorem 3.7.4. Let $(X_n, Y_n)$ be a sequence of random variables with values in $\mathbb{R}^d$ and $\mathbb{R}^p$, respectively, let $X$ be some other variable with values in $\mathbb{R}^d$ and let $y \in \mathbb{R}^p$. Consider a continuous mapping $h : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^m$. If $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} y$, then it holds that $h(X_n, Y_n) \xrightarrow{D} h(X, y)$.

We may also define characteristic functions in the multidimensional setting. Let $\mu$ be a probability measure on $(\mathbb{R}^d, \mathbb{B}_d)$. We define the characteristic function of $\mu$ to be the mapping $\varphi : \mathbb{R}^d \to \mathbb{C}$ defined by $\varphi(\theta) = \int e^{i\theta^t x} \, d\mu(x)$. As in the one-dimensional case, the characteristic function determines the probability measure uniquely, and weak convergence of probability measures is equivalent to pointwise convergence of the corresponding characteristic functions.

The central limit theorem also holds in the multidimensional case.


Theorem 3.7.5. Let $(X_n)$ be a sequence of independent and identically distributed random variables with values in $\mathbb{R}^d$ with mean vector $\xi$ and positive semidefinite variance matrix $\Sigma$. It then holds that
\[
\frac{1}{\sqrt{n}} \sum_{k=1}^n (X_k - \xi) \xrightarrow{D} N(0, \Sigma),
\]
where $N(0, \Sigma)$ denotes the normal distribution with mean zero and variance matrix $\Sigma$.
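A quick simulation sketch of the multivariate central limit theorem (the uniform construction, seed and sample sizes are arbitrary illustrative choices): with $X_k = (U_k, U_k + V_k)$ for independent uniforms $U_k, V_k$ on $[0, 1]$, the mean vector is $(1/2, 1)$ and the covariance matrix has entries $\Sigma_{11} = \Sigma_{12} = 1/12$ and $\Sigma_{22} = 1/6$, which the empirical covariance of the scaled sums should reproduce:

```python
import numpy as np

rng = np.random.default_rng(4)

# X_k = (U_k, U_k + V_k) with U, V independent uniform on [0, 1]:
# mean vector (1/2, 1) and covariance [[1/12, 1/12], [1/12, 1/6]].
n, R = 300, 10_000
u = rng.random((R, n))
v = rng.random((R, n))
s = np.sqrt(n) * np.stack([u.mean(axis=1) - 0.5,
                           (u + v).mean(axis=1) - 1.0], axis=1)

C = np.cov(s.T)
print("empirical covariance of the scaled sums:")
print(C)  # should be close to [[1/12, 1/12], [1/12, 1/6]]
```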

As in the one-dimensional case, we may introduce a notion of asymptotic normality. For a sequence of random variables $(X_n)$ with values in $\mathbb{R}^d$, we say that $X_n$ is asymptotically normal with mean $\xi$ and variance $\frac{1}{n}\Sigma$ if $\sqrt{n}(X_n - \xi) \xrightarrow{D} N(0, \Sigma)$, and in this case, we write $X_n \overset{as}{\sim} N(\xi, \frac{1}{n}\Sigma)$. If $X_n \overset{as}{\sim} N(\xi, \frac{1}{n}\Sigma)$, it also holds that $X_n \xrightarrow{P} \xi$. Also, we have the following version of the delta method in the multidimensional case.

Theorem 3.7.6. Let $(X_n)$ be a sequence of random variables with values in $\mathbb{R}^d$, and assume that $X_n$ is asymptotically normal with mean $\xi$ and variance $\frac{1}{n}\Sigma$. Let $f : \mathbb{R}^d \to \mathbb{R}^p$ be measurable and differentiable at $\xi$. Then $f(X_n)$ is asymptotically normal with mean $f(\xi)$ and variance $\frac{1}{n} Df(\xi)\Sigma Df(\xi)^t$, where $Df(\xi)$ is the Jacobian of $f$ at $\xi$, that is, the $p \times d$ matrix consisting of the partial derivatives of $f$ at $\xi$.

Note that Theorem 3.7.6 reduces to Theorem 3.6.3 for d = p = 1, and in the one-dimensional
case, the products in the expression for the asymptotic variance commute, leading to a simpler
expression in the one-dimensional case than in the multidimensional case.
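The matrix $Df(\xi)\Sigma Df(\xi)^t$ from Theorem 3.7.6 is easy to evaluate mechanically. In the sketch below (the mapping $f$, the point $\xi$ and the matrix $\Sigma$ are arbitrary illustrative choices), the Jacobian is approximated by central differences and the asymptotic covariance matrix is formed:

```python
import numpy as np

def jacobian(f, xi, h=1e-6):
    """Central-difference approximation of the p x d Jacobian of f at xi."""
    xi = np.asarray(xi, dtype=float)
    cols = []
    for j in range(xi.size):
        e = np.zeros_like(xi)
        e[j] = h
        cols.append((f(xi + e) - f(xi - e)) / (2 * h))
    return np.stack(cols, axis=1)

# Illustrative f, xi and Sigma; the exact Jacobian at (1, 2) is [[2, 1], [1, 4]].
f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
xi = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

D = jacobian(f, xi)
asym_cov = D @ Sigma @ D.T   # the matrix Df(xi) Sigma Df(xi)^t
print(D)
print(asym_cov)              # [[11, 12.5], [12.5, 22]]
```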

To show the strength of the multidimensional theory, we give the following example, extending Example 3.6.4.

Example 3.7.7. As in Example 3.6.4, consider a measurable space $(\Omega, \mathcal{F})$ endowed with a sequence of random variables $(X_n)$. Let $\Theta = \mathbb{R} \times (0, \infty)$. Assume for each $\theta = (\xi, \sigma^2)$ that we are given a probability measure $P_\theta$ such that for the probability space $(\Omega, \mathcal{F}, P_\theta)$, $(X_n)$ consists of independent and identically distributed variables with fourth moment, and with mean $\xi$ and variance $\sigma^2$. As in Example 1.5.4, we may then define estimators of the mean and variance based on $n$ samples by putting
\[
\hat{\xi}_n = \frac{1}{n} \sum_{k=1}^n X_k \quad \text{and} \quad \hat{\sigma}_n^2 = \frac{1}{n} \sum_{k=1}^n X_k^2 - \left(\frac{1}{n} \sum_{k=1}^n X_k\right)^2.
\]
Now note that the variables $(X_n, X_n^2)$ also are independent and identically distributed, and with $\rho$ denoting $\mathrm{Cov}(X_n, X_n^2)$ and $\eta^2$ denoting $V X_n^2$, we have
\[
E \begin{pmatrix} X_n \\ X_n^2 \end{pmatrix} = \begin{pmatrix} \xi \\ \sigma^2 + \xi^2 \end{pmatrix} \quad \text{and} \quad V \begin{pmatrix} X_n \\ X_n^2 \end{pmatrix} = \begin{pmatrix} \sigma^2 & \rho \\ \rho & \eta^2 \end{pmatrix}. \tag{3.25}
\]
Let $\mu$ and $\Sigma$ denote the mean and variance, respectively, in (3.25). By $\bar{X}_n$ and $\overline{X^2_n}$, we denote $\frac{1}{n} \sum_{k=1}^n X_k$ and $\frac{1}{n} \sum_{k=1}^n X_k^2$, respectively. Using Theorem 3.7.5, we then obtain that $(\bar{X}_n, \overline{X^2_n})$ is asymptotically normal with parameters $(\mu, \frac{1}{n}\Sigma)$.

We will use this multidimensional relationship to find the asymptotic distributions of $\hat{\xi}_n$ and $\hat{\sigma}_n^2$, and we will do so by applying Theorem 3.7.6. To this end, we first consider the mapping $f : \mathbb{R}^2 \to \mathbb{R}$ given by $f(x, y) = x$. Note that we have $Df(x, y) = (1\ 0)$. As $\hat{\xi}_n = f(\bar{X}_n, \overline{X^2_n})$, Theorem 3.7.6 yields that $\hat{\xi}_n$ is asymptotically normal with mean $f(\mu) = \xi$ and variance
\[
\frac{1}{n} Df(\mu)\Sigma Df(\mu)^t = \frac{1}{n} \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} \sigma^2 & \rho \\ \rho & \eta^2 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{1}{n} \sigma^2,
\]
in accordance with what we would have obtained by direct application of Theorem 3.5.3. Next, we consider the variance estimator. Define $g : \mathbb{R}^2 \to \mathbb{R}$ by putting $g(x, y) = y - x^2$. We then have $Dg(x, y) = (-2x\ 1)$. As $\hat{\sigma}_n^2 = g(\bar{X}_n, \overline{X^2_n})$, Theorem 3.7.6 shows that $\hat{\sigma}_n^2$ is asymptotically normal with mean $g(\mu) = \sigma^2$ and variance
\[
\frac{1}{n} Dg(\mu)\Sigma Dg(\mu)^t = \frac{1}{n} \begin{pmatrix} -2\xi & 1 \end{pmatrix} \begin{pmatrix} \sigma^2 & \rho \\ \rho & \eta^2 \end{pmatrix} \begin{pmatrix} -2\xi \\ 1 \end{pmatrix} = \frac{1}{n} (4\xi^2\sigma^2 - 4\xi\rho + \eta^2).
\]
Thus, applying Theorem 3.7.5 and Theorem 3.7.6, we have proven that both $\hat{\xi}_n$ and $\hat{\sigma}_n^2$ are asymptotically normal, and we have identified the asymptotic parameters.

Next, consider some $0 < \gamma < 1$. We will show how to construct a confidence interval for $\xi$ which has a confidence level approximating $\gamma$ as $n$ tends to infinity. Note that this was already accomplished in Example 3.6.4 in the case where the variance was known and equal to one. In this case, we make no such assumptions. Now, we already know that $\hat{\xi}_n$ is asymptotically normal with parameters $(\xi, \frac{1}{n}\sigma^2)$, meaning that $\sqrt{n}(\hat{\xi}_n - \xi) \xrightarrow{D} N(0, \sigma^2)$. Next, note that as $\hat{\sigma}_n^2$ is asymptotically normal with mean $\sigma^2$, Lemma 3.6.2 shows that $\hat{\sigma}_n^2 \xrightarrow{P} \sigma^2$. Therefore, using Theorem 3.3.3, we find that $\sqrt{n}(\hat{\xi}_n - \xi)/\sqrt{\hat{\sigma}_n^2} \xrightarrow{D} N(0, 1)$. We may now proceed as in Example 3.6.4 and note that with $\Phi$ denoting the cumulative distribution function for the standard normal distribution, $\lim_{n\to\infty} P_\theta(\sqrt{n}(\hat{\xi}_n - \xi)/\sqrt{\hat{\sigma}_n^2} \le x) = \Phi(x)$ for all $x \in \mathbb{R}$ by Lemma 3.2.1. Putting $z_\gamma = -\Phi^{-1}((1 - \gamma)/2)$, we then obtain $z_\gamma > 0$ and $\Phi(z_\gamma) - \Phi(-z_\gamma) = \gamma$, and if we define $I_\gamma = (\hat{\xi}_n - z_\gamma\sqrt{\hat{\sigma}_n^2/n}, \hat{\xi}_n + z_\gamma\sqrt{\hat{\sigma}_n^2/n})$, we then obtain
\[
\lim_{n\to\infty} P_\theta(\xi \in I_\gamma)
= \lim_{n\to\infty} P_\theta(\hat{\xi}_n - \sqrt{\hat{\sigma}_n^2}\, z_\gamma/\sqrt{n} \le \xi \le \hat{\xi}_n + \sqrt{\hat{\sigma}_n^2}\, z_\gamma/\sqrt{n})
= \lim_{n\to\infty} P_\theta(-\sqrt{\hat{\sigma}_n^2}\, z_\gamma/\sqrt{n} \le \hat{\xi}_n - \xi \le \sqrt{\hat{\sigma}_n^2}\, z_\gamma/\sqrt{n})
= \lim_{n\to\infty} P_\theta(-z_\gamma \le \sqrt{n}(\hat{\xi}_n - \xi)/\sqrt{\hat{\sigma}_n^2} \le z_\gamma)
= \Phi(z_\gamma) - \Phi(-z_\gamma) = \gamma,
\]
so $I_\gamma$ is a confidence interval for $\xi$ such that, asymptotically speaking, there is probability $\gamma$ that $I_\gamma$ contains $\xi$. ◦
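The formula $4\xi^2\sigma^2 - 4\xi\rho + \eta^2$ for the asymptotic variance of $\hat{\sigma}_n^2$ can be sanity-checked by simulation (the standard exponential distribution, seed and sample sizes below are arbitrary illustrative choices): for standard exponentials, $\xi = \sigma^2 = 1$, $\rho = EX^3 - EX \cdot EX^2 = 6 - 2 = 4$ and $\eta^2 = EX^4 - (EX^2)^2 = 24 - 4 = 20$, so the formula gives $4 - 16 + 20 = 8$:

```python
import numpy as np

rng = np.random.default_rng(5)

# For X ~ Exp(1): xi = sigma^2 = 1, rho = 4, eta^2 = 20, so
# 4*xi^2*sigma^2 - 4*xi*rho + eta^2 = 8 should be the approximate variance
# of sqrt(n)(sigma_hat_n^2 - sigma^2) for large n.
n, R = 500, 8_000
x = rng.exponential(1.0, size=(R, n))
sigma2_hat = x.var(axis=1)            # (1/n) sum (X_k - sample mean)^2
w = np.sqrt(n) * (sigma2_hat - 1.0)

print("variance (should be near 8):", w.var())
```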

3.8 Exercises

Exercise 3.1. Let (θn ) be a sequence of positive numbers. Let µn denote the uniform
distribution on [0, θn ]. Show that µn converges weakly if and only if θn is convergent. In the
affirmative case, identify the limiting distribution. ◦

Exercise 3.2. Let $(\mu_n)$ be a sequence of probability measures concentrated on $\mathbb{N}_0$, and let $\mu$ be another such probability measure. Show that $\mu_n \xrightarrow{wk} \mu$ if and only if it holds that $\lim_{n\to\infty} \mu_n(\{k\}) = \mu(\{k\})$ for all $k \ge 0$. ◦

Exercise 3.3. Let $\mu_n$ denote the Student's t-distribution with $n$ degrees of freedom, that is, the distribution with density $f_n$ given by $f_n(x) = \frac{\Gamma(n + \frac{1}{2})}{\sqrt{2n\pi}\,\Gamma(n)} \left(1 + \frac{x^2}{2n}\right)^{-(n + \frac{1}{2})}$. Show that $\mu_n$ converges weakly to the standard normal distribution. ◦

Exercise 3.4. Let $(p_n)$ be a sequence in $(0, 1)$, and let $\mu_n$ be the binomial distribution with success probability $p_n$ and length $n$. Assume that $\lim_{n\to\infty} np_n = \lambda$ for some $\lambda \ge 0$. Show that if $\lambda > 0$, then $\mu_n$ converges weakly to the Poisson distribution with parameter $\lambda$. Show that if $\lambda = 0$, then $\mu_n$ converges weakly to the Dirac measure at zero. ◦

Exercise 3.5. Let $X_n$ be a random variable which is Beta distributed with shape parameters $(n, n)$. Define $Y_n = \sqrt{8n}(X_n - \frac{1}{2})$. Show that $Y_n$ has density with respect to the Lebesgue measure. Show that the densities converge pointwise to the density of the standard normal distribution. Argue that $Y_n$ converges in distribution to the standard normal distribution. ◦

Exercise 3.6. Let $\mu$ be a probability measure on $(\mathbb{R}, \mathbb{B})$ with cumulative distribution function $F$. Let $q : (0, 1) \to \mathbb{R}$ be a quantile function for $\mu$, meaning that for all $0 < p < 1$, it holds that $F(q(p)-) \le p \le F(q(p))$. Let $\mu_n$ be the probability measure on $(\mathbb{R}, \mathbb{B})$ given by putting $\mu_n(B) = \frac{1}{n} \sum_{k=1}^n 1_{(q(k/(n+1)) \in B)}$ for $B \in \mathbb{B}$. Show that $\mu_n$ converges weakly to $\mu$. ◦

Exercise 3.7. Let (ξn ) and (σn ) be sequences in R, where σn > 0. Let µn denote the normal
distribution with mean ξn and variance σn2 . Show that µn converges weakly if and only if ξn
and σn both converge. In the affirmative case, identify the limiting distribution. ◦

Exercise 3.8. Let (µn ) be a sequence of probability measures on (R, B) such that µn has
cumulative distribution function Fn . Let µ be some other probability measure with cumu-
lative distribution function F . Assume that F is continuous and assume that µn converges
weakly to µ. Let (xn ) be a sequence of real numbers converging to some point x. Show that
limn→∞ Fn (xn ) = F (x). ◦

Exercise 3.9. Let $\mu_n$ be the measure on $(\mathbb{R}, \mathbb{B})$ concentrated on $\{k/n \mid k \ge 1\}$ such that $\mu_n(\{k/n\}) = \frac{1}{n}(1 - \frac{1}{n})^{k-1}$ for each $k \in \mathbb{N}$. Show that $\mu_n$ is a probability measure and that $\mu_n$ converges weakly to the standard exponential distribution. ◦

Exercise 3.10. Calculate the characteristic function of the binomial distribution with suc-
cess parameter p and length n. ◦

Exercise 3.11. Calculate an explicit expression for the characteristic function of the Poisson
distribution with parameter λ. ◦

Exercise 3.12. Consider a probability space endowed with two independent variables $X$ and $Y$ with distributions $\mu$ and $\nu$, respectively, where $\mu$ has characteristic function $\varphi$ and $\nu$ has characteristic function $\phi$. Show that the variable $XY$ has characteristic function $\psi$ given by $\psi(\theta) = \int \varphi(\theta y) \, d\nu(y)$. ◦

Exercise 3.13. Consider a probability space endowed with four independent variables X,
Y , Z and W , all standard normally distributed. Calculate the characteristic function of
XY − ZW and argue that XY − ZW follows a Laplace distribution. ◦

Exercise 3.14. Assume $(X_n)$ is a sequence of independent random variables. Assume that there exists $\beta > 0$ such that $|X_n| \le \beta$ for all $n \ge 1$. Define $S_n = \sum_{k=1}^n X_k$. Prove that if it holds that $\sum_{n=1}^\infty V X_n$ is infinite, then $(S_n - ES_n)/\sqrt{V S_n}$ converges in distribution to the standard normal distribution. ◦

Exercise 3.15. Let $(X_n)$ be a sequence of independent random variables. Let $\varepsilon > 0$. Show that if $\sum_{k=1}^n X_k$ converges almost surely as $n$ tends to infinity, then the following three series are convergent: $\sum_{n=1}^\infty P(|X_n| > \varepsilon)$, $\sum_{n=1}^\infty E X_n 1_{(|X_n| \le \varepsilon)}$ and $\sum_{n=1}^\infty V X_n 1_{(|X_n| \le \varepsilon)}$. ◦

Exercise 3.16. Consider a measurable space $(\Omega, \mathcal{F})$ endowed with a sequence $(X_n)$ of random variables as well as a family of probability measures $(P_\lambda)_{\lambda>0}$ such that under $P_\lambda$, $(X_n)$ consists of independent and identically distributed variables such that $X_n$ follows a Poisson distribution with mean $\lambda$ for some $\lambda > 0$. Let $\bar{X}_n = \frac{1}{n} \sum_{k=1}^n X_k$. Find a mapping $f : (0, \infty) \to (0, \infty)$ such that for each $\lambda > 0$, it holds that under $P_\lambda$, $f(\bar{X}_n)$ is asymptotically normal with mean $f(\lambda)$ and variance $\frac{1}{n}$. ◦

Exercise 3.17. Let $(X_n)$ be a sequence of independent random variables such that $X_n$ has mean $\xi$ and unit variance. Put $S_n = \sum_{k=1}^n X_k$. Let $\alpha > 0$. Show that $(S_n - n\xi)/n^\alpha \xrightarrow{P} 0$ if and only if $\alpha > 1/2$. ◦

Exercise 3.18. Let $\theta > 0$ and let $(X_n)$ be a sequence of independent and identically distributed random variables such that $X_n$ follows a normal distribution with mean $\theta$ and variance $\theta$. The maximum likelihood estimator for estimation of $\theta$ based on $n$ samples is $\hat{\theta}_n = -\frac{1}{2} + \left(\frac{1}{4} + \frac{1}{n} \sum_{k=1}^n X_k^2\right)^{1/2}$. Show that $\hat{\theta}_n$ is asymptotically normal with mean $\theta$ and variance $\frac{1}{n} \frac{4\theta^3 + 2\theta^2}{4\theta^2 + 4\theta + 1}$. ◦

Exercise 3.19. Let $\mu > 0$ and let $(X_n)$ be a sequence of independent and identically distributed random variables such that $X_n$ follows an exponential distribution with mean $1/\mu$. Let $\bar{X}_n = \frac{1}{n} \sum_{k=1}^n X_k$. Show that $\bar{X}_n$ and $\bar{X}_n^{-1}$ are asymptotically normal and identify the asymptotic parameters. Define $Y_n = \frac{1}{\log n} \sum_{k=1}^n \frac{X_k}{k}$. Show that $Y_n \xrightarrow{P} 1/\mu$. ◦

Exercise 3.20. Let $\theta > 0$ and let $(X_n)$ be a sequence of independent and identically distributed random variables such that $X_n$ follows a uniform distribution on $[0, \theta]$. Define $\bar{X}_n = \frac{1}{n} \sum_{k=1}^n X_k$. Show that $\bar{X}_n$ is asymptotically normal with mean $\theta/2$ and variance $\frac{1}{12n}\theta^2$. Next, put $Y_n = \frac{4}{n^2} \sum_{k=1}^n k X_k$. Demonstrate that $Y_n \xrightarrow{P} \theta$. Use Lyapounov's central limit theorem to show that $(Y_n - \theta)/\sqrt{4\theta^2/9n}$ converges to a standard normal distribution. ◦

Exercise 3.21. Let $(X_n, Y_n)$ be a sequence of independent and identically distributed variables such that for each $n \ge 1$, $X_n$ and $Y_n$ are independent, where $X_n$ follows a standard normal distribution and $Y_n$ follows an exponential distribution with mean $\alpha$ for some $\alpha > 0$. Define $S_n = \frac{1}{n} \sum_{k=1}^n (X_k + Y_k)$ and $T_n = \frac{1}{n} \sum_{k=1}^n X_k^2$. Show that $(S_n, T_n)$ is asymptotically normally distributed and identify the asymptotic parameters. Show that $S_n/T_n$ is asymptotically normally distributed and identify the asymptotic parameters. ◦

Exercise 3.22. Let $(X_n)$ be a sequence of independent and identically distributed variables such that $X_n$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$ for some $\sigma > 0$. Assume that $\mu \ne 0$. Define $\bar{X}_n = \frac{1}{n} \sum_{k=1}^n X_k$ and $S_n^2 = \frac{1}{n} \sum_{k=1}^n (X_k - \bar{X}_n)^2$. Show that $S_n/\bar{X}_n$ is asymptotically normally distributed and identify the asymptotic parameters. ◦
Chapter 4

Signed measures and conditioning

In this chapter, we will consider two important but also very distinct topics: decompositions of signed measures and conditional expectations. The topics are only related by virtue of the fact that we will use results from the first section to prove the existence of the conditional expectations to be defined in the following section. In the first section, the framework is a measurable space that will be equipped with a so-called signed measure. In the rest of the chapter, the setting will be a probability space endowed with a random variable.

4.1 Decomposition of signed measures

In this section, we first introduce a generalization of bounded measures, namely bounded, signed measures, where negative values are allowed. We then show that a signed measure can always be decomposed into a difference between two positive, bounded measures. Afterwards we prove the main result of the section, stating how to decompose a bounded, signed measure with respect to a bounded, positive measure. Finally we will show the Radon-Nikodym theorem, which will be crucial in Section 4.2 in order to prove the existence of conditional expectations.

In the rest of the section, we let $(\Omega, \mathcal{F})$ be a measurable space. Recall that $\mu : \mathcal{F} \to [0, \infty]$ is a measure on $(\Omega, \mathcal{F})$ if $\mu(\emptyset) = 0$ and for all disjoint sequences $F_1, F_2, \ldots$ it holds that
\[
\mu\left(\bigcup_{n=1}^\infty F_n\right) = \sum_{n=1}^\infty \mu(F_n).
\]

If µ(Ω) < ∞ we say that µ is a finite measure. However, in the context of this section,
we shall most often use the name bounded, positive measure. A natural generalisation of
a bounded, positive measure is to allow negative values. Hence we consider the following
definition:

Definition 4.1.1. A bounded, signed measure $\nu$ on $(\Omega, \mathcal{F})$ is a map $\nu : \mathcal{F} \to \mathbb{R}$ such that

(1) $\sup\{|\nu(F)| \mid F \in \mathcal{F}\} < \infty$,

(2) $\nu\left(\bigcup_{n=1}^\infty F_n\right) = \sum_{n=1}^\infty \nu(F_n)$ for all pairwise disjoint $F_1, F_2, \ldots \in \mathcal{F}$.

Note that condition (2) is similar to the σ–additivity condition for positive measures. Con-
dition (1) ensures that ν is bounded.

A bounded, signed measure has further properties that resemble properties of positive mea-
sures:

Theorem 4.1.2. Assume that $\nu$ is a bounded, signed measure on $(\Omega, \mathcal{F})$. Then

(1) $\nu(\emptyset) = 0$.

(2) $\nu$ is finitely additive: If $F_1, \ldots, F_N \in \mathcal{F}$ are disjoint sets, then
\[
\nu\left(\bigcup_{n=1}^N F_n\right) = \sum_{n=1}^N \nu(F_n).
\]

(3) $\nu$ is continuous: If $F_n \uparrow F$ or $F_n \downarrow F$, with $F_1, F_2, \ldots \in \mathcal{F}$, then $\nu(F_n) \to \nu(F)$.

Proof. To prove (1), let $F_1 = F_2 = \cdots = \emptyset$ in the $\sigma$-additivity condition. Then we can utilize the simple fact $\emptyset = \bigcup_{n=1}^\infty \emptyset$ such that
\[
\nu(\emptyset) = \nu\left(\bigcup_{n=1}^\infty \emptyset\right) = \sum_{n=1}^\infty \nu(\emptyset),
\]

which can only be true if ν(∅) = 0.

Considering the second result, let $F_{N+1} = F_{N+2} = \cdots = \emptyset$ and apply the $\sigma$-additivity again such that
\[
\nu\left(\bigcup_{n=1}^N F_n\right) = \nu\left(\bigcup_{n=1}^N F_n \cup \bigcup_{n=N+1}^\infty \emptyset\right) = \sum_{n=1}^N \nu(F_n) + \sum_{n=N+1}^\infty 0 = \sum_{n=1}^N \nu(F_n).
\]

Finally we demonstrate the third result in the case where $F_n \uparrow F$. Define $G_1 = F_1$, $G_2 = F_2 \setminus F_1$, $G_3 = F_3 \setminus F_2, \ldots$. Then $G_1, G_2, \ldots$ are disjoint with
\[
\bigcup_{n=1}^N G_n = F_N, \qquad \bigcup_{n=1}^\infty G_n = F,
\]
so
\[
\nu(F_N) = \nu\left(\bigcup_{n=1}^N G_n\right) = \sum_{n=1}^N \nu(G_n) \to \sum_{n=1}^\infty \nu(G_n) = \nu(F) \quad \text{as } N \to \infty.
\]

From the definition of a bounded, signed measure and Theorem 4.1.2 we almost immediately
see that bounded, signed measures with non–negative values are in fact bounded, positive
measures.

Theorem 4.1.3. Assume that ν is a bounded, signed measure on (Ω, F). If ν only has values
in [0, ∞), then ν is a bounded, positive measure.

Proof. That ν is a measure in the classical sense follows since it satisfies the σ–additivity
condition, and we furthermore have ν(∅) = 0 according to (1) in Theorem 4.1.2. That
ν(Ω) < ∞ is obviously a consequence of (1) in Definition 4.1.1.

Example 4.1.4. Let Ω = {1, 2, 3, 4} and assume that ν is a bounded, signed measure on Ω
given by
ν({1}) = 2 ν({2}) = −1 ν({3}) = 4 ν({4}) = −2 .

Then e.g.
ν({1, 2}) = 1 ν({3, 4}) = 2

and
ν(Ω) = 3 .

so we see that although $\{3\} \subsetneq \Omega$, it is possible that $\nu(\{3\}) > \nu(\Omega)$. Hence, condition (1) in the definition is indeed meaningful: only demanding that $\nu(\Omega) < \infty$, as for positive measures, would not ensure that $\nu$ is bounded on all sets, and in particular $\nu(\Omega)$ is not necessarily an upper bound for $\nu$. ◦
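On a finite $\Omega$, a signed measure is determined by its values on singletons, and Example 4.1.4 can be replayed in a few lines of code (a sketch; enumerating all subsets is only feasible because $\Omega$ is tiny). It also exhibits the supremum in condition (1): here $\sup_F |\nu(F)| = 6$, attained at $F = \{1, 3\}$, strictly larger than $\nu(\Omega) = 3$.

```python
from itertools import chain, combinations

# The signed measure of Example 4.1.4, extended to all subsets by additivity.
weights = {1: 2, 2: -1, 3: 4, 4: -2}

def nu(F):
    return sum(weights[w] for w in F)

print(nu({1, 2}), nu({3, 4}), nu({1, 2, 3, 4}))   # 1 2 3

# sup |nu(F)| over all subsets exceeds nu(Omega):
subsets = chain.from_iterable(combinations(weights, r) for r in range(5))
print(max(abs(nu(F)) for F in subsets))            # 6, attained at {1, 3}
```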

Recall that if $\mu$ is a bounded, positive measure then, for $F_1, F_2 \in \mathcal{F}$ with $F_1 \subseteq F_2$, it holds that $\mu(F_1) \le \mu(F_2)$. Hence, condition (1) in Definition 4.1.1 is (for $\mu$) equivalent to $\mu(\Omega) < \infty$.

If $\nu$ is a bounded, signed measure and $F_1, F_2 \in \mathcal{F}$ with $F_1 \subseteq F_2$, then it need not hold that $\nu(F_1) \le \nu(F_2)$: in general we have, using the finite additivity, that
\[
\nu(F_2) = \nu(F_1) + \nu(F_2 \setminus F_1),
\]
but $\nu(F_2 \setminus F_1) \ge 0$ and $\nu(F_2 \setminus F_1) < 0$ are both possible.

Recall from classical measure theory that new positive measures can be constructed by in-
tegrating non–negative functions with respect to other positive measures. Similarly, we can
construct a bounded, signed measure by integrating an integrable function with respect to a
bounded, positive measure.

Theorem 4.1.5. Let $\mu$ be a bounded, positive measure on $(\Omega, \mathcal{F})$ and let $f : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathbb{B})$ be a $\mu$-integrable function, i.e. $\int |f| \, d\mu < \infty$. Then
\[
\nu(F) = \int_F f \, d\mu \qquad (F \in \mathcal{F}) \tag{4.1}
\]
defines a bounded, signed measure on $(\Omega, \mathcal{F})$. Furthermore, it holds that $\nu$ is a bounded, positive measure if and only if $f \ge 0$ $\mu$-a.e. (almost everywhere).

Proof. For all $F \in \mathcal{F}$ we have
\[
|\nu(F)| \le \int_F |f| \, d\mu \le \int |f| \, d\mu < \infty,
\]
which gives (1). To obtain that (2) is satisfied, let $F_1, F_2, \ldots \in \mathcal{F}$ be disjoint and define $F = \bigcup F_n$. Observe that $|1_{\cup_{n=1}^N F_n} f| \le |f|$ for all $N \in \mathbb{N}$. Then dominated convergence yields
\[
\nu\left(\bigcup_{n=1}^\infty F_n\right) = \nu(F) = \int_F f \, d\mu = \int \lim_{N\to\infty} 1_{\cup_{n=1}^N F_n} f \, d\mu = \lim_{N\to\infty} \int 1_{\cup_{n=1}^N F_n} f \, d\mu = \lim_{N\to\infty} \sum_{n=1}^N \int_{F_n} f \, d\mu = \lim_{N\to\infty} \sum_{n=1}^N \nu(F_n) = \sum_{n=1}^\infty \nu(F_n).
\]
The last statement follows from Theorem 4.1.3, since $f \ge 0$ $\mu$-a.e. implies that $\nu(F) \ge 0$ for all $F \in \mathcal{F}$.

In the following definition we introduce two possible relations between a signed measure and a positive measure. A main result in this chapter will be that the two definitions are equivalent.

Definition 4.1.6. Assume that $\mu$ is a bounded, positive measure, and that $\nu$ is a bounded, signed measure on $(\Omega, \mathcal{F})$.

(1) $\nu$ is absolutely continuous with respect to $\mu$ (we write $\nu \ll \mu$) if $\mu(F) = 0$ implies $\nu(F) = 0$.

(2) $\nu$ has density with respect to $\mu$ if there exists a $\mu$-integrable function $f$ (the density) such that (4.1) holds. If $\nu$ has density with respect to $\mu$ we write $\nu = f \cdot \mu$ and $f = \frac{d\nu}{d\mu}$; $f$ is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$.

Lemma 4.1.7. Assume that $\mu$ is a bounded, positive measure on $(\Omega, \mathcal{F})$ and that $\nu$ is a bounded, signed measure on $(\Omega, \mathcal{F})$. If $\nu = f \cdot \mu$, then $\nu \ll \mu$.

Proof. Choose $F \in \mathcal{F}$ with $\mu(F) = 0$. Then $1_F f = 0$ $\mu$-a.e., so
\[
\nu(F) = \int_F f \, d\mu = \int 1_F f \, d\mu = \int 0 \, d\mu = 0,
\]
and the proof is complete.

The following definition will be convenient as well:

Definition 4.1.8. Assume that ν, ν1 , and ν2 are bounded, signed measures on (Ω, F).

(1) ν is concentrated on F ∈ F if ν(G) = 0 for all G ∈ F with G ⊆ F c .



(2) ν1 and ν2 are singular (we write ν1 ⊥ ν2 ), if there exist disjoint sets F1 , F2 ∈ F such
that ν1 is concentrated on F1 and ν2 is concentrated on F2 .

Example 4.1.9. Let $\mu$ be a bounded, positive measure on $(\Omega, \mathcal{F})$, and assume that $f$ is $\mu$-integrable. Define the bounded, signed measure $\nu$ by $\nu = f \cdot \mu$. Then $\nu$ is concentrated on $(f \ne 0)$: take $G \subseteq (f \ne 0)^c$, or equivalently $G \subseteq (f = 0)$. Then $1_G f = 0$, so $\nu(G) = \int 1_G f \, d\mu = \int 0 \, d\mu = 0$ and we have the result.

Now assume that both $f_1$ and $f_2$ are $\mu$-integrable and define $\nu_1 = f_1 \cdot \mu$ and $\nu_2 = f_2 \cdot \mu$. Then it holds that $\nu_1 \perp \nu_2$ if $(f_1 \ne 0) \cap (f_2 \ne 0) = \emptyset$. In fact, the result would even be true if we only have $\mu((f_1 \ne 0) \cap (f_2 \ne 0)) = 0$ (why?). ◦

Lemma 4.1.10. Let $\nu$ be a bounded, signed measure on $(\Omega, \mathcal{F})$. If $\nu$ is concentrated on $F \in \mathcal{F}$, then $\nu$ is also concentrated on any $G \in \mathcal{F}$ with $G \supseteq F$.

A bounded, positive measure $\mu$ on $(\Omega, \mathcal{F})$ is concentrated on $F \in \mathcal{F}$ if and only if $\mu(F^c) = 0$.

Proof. To show the first statement, assume that ν is concentrated on F ∈ F, and let G ∈ F
satisfy G ⊇ F . For any set G0 ⊆ Gc we have that G0 ⊆ F c , so by the definition we have
ν(G0 ) = 0.

For the second result, we only need to show that if µ(F c ) = 0, then µ is concentrated on F .
So assume that G ⊆ F c . Then, since µ is assumed to be a positive measure, we have

0 ≤ µ(G) ≤ µ(F c ) = 0

and we have that µ(G) = 0 as desired.

The following theorem is a deep result from classical measure theory, stating that any bounded, signed measure can be constructed as the difference between two bounded, positive measures.

Theorem 4.1.11 (The Jordan-Hahn decomposition). A bounded, signed measure $\nu$ can be decomposed in exactly one way,
\[
\nu = \nu^+ - \nu^-,
\]
where $\nu^+, \nu^-$ are positive, bounded measures and $\nu^+ \perp \nu^-$.



Proof. The existence: Define $\lambda = \inf\{\nu(F) : F \in \mathcal{F}\}$. Then $-\infty < \lambda \le 0$ and for all $n \in \mathbb{N}$ there exists $F_n \in \mathcal{F}$ with
\[
\nu(F_n) \le \lambda + \frac{1}{2^n}.
\]
We first show that with $G = (F_n \text{ eventually}) = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty F_k$ it holds that
\[
\nu(G) = \lambda. \tag{4.2}
\]
Note that
\[
\bigcap_{k=n}^\infty F_k \uparrow \bigcup_{m=1}^\infty \bigcap_{k=m}^\infty F_k = G \quad \text{as } n \to \infty,
\]
so since $\nu$ is continuous, we have
\[
\nu(G) = \lim_{n\to\infty} \nu\left(\bigcap_{k=n}^\infty F_k\right).
\]
Similarly we have $\bigcap_{k=n}^N F_k \downarrow \bigcap_{k=n}^\infty F_k$ (as $N \to \infty$), so
\[
\nu(G) = \lim_{n\to\infty} \lim_{N\to\infty} \nu\left(\bigcap_{k=n}^N F_k\right).
\]
Let $n$ be fixed and suppose it is shown that for all $N \ge n$
\[
\nu\left(\bigcap_{k=n}^N F_k\right) \le \lambda + \sum_{k=n}^N \frac{1}{2^k}. \tag{4.3}
\]
Since we have that $\sum_{k=1}^\infty \frac{1}{2^k} < \infty$, we must have that $\lim_{n\to\infty} \lim_{N\to\infty} \sum_{k=n}^N \frac{1}{2^k} = 0$. Hence
\[
\lambda \le \nu(G) \le \lambda + \lim_{n\to\infty} \lim_{N\to\infty} \sum_{k=n}^N \frac{1}{2^k} = \lambda + 0 = \lambda.
\]
So we have that $\nu(G) = \lambda$, if we can show (4.3). This is shown by induction for all $N \ge n$. If $N = n$ the result is trivial from the choice of $F_n$:
\[
\nu\left(\bigcap_{k=n}^N F_k\right) = \nu(F_n) \le \lambda + \frac{1}{2^n} = \lambda + \sum_{k=n}^N \frac{1}{2^k}.
\]
If (4.3) is true for $N - 1$ we obtain
\[
\nu\left(\bigcap_{k=n}^N F_k\right) = \nu\left(\bigcap_{k=n}^{N-1} F_k \cap F_N\right) = \nu\left(\bigcap_{k=n}^{N-1} F_k\right) + \nu(F_N) - \nu\left(\bigcap_{k=n}^{N-1} F_k \cup F_N\right)
\le \left(\lambda + \sum_{k=n}^{N-1} \frac{1}{2^k}\right) + \left(\lambda + \frac{1}{2^N}\right) - \lambda = \lambda + \sum_{k=n}^N \frac{1}{2^k}.
\]

In the inequality we have used (4.3) for N − 1, the definition of FN , and that ν(F ) ≥ λ for
all F ∈ F.

We have thus shown $\nu(G) = \lambda$ and may now define, for $F \in \mathcal{F}$,
\[
\nu^-(F) = -\nu(F \cap G), \qquad \nu^+(F) = \nu(F \cap G^c).
\]
Obviously
\[
\sup\{|\nu^-(F)| : F \in \mathcal{F}\} \le \sup\{|\nu(F)| : F \in \mathcal{F}\} < \infty
\]
and
\[
\nu^-\left(\bigcup_{n=1}^\infty F_n\right) = -\nu\left(\left(\bigcup_{n=1}^\infty F_n\right) \cap G\right) = -\nu\left(\bigcup_{n=1}^\infty (F_n \cap G)\right) = -\sum_{n=1}^\infty \nu(F_n \cap G) = \sum_{n=1}^\infty \nu^-(F_n)
\]
for disjoint sets $F_1, F_2, \ldots \in \mathcal{F}$, and similarly for $\nu^+$, so $\nu^+$ and $\nu^-$ are bounded, signed measures. It is easily seen (since $G$ and $G^c$ are disjoint) that $\nu = \nu^+ - \nu^-$. We furthermore have that $\nu^-$ is concentrated on $G$, since for $F \subseteq G^c$
\[
\nu^-(F) = -\nu(F \cap G) = -\nu(\emptyset) = 0.
\]
Similarly $\nu^+$ is concentrated on $G^c$, so we must have $\nu^- \perp \nu^+$.

The existence part of the proof can now be completed by showing that $\nu^+ \ge 0$, $\nu^- \ge 0$. For $F \in \mathcal{F}$ we have $F \cap G = G \setminus (F^c \cap G)$, so
\[
\nu^-(F) = -\nu(F \cap G) = -(\nu(G) - \nu(F^c \cap G)) = -\lambda + \nu(F^c \cap G) \ge 0
\]
and
\[
\nu^+(F) = \nu(F) + \nu^-(F) = \nu(F) - \lambda + \nu(F^c \cap G) = -\lambda + \nu(F \cup (F^c \cap G)) \ge 0,
\]
and the argument is complete.

The uniqueness: In order to show uniqueness of the decomposition, let $\nu = \tilde{\nu}^+ - \tilde{\nu}^-$ be another decomposition satisfying $\tilde{\nu}^+ \ge 0$, $\tilde{\nu}^- \ge 0$ and $\tilde{\nu}^+ \perp \tilde{\nu}^-$. Choose $\tilde{G} \in \mathcal{F}$ to be a set such that $\tilde{\nu}^-$ is concentrated on $\tilde{G}$ and $\tilde{\nu}^+$ is concentrated on $\tilde{G}^c$. Then for $F \in \mathcal{F}$
\[
\nu(F \cap G \cap \tilde{G}^c) = \begin{cases} -\nu^-(F \cap G \cap \tilde{G}^c) \le 0, & \text{since it is a subset of } G, \\ \tilde{\nu}^+(F \cap G \cap \tilde{G}^c) \ge 0, & \text{since it is a subset of } \tilde{G}^c, \end{cases}
\]
and hence $\nu(F \cap G \cap \tilde{G}^c) = 0$. Similarly we observe that $\nu(F \cap G^c \cap \tilde{G}) = 0$, so
\[
\nu^-(F) = -\nu(F \cap G) = -\nu(F \cap G \cap \tilde{G}) - \nu(F \cap G \cap \tilde{G}^c)
= -\nu(F \cap G \cap \tilde{G}) = -\nu(F \cap G \cap \tilde{G}) - \nu(F \cap G^c \cap \tilde{G})
= -\nu(F \cap \tilde{G}) = \tilde{\nu}^-(F)
\]
for all $F \in \mathcal{F}$.
for all F ∈ F.

Example 4.1.12. If ν = f · µ, where µ is a bounded, positive measure and f is µ-integrable,
we see that the decomposition is given by

ν⁺ = f⁺ · µ ,    ν⁻ = f⁻ · µ ,

where f⁺ = f ∨ 0 and f⁻ = −(f ∧ 0) denote the positive and the negative part of f,
respectively. The argument is by inspection: It is clear that ν = ν⁺ − ν⁻, ν⁺ ≥ 0, ν⁻ ≥ 0
and moreover ν⁺ ⊥ ν⁻, since ν⁺ is concentrated on (f ≥ 0) and ν⁻ is concentrated on
(f < 0). ◦
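The decomposition in this example is easy to check numerically on a finite space. The following sketch (the measure µ and density f are illustrative choices, not taken from the text) verifies ν = ν⁺ − ν⁻ and the mutual singularity ν⁺ ⊥ ν⁻:

```python
# Sketch of Example 4.1.12 on a four-point space: nu = f . mu, with
# nu+ = (f v 0) . mu concentrated on (f >= 0) and nu- = -(f ^ 0) . mu
# concentrated on (f < 0).  mu and f are illustrative, not from the text.
mu = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}   # bounded, positive measure
f = {"a": 2.0, "b": -1.0, "c": 0.0, "d": -3.0}  # mu-integrable density

def nu(F):        # signed measure nu(F) = integral over F of f dmu
    return sum(f[w] * mu[w] for w in F)

def nu_plus(F):   # nu+ = f+ . mu
    return sum(max(f[w], 0.0) * mu[w] for w in F)

def nu_minus(F):  # nu- = f- . mu
    return sum(max(-f[w], 0.0) * mu[w] for w in F)

omega = set(mu)
G = {w for w in omega if f[w] < 0}   # nu- is concentrated on G = (f < 0)
for F in [set(), {"a"}, {"b", "d"}, omega]:
    assert abs(nu(F) - (nu_plus(F) - nu_minus(F))) < 1e-12   # nu = nu+ - nu-
assert nu_minus(omega - G) == 0.0 and nu_plus(G) == 0.0      # nu+ and nu- singular
```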

Theorem 4.1.13 (The Lebesgue decomposition). If ν is a bounded, signed measure on (Ω, F)
and µ is a bounded, positive measure on (Ω, F), then there exists an F–measurable, µ–integrable
function f and a bounded, signed measure νs with νs ⊥ µ such that

ν = f · µ + νs .

The decomposition is unique in the sense that if ν = f̃ · µ + ν̃s is another decomposition, then
f = f̃ µ–a.e. and ν̃s = νs.

If ν ≥ 0, then f ≥ 0 µ–a.e. and νs ≥ 0.

Proof. We begin with the uniqueness part of the theorem: Assume that

ν = f · µ + νs = f̃ · µ + ν̃s ,

where µ ⊥ νs and µ ⊥ ν̃s. Choose F0, F̃0 ∈ F such that νs is concentrated on F0, µ is
concentrated on F0^c, ν̃s is concentrated on F̃0 and µ is concentrated on F̃0^c. Define G0 =
F0 ∪ F̃0. According to Lemma 4.1.10 we only need to show that µ(G0) = 0 in order to
conclude that µ is concentrated on G0^c, since µ ≥ 0. This is true since

0 ≤ µ(G0) = µ(F0 ∪ F̃0) ≤ µ(F0) + µ(F̃0) = 0 + 0 = 0 .



Furthermore we have that νs is concentrated on G0 since F0 ⊆ G0. Similarly ν̃s is
concentrated on G0. Then for F ∈ F

ν̃s(F) − νs(F) = ν̃s(F ∩ G0) − νs(F ∩ G0)
             = (ν(F ∩ G0) − (f̃ · µ)(F ∩ G0)) − (ν(F ∩ G0) − (f · µ)(F ∩ G0))
             = (ν(F ∩ G0) − 0) − (ν(F ∩ G0) − 0) = 0 ,

where we have used µ(F ∩ G0) = 0 such that

∫_{F∩G0} f dµ = 0   and   ∫_{F∩G0} f̃ dµ = 0 .

Then νs = ν̃s. The equation f · µ + νs = f̃ · µ + ν̃s gives f · µ = f̃ · µ, which leads to f = f̃
µ–a.e.

To prove existence, it suffices to consider the case ν ≥ 0. For a general ν we can find the
Jordan–Hahn decomposition ν = ν⁺ − ν⁻, and then apply the Lebesgue decomposition to
ν⁺ and ν⁻ separately:

ν⁺ = f · µ + νs   and   ν⁻ = g · µ + κs ,

where there exist F0 and F̃0 such that νs is concentrated on F0, µ is concentrated on F0^c, κs
is concentrated on F̃0, and µ is concentrated on F̃0^c. Defining G0 = F0 ∪ F̃0 we can obtain,
similarly to the argument above, that

νs and κs are both concentrated on G0 and µ is concentrated on G0^c.

Obviously, the bounded, signed measure νs − κs is then concentrated on G0 as well, leading
to νs − κs ⊥ µ. Writing

ν = (f − g) · µ + (νs − κs)

gives the desired decomposition.

So assume that ν ≥ 0. Let L(µ)⁺ denote the set of non–negative, µ–integrable functions and
define

H = { g ∈ L(µ)⁺ : ν(F) ≥ ∫_F g dµ for all F ∈ F } .

Recall that ν ≥ 0, such that e.g. 0 ∈ H. Define furthermore

α = sup{ ∫ g dµ : g ∈ H } .

Since ∫_Ω g dµ ≤ ν(Ω) for all g ∈ H, we must have

0 ≤ α ≤ ν(Ω) < ∞ .

We will show that there exists f ∈ H with ∫ f dµ = α.

Note that if h1 , h2 ∈ H then h1 ∨ h2 ∈ H: For F ∈ F we have


∫_F h1 ∨ h2 dµ = ∫_{F∩(h1≥h2)} h1 dµ + ∫_{F∩(h1<h2)} h2 dµ
             ≤ ν(F ∩ (h1 ≥ h2)) + ν(F ∩ (h1 < h2)) = ν(F) .

At the inequality it is used that both h1 and h2 are in H.

Now, for each n ∈ N, choose gn ∈ H so that

∫ gn dµ ≥ α − 1/n

and define fn = g1 ∨ · · · ∨ gn for each n ∈ N. According to the result shown above, we have
fn ∈ H, and furthermore it is seen that the sequence (fn) is increasing. Then the pointwise
limit f = limn→∞ fn exists, and by monotone convergence we obtain for F ∈ F that

∫_F f dµ = lim_{n→∞} ∫_F fn dµ ≤ ν(F) .

Hence f ∈ H. Furthermore we have for all n ∈ N that f ≥ gn, so

∫ f dµ ≥ ∫ gn dµ ≥ α − 1/n ,

leading to the conclusion that ∫ f dµ = α.

Now we can define the bounded measure νs by

νs = ν − f · µ .

Then νs ≥ 0, since f ∈ H gives

νs(F) = ν(F) − ∫_F f dµ ≥ 0

for all F ∈ F.

What remains in the proof is showing that νs ⊥ µ. For all n ∈ N define the bounded, signed
measure (see e.g. Exercise 4.1) λn by

λn = νs − (1/n) µ .

Let λn = λn⁺ − λn⁻ be the Jordan–Hahn decomposition of λn. Then we can find Fn ∈ F such
that λn⁻ is concentrated on Fn and λn⁺ is concentrated on Fn^c.

For F ∈ F and F ⊆ Fn^c we obtain

ν(F) = νs(F) + ∫_F f dµ = λn(F) + (1/n) µ(F) + ∫_F f dµ
     = λn⁺(F) + (1/n) µ(F) + ∫_F f dµ ≥ ∫_F (f + 1/n) dµ .

If we define

f̃n = f on Fn ,   f̃n = f + 1/n on Fn^c ,

then for F ∈ F

∫_F f̃n dµ = ∫_{F∩Fn} f dµ + ∫_{F∩Fn^c} (f + 1/n) dµ ≤ ν(F ∩ Fn) + ν(F ∩ Fn^c) = ν(F) ,

so f̃n ∈ H. Hence

α ≥ ∫ f̃n dµ = ∫ f dµ + (1/n) µ(Fn^c) = α + (1/n) µ(Fn^c) .

This implies that µ(Fn^c) = 0, leading to

µ( ∪_{n=1}^∞ Fn^c ) = 0 .

Thus µ is concentrated on F0 = (∪_{n=1}^∞ Fn^c)^c = ∩_{n=1}^∞ Fn. Finally, we have for all n ∈ N (recall
that λn⁺ is concentrated on Fn^c) that

0 ≤ νs(F0) ≤ νs(Fn) = (1/n) µ(Fn) + λn(Fn)
                    = (1/n) µ(Fn) − λn⁻(Fn) ≤ (1/n) µ(Ω) ,

which for n → ∞ implies that νs(F0) = 0. Hence (since νs ≥ 0) νs is concentrated on F0^c.

Theorem 4.1.14 (Radon-Nikodym). Let µ be a positive, bounded measure and ν a bounded,
signed measure on (Ω, F). Then ν ≪ µ if and only if there exists an F–measurable, µ–
integrable f such that ν = f · µ.

If ν ≪ µ, then the density f is uniquely determined µ-a.e. If in addition ν ≥ 0, then f ≥ 0
µ-a.e.

Proof. That f is uniquely determined follows from the uniqueness in the Lebesgue
decomposition. Also, ν ≥ 0 implies f ≥ 0 µ–a.e.

In the "if and only if" part it only remains to show that ν ≪ µ implies the existence of an
F–measurable and µ–integrable function f with ν = f · µ. So assume that ν ≪ µ and consider
the Lebesgue decomposition of ν,

ν = f · µ + νs .

Choose F0 such that νs is concentrated on F0 and µ is concentrated on F0^c. For F ∈ F we
then obtain that

νs(F) = νs(F ∩ F0) = ν(F ∩ F0) − (f · µ)(F ∩ F0) = 0 ,

since µ(F ∩ F0) = 0 and since ν ≪ µ implies that ν(F ∩ F0) = 0. Hence νs = 0 and the claim
ν = f · µ follows.
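On a finite space the content of the theorem is very concrete: absolute continuity forces ν to vanish wherever µ puts no mass, and the density is the pointwise ratio of point masses. A hedged sketch (the two measures are illustrative choices, not from the text):

```python
# Radon-Nikodym on a finite space: nu << mu, so nu({w}) = 0 whenever
# mu({w}) = 0, and f = dnu/dmu is the pointwise ratio of point masses.
# The concrete measures below are illustrative choices.
from itertools import chain, combinations

mu = {0: 0.5, 1: 0.3, 2: 0.2, 3: 0.0}
nu = {0: 1.0, 1: 0.6, 2: 0.4, 3: 0.0}   # nu << mu by inspection

# density, defined arbitrarily (here 0) on the mu-null point 3
f = {w: (nu[w] / mu[w] if mu[w] > 0 else 0.0) for w in mu}

def nu_of(F):
    return sum(nu[w] for w in F)

def integral_f_dmu(F):       # (f . mu)(F)
    return sum(f[w] * mu[w] for w in F)

# nu = f . mu holds on every subset F of the four-point space
points = list(mu)
for F in chain.from_iterable(combinations(points, r) for r in range(len(points) + 1)):
    assert abs(nu_of(F) - integral_f_dmu(F)) < 1e-12
```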

4.2 Conditional Expectations given a σ-algebra

In this section we will return to considering a probability space (Ω, F, P ) and real random
variables defined on this space. We shall see how the existence of conditional expectations
can be shown using a Radon-Nikodym derivative. In the course MI the existence is shown
from L2-theory using projections on the subspace L2(Ω, D, P) of L2(Ω, F, P), when D ⊆ F is
a sub σ-algebra.

Let X be a real random variable defined on (Ω, F, P ) with E|X| < ∞. A conditional
expectation of X (given something) can be interpreted as a guess on the value of X(ω) based
on varying amounts of information about which ω ∈ Ω has been drawn. If we know nothing
about ω, then it is not possible to say very much about the value of X(ω). Perhaps the best
guess we can come up with is suggesting the value E(X) = ∫ X dP!

Now let D1, . . . , Dn be a system of disjoint sets in F with ∪_{i=1}^n Di = Ω, and assume that for
a given ω ∈ Ω, we know whether ω ∈ Di for each i = 1, . . . , n. Then we actually have some
information about the ω that has been drawn, and an educated guess on the value of X(ω)
may not be as simple as E(X) any more. Instead our guessing strategy will be
guess on X(ω) = (1/P(Di)) ∫_{Di} X dP   if ω ∈ Di .   (4.4)

We are still using an integral of X, but we only integrate over the set Di , where we know that
ω is an element. It may not be entirely clear, why this is a good strategy for our guess (that
will probably depend on the definition of a good guess), but at least it seems reasonable that
we give the same guess on X(ω) for all ω ∈ Di .

Example 4.2.1. Suppose Ω = {a, b, c, d} and that the probability measure P is given by
P({a}) = P({b}) = P({c}) = P({d}) = 1/4
and furthermore that X : Ω → R is defined by

X(a) = 5 X(b) = 4 X(c) = 3 X(d) = 2 .

If we know nothing about ω then the guess is

E(X) = (5 + 4 + 3 + 2)/4 = 3.5 .

Now let D = {a, b} and assume that we want to guess X(ω) in a situation where we know
whether ω ∈ D or ω ∈ D^c. The strategy described above gives that if ω ∈ D then

guess on X(ω) = ( ∫_D X dP ) / P(D) = ( (1/4)(5 + 4) ) / (1/2) = 4.5 .

Similarly, if we know ω ∈ Dc = {c, d}, then the best guess would be 2.5. Given the knowledge
of whether ω ∈ D or ω ∈ Dc we can write the guess as a function of ω, namely

guess(ω) = 1D (ω) · 4.5 + 1Dc (ω) · 2.5 .
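The computations in this example can be mirrored directly in code; the following minimal sketch reproduces the guesses and also checks that the guess and X integrate to the same value over D and over D^c:

```python
# Example 4.2.1 in code: uniform P on {a, b, c, d}, X as given, D = {a, b}.
P = {w: 0.25 for w in "abcd"}
X = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}

def integral(f, A):        # integral over A of f dP
    return sum(f[w] * P[w] for w in A)

EX = integral(X, "abcd")   # guess with no information: 3.5
D, Dc = set("ab"), set("cd")
PD, PDc = sum(P[w] for w in D), sum(P[w] for w in Dc)

# guess 4.5 on D and 2.5 on D^c, as computed in the example
guess = {w: (integral(X, D) / PD if w in D else integral(X, Dc) / PDc)
         for w in "abcd"}

# the guess is the cell average of X, so its integral over each cell matches
assert abs(integral(guess, D) - integral(X, D)) < 1e-12
assert abs(integral(guess, Dc) - integral(X, Dc)) < 1e-12
```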

Note that the partition {D1, . . . , Dn} generates a sub σ–algebra of F, consisting of all unions
of Di's. The stability requirements for σ–algebras fit very well into the knowledge of whether
ω ∈ Di for all i: If we know whether ω ∈ Di, then we also know whether ω ∈ Di^c. And we
know whether ω ∈ ∪Ai, if we know whether ω ∈ Ai for all i.

The concept of conditional expectations takes the guessing strategy to a general level, where
the conditioning σ-algebra D is a general sub σ–algebra of F. The result will be a random
variable (as in Example 4.2.1) which we will denote E(X|D) and call the conditional expectation
of X given D. We will show in Example 4.2.5 that when D is generated by a partition
{D1, . . . , Dn} as above, then E(X|D) is given by (4.4).

Definition 4.2.2. let X be a real random variable defined on (Ω, F, P ) with E|X| < ∞. A
conditional expectation X given D is a D-measurable real random variable, denoted E(X|D)
which satisfies
E|E(X|D)| < ∞, (1)
Z Z
E(X|D) dP = X dP for all D ∈ D . (2)
D D

Note that one cannot in general use E(X|D) = X (even though it satisfies (1) and (2)): X
is assumed to be F-measurable but need not be D-measurable.

Given this definition, conditional expectations are almost surely unique:


Theorem 4.2.3. (1) If U and Ũ are both conditional expectations of X given D, then U = Ũ
a.s.

(2) If U is a conditional expectation of X given D and Ũ is D-measurable with Ũ = U a.s.,
then Ũ is also a conditional expectation of X given D.


Proof. For the first result consider, e.g., D = (U > Ũ). Then

∫_D (U − Ũ) dP = ∫_D U dP − ∫_D Ũ dP = ∫_D X dP − ∫_D X dP = 0

according to (2) in Definition 4.2.2. But U > Ũ on D, so therefore P(D) = P(U > Ũ) = 0.
Similarly, P(U < Ũ) = 0.

The second statement is trivial: Simply use that

E|Ũ| = E|U|   and   ∫_D Ũ dP = ∫_D U dP ,

so Ũ satisfies (1) and (2).

Theorem 4.2.4. If X is a real random variable with E|X| < ∞, then there exists a condi-
tional expectation of X given D.

Proof. Define for D ∈ D

ν(D) = ∫_D X dP .

Then ν is a bounded, signed measure on (Ω, D). Let P′ denote the restriction of P to D:
P′ is the probability measure on (Ω, D) given by

P′(D) = P(D)

for all D ∈ D. Now we obviously have for all D ∈ D

P′(D) = 0 ⇒ P(D) = 0 ⇒ ν(D) = 0 ,



so ν ≪ P′. According to the Radon–Nikodym Theorem we can find the Radon–Nikodym
derivative U = dν/dP′ satisfying

ν(D) = ∫_D U dP′ .

By construction in the Radon–Nikodym Theorem, U is automatically D–measurable and
P′–integrable. For all D ∈ D we now have that

∫_D X dP = ν(D) = ∫_D U dP′ = ∫_D U dP ,   (4.5)

so it is shown that U is a conditional expectation of X given D.

That the last equation in (4.5) is true is just basic measure theory: The integral of a function
with respect to some measure does not change if the measure is extended to a larger σ–
algebra; the function is also a measurable function on the larger measurable space.

A direct argument could be first looking at indicator functions. Let D ∈ D and note that 1D
is D–measurable. Then

∫ 1D dP′ = P′(D) = P(D) = ∫ 1D dP .

Then it follows that

∫ Y dP′ = ∫ Y dP

if Y is a linear combination of indicator functions, and finally the result is shown to be true
for general D–measurable functions Y by a standard approximation argument.

Example 4.2.5. Consider a probability space (Ω, F, P) and a real random variable X defined
on Ω with E|X| < ∞. Assume that D1, . . . , Dn ∈ F form a partition of Ω: Di ∩ Dj = ∅ for
i ≠ j and ∪_{i=1}^n Di = Ω. Also assume (for convenience) that P(Di) > 0 for all i = 1, . . . , n.
Let D be the σ–algebra generated by the Dj–sets. Then D ∈ D if and only if D is a union
of some Dj's.

We will show that

U = Σ_{i=1}^n ( (1/P(Di)) ∫_{Di} X dP ) 1_{Di}

is a conditional expectation of X given D. First note that U is D–measurable, since the
indicator functions 1_{Di} are D–measurable. Furthermore

E|U| ≤ Σ_{i=1}^n (1/P(Di)) | ∫_{Di} X dP | ∫ 1_{Di} dP
     ≤ Σ_{i=1}^n (1/P(Di)) ( ∫_{Di} |X| dP ) ∫ 1_{Di} dP
     = Σ_{i=1}^n ∫_{Di} |X| dP = E|X| < ∞ .

Finally let D ∈ D. Then D = ∪_{k=1}^m D_{i_k} for some 1 ≤ i1 < · · · < im ≤ n. We therefore obtain

∫_D X dP = Σ_{k=1}^m ∫_{D_{i_k}} X dP ,

so

∫_D U dP = Σ_{k=1}^m ∫_{D_{i_k}} ( Σ_{i=1}^n ( (1/P(Di)) ∫_{Di} X dP ) 1_{Di} ) dP
         = Σ_{k=1}^m ∫_{D_{i_k}} ( (1/P(D_{i_k})) ∫_{D_{i_k}} X dP ) 1_{D_{i_k}} dP
         = Σ_{k=1}^m (1/P(D_{i_k})) ( ∫_{D_{i_k}} X dP ) ∫_{D_{i_k}} 1_{D_{i_k}} dP
         = Σ_{k=1}^m ∫_{D_{i_k}} X dP = ∫_D X dP .

Hence U satisfies the conditions in Definition 4.2.2 and is therefore a conditional expectation
of X given D. ◦
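A numeric sanity check of this example (with an illustrative partition, probabilities and X, not taken from the text): the block-average variable U integrates like X over every union of partition blocks, i.e. over every D ∈ D.

```python
# Numeric check of Example 4.2.5: U is constant on each partition block,
# equal to the block average of X; it then satisfies Definition 4.2.2(2).
# The probabilities, X and the partition are illustrative choices.
from itertools import chain, combinations

P = dict(zip(range(6), [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]))
X = {0: 1.0, 1: 3.0, 2: 0.0, 3: 6.0, 4: -2.0, 5: 5.0}
partition = [{0, 1}, {2, 3}, {4, 5}]     # D1, D2, D3

def integral(f, A):        # integral over A of f dP
    return sum(f[w] * P[w] for w in A)

U = {}
for Di in partition:
    avg = integral(X, Di) / sum(P[w] for w in Di)
    for w in Di:
        U[w] = avg

# every D in sigma(D1, D2, D3) is a union of blocks; check (2) on all of them
for blocks in chain.from_iterable(combinations(partition, r) for r in range(4)):
    D = set().union(*blocks)
    assert abs(integral(U, D) - integral(X, D)) < 1e-12
```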

We shall now show a series of results concerning conditional expectations. The results and the
proofs are well-known from the course MI. In Theorem 4.2.6, X, Y and Xn are real random
variables, all of which are integrable.

Theorem 4.2.6. (1) If X = c a.s., where c ∈ R is a constant, then E(X|D) = c a.s.

(2) For α, β ∈ R it holds that

E(αX + βY |D) = αE(X|D) + βE(Y |D) a.s.

(3) If X ≥ 0 a.s. then E(X|D) ≥ 0 a.s. If Y ≥ X a.s. then E(Y |D) ≥ E(X|D) a.s.

(4) If D ⊆ E are sub σ-algebras of F then

E(X|D) = E[E(X|E)|D] = E[E(X|D)|E] a.s.

(5) If σ(X) and D are independent then

E(X|D) = EX a.s.

(6) If X is D-measurable then

E(X|D) = X a.s.

(7) If it holds for all n ∈ N that Xn ≥ 0 a.s. and Xn+1 ≥ Xn a.s. with lim Xn = X a.s.,
then
lim E(Xn |D) = E(X|D) a.s.
n→∞

(8) If X is D-measurable and E|XY | < ∞, then

E(XY |D) = X E(Y |D) a.s.

(9) If f : R → R is a measurable function that is convex on an interval I, such that
P(X ∈ I) = 1 and E|f(X)| < ∞, then it holds that

f(E(X|D)) ≤ E(f(X)|D) a.s.

Proof. (1) We show that the constant variable U given by U(ω) = c meets the conditions
from Definition 4.2.2. Firstly, it is D–measurable, since for B ∈ B we have

U⁻¹(B) = Ω if c ∈ B ,   U⁻¹(B) = ∅ if c ∉ B ,

which lies in D in either case. Furthermore E|U| = |c| < ∞ and obviously

∫_D U dP = ∫_D c dP = ∫_D X dP .

(2) αE(X|D) + βE(Y|D) is D-measurable and integrable, so all we need to show is (see
part (2) of Definition 4.2.2) that

∫_D (αE(X|D) + βE(Y|D)) dP = ∫_D (αX + βY) dP

for all D ∈ D. But here, the left hand side is

α ∫_D E(X|D) dP + β ∫_D E(Y|D) dP ,

which is seen to equal the right hand side when we use Definition 4.2.2 on both terms.

(3) For the first claim define D = (E(X|D) < 0) = E(X|D)⁻¹((−∞, 0)) and note that
D ∈ D since E(X|D) is D–measurable. Then

∫_D E(X|D) dP = ∫_D X dP ≥ 0 ,

since X ≥ 0 a.s. But the fact that E(X|D) < 0 on D makes ∫_D E(X|D) dP < 0 if P(D) > 0.
Hence P(D) = 0 and E(X|D) ≥ 0 a.s.

For the second claim, just use the first result on Y − X and apply (2).

(4) Firstly, we show that E(X|D) = E[E(X|E)|D] a.s. By definition we have that
E[E(X|E)|D] is D–measurable with finite expectation, and for D ∈ D

∫_D E[E(X|E)|D] dP = ∫_D E(X|E) dP = ∫_D X dP .

Hence we have the result from Definition 4.2.2. In the first equality it is used that E[E(X|E)|D]
is a conditional expectation of E(X|E) given D. In the second equality Definition 4.2.2 is
applied to E(X|E), using that D ∈ D ⊆ E.

Secondly, we prove that E(X|D) = E[E(X|D)|E] a.s. by showing that E(X|D) is a conditional
expectation of E(X|D) given E. But that follows directly from (6), since E(X|D) is
E–measurable.

(5) As in (1), the constant map ω ↦ EX is D–measurable and has finite expectation, so
it remains to show that for D ∈ D

∫_D EX dP = ∫_D X dP .

The left hand side is EX · P(D). For the right hand side we obtain the following, using that
1D and X are independent:

∫_D X dP = ∫ 1D · X dP = ∫ 1D dP · ∫ X dP = P(D) · EX ,

so the stated equality is true.

(6) Trivial.

(7) According to (3) we have for all n ∈ N that E(Xn+1|D) ≥ E(Xn|D) a.s., so with

Fn = (E(Xn+1|D) ≥ E(Xn|D)) ∈ D

we have P(Fn) = 1. Let F0 = (E(X1|D) ≥ 0), such that P(F0) = 1. With the definition
F = ∩_{n=0}^∞ Fn we have F ∈ D and P(F) = 1. For ω ∈ F it holds that the sequence
(E(Xn|D)(ω))_{n∈N} is increasing and E(X1|D)(ω) ≥ 0. Hence for ω ∈ F the number
Y(ω) = lim_{n→∞} E(Xn|D)(ω) is well–defined in [0, ∞]. Defining e.g. Y(ω) = 0 for ω ∈ F^c
makes Y a D–measurable random variable (since F is D–measurable, and Y is the point–wise
limit of 1F E(Xn|D), which are all D–measurable variables) with values in [0, ∞]. Thus the
integral ∫_G Y dP makes sense for all G ∈ F.

In particular we obtain the following for D ∈ D, using monotone convergence in the third and
the sixth equality:

∫_D Y dP = ∫_{D∩F} Y dP = ∫_{D∩F} lim_{n→∞} E(Xn|D) dP = lim_{n→∞} ∫_{D∩F} E(Xn|D) dP
        = lim_{n→∞} ∫_D E(Xn|D) dP = lim_{n→∞} ∫_D Xn dP = ∫_D lim_{n→∞} Xn dP = ∫_D X dP .

Letting D = Ω shows that E|Y| = EY = EX < ∞, so we can conclude Y = E(X|D) a.s.
Thereby we have shown (7).

(8) Since XE(Y|D) is obviously D–measurable, it only remains to show that E|XE(Y|D)| < ∞
and that

∫_D XE(Y|D) dP = ∫_D XY dP   (4.6)
for all D ∈ D.

We now prove the result for all X ≥ 0 and Y ≥ 0 by showing the equation in the following
steps:

(i) when X = 1_{D0} for D0 ∈ D

(ii) when X = Σ_{k=1}^n ak 1_{Dk} for Dk ∈ D and ak ≥ 0

(iii) when X ≥ 0 is a general D–measurable variable

So firstly assume that X = 1_{D0} with D0 ∈ D. Then

E|1_{D0} E(Y|D)| ≤ E|E(Y|D)| < ∞

and since D ∩ D0 ∈ D we obtain

∫_D 1_{D0} E(Y|D) dP = ∫_{D∩D0} E(Y|D) dP = ∫_{D∩D0} Y dP = ∫_D 1_{D0} Y dP .

Hence formula (4.6) is shown in case (i). If

X = Σ_{k=1}^n ak 1_{Dk}

with Dk ∈ D and ak ≥ 0, we easily obtain (4.6) from linearity:

∫_D XE(Y|D) dP = Σ_{k=1}^n ak ∫_D 1_{Dk} E(Y|D) dP = Σ_{k=1}^n ak ∫_D 1_{Dk} Y dP = ∫_D XY dP .

For a general D–measurable X we can obtain X through the approximation

X = lim_{n→∞} Xn ,

where

Xn = Σ_{k=1}^{n2^n} ((k−1)/2^n) 1_{((k−1)/2^n ≤ X < k/2^n)} .

Note that all sets ((k−1)/2^n ≤ X < k/2^n) are D–measurable, so each Xn has the form from
step (ii). Hence

∫_D Xn E(Y|D) dP = ∫_D Xn Y dP   (4.7)

for all n ∈ N. Furthermore the construction of (Xn)_{n∈N} makes the sequence non–negative
and increasing.

Since we have assumed that Y ≥ 0, we must have XE(Y|D) ≥ 0 a.s. Thereby the integral
∫_D XE(Y|D) dP is defined for all D ∈ D (but it may be +∞). Since the sequence
(Xn E(Y|D))_{n∈N} is almost surely increasing (increasing for all ω with E(Y|D)(ω) ≥ 0), we
obtain

∫_D XE(Y|D) dP = ∫_D lim_{n→∞} Xn E(Y|D) dP = lim_{n→∞} ∫_D Xn E(Y|D) dP
              = lim_{n→∞} ∫_D Xn Y dP = ∫_D XY dP ,

where the second and the fourth equality follow from monotone convergence, and the third
equality is a result of (4.7). From this (since E|XY| < ∞) we in particular see that

E|XE(Y|D)| = E(XE(Y|D)) = ∫_Ω XE(Y|D) dP = ∫_Ω XY dP = E(XY) < ∞ .

Hence (8) is shown in the case where X ≥ 0 and Y ≥ 0. That (8) holds in general then
easily follows by splitting X and Y up into their positive and negative parts, X = X⁺ − X⁻
and Y = Y⁺ − Y⁻, and then applying the version of (8) that deals with positive X and Y to
each of the terms; we obtain the desired result by multiplying out the brackets.

(9) The full proof is given in the lecture notes of the course MI. That

E(X|D)+ ≤ E(X + |D) a.s.

is shown in Exercise 4.11, and that

|E(X|D)| ≤ E(|X||D) a.s.

is shown in Exercise 4.12.
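Property (4) of Theorem 4.2.6, the tower property, is easy to verify numerically when both σ-algebras are generated by finite partitions, conditioning then being block averaging as in Example 4.2.5. A sketch with illustrative choices:

```python
# Tower property (Theorem 4.2.6(4)) on a finite space: the partition
# generating E refines the one generating D, so D is a sub-sigma-algebra
# of E and E[E(X|E)|D] = E(X|D).  All concrete values are illustrative.
P = {w: 0.125 for w in range(8)}
X = {w: float(w * w) for w in range(8)}
fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]      # generates E
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]        # generates D, with D inside E

def cond_exp(f, partition):
    """Conditional expectation given the sigma-algebra of a finite partition."""
    out = {}
    for A in partition:
        avg = sum(f[w] * P[w] for w in A) / sum(P[w] for w in A)
        for w in A:
            out[w] = avg
    return out

inner = cond_exp(X, fine)           # E(X|E)
tower = cond_exp(inner, coarse)     # E[E(X|E)|D]
direct = cond_exp(X, coarse)        # E(X|D)
assert all(abs(tower[w] - direct[w]) < 1e-12 for w in range(8))
```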

4.3 Conditional expectations given a random variable

In this section we will consider the special case, where the conditioning σ–algebra D is
generated by a random variable Y . So assume that Y : (Ω, F) → (E, E) is a random variable
with values in the space E that is not necessarily R. If D = σ(Y ), i.e. the σ-algebra generated
by Y , we write
E(X|Y )

rather than E(X|D) and the resulting random variable is referred to as the conditional
expectation of X given Y . Recall that D ∈ σ(Y ) is always of the form D = (Y ∈ A) for
some A ∈ E. Then we immediately have the following characterization of E(X|Y ):

Theorem 4.3.1. Let X be a real random variable with E|X| < ∞, and assume that Y is a
random variable with values in (E, E). Then the conditional expectation E(X|Y ) of X given
Y is characterised by being σ(Y )-measurable and satisfying E|E(X|Y )| < ∞ and
∫_{(Y∈A)} E(X|Y) dP = ∫_{(Y∈A)} X dP   for all A ∈ E.

Note that if σ(Y) = σ(Ỹ), then E(X|Y) = E(X|Ỹ) a.s. If e.g. Y takes values in the
real numbers and ψ : E → E is a bijective and bimeasurable map (ψ and ψ⁻¹ are both
measurable), then

E(X|Y) = E(X|ψ(Y)) a.s.

The following lemma will be extremely useful in the comprehension of conditional expectations
given random variables.

Lemma 4.3.2. A real random variable Z is σ(Y )-measurable if and only if there exists a
measurable map φ : (E, E) → (R, B) such that

Z = φ ◦ Y.

Proof. First the easy implication:

Assume that Z = φ◦Y , where φ is E −B–measurable, and obviously Y is σ(Y )−E–measurable.


Then it is well–known that Z is σ(Y )–B–measurable.

Now assume that Z is σ(Y )–measurable. We can write Z as


Z = lim_{n→∞} Σ_{k=−n2^n}^{n2^n} ((k−1)/2^n) 1_{((k−1)/2^n ≤ Z < k/2^n)} ,   (4.8)

where each set ((k−1)/2^n ≤ Z < k/2^n) ∈ σ(Y), since Z is σ(Y)–measurable. Define the class H of
real random variables by

H = {Z′ : there exists an E − B–measurable function ψ with Z′ = ψ ◦ Y}

Because of the approximation (4.8) the argument will be complete, if we can show that H
has the following properties

(i) 1D ∈ H for all D ∈ σ(Y )



(ii) a1 Z1 + · · · + an Zn ∈ H if Z1 , . . . , Zn ∈ H and a1 , . . . , an ∈ R

(iii) Z ∈ H, where Z = limn→∞ Zn and all Zn ∈ H

because in that case we will have shown that Z ∈ H.

(i): Assume that D ∈ σ(Y ). Then there exists a set A ∈ E such that D = (Y ∈ A) (simply
from the definition of σ(Y )). But then 1D = 1A ◦ Y , since

1D (ω) = 1 ⇔ ω ∈ D = (Y ∈ A)
⇔ Y (ω) ∈ A
⇔ (1A ◦ Y )(ω) = 1 .

We have that 1A is E − B–measurable, so (i) is shown.

(ii): Assume that Zk = φk ◦ Y, where φk is E − B–measurable, for k = 1, . . . , n. Then we get

(Σ_{k=1}^n ak Zk)(ω) = Σ_{k=1}^n ak φk(Y(ω)) = ((Σ_{k=1}^n ak φk) ◦ Y)(ω) .

So

Σ_{k=1}^n ak Zk = (Σ_{k=1}^n ak φk) ◦ Y ,

where Σ_{k=1}^n ak φk is measurable. Hence we have shown (ii).

(iii): Assume that Zn = φn ◦ Y for all n ∈ N, where φn is E − B–measurable. Then

Z(ω) = lim_{n→∞} Zn(ω) = lim_{n→∞} (φn ◦ Y)(ω) = lim_{n→∞} φn(Y(ω))

for all ω ∈ Ω. In particular the limit lim_{n→∞} φn(y) exists for all y ∈ Y(Ω) = {Y(ω) : ω ∈ Ω}.
Define φ : E → R by

φ(y) = lim_{n→∞} φn(y)  if the limit exists,   φ(y) = 0  otherwise;
then φ is E − B–measurable, since F = (limn φn exists) ∈ E and φ = limn (1F φn ) with each
1F φn being E − B–measurable. Furthermore note that Z(ω) = φ(Y (ω)), so Z = φ ◦ Y . Hence
(iii) is shown.

Now we return to the discussion of the σ(Y )-measurable random variable E(X|Y ). By the
lemma, we have that
E(X|Y ) = φ ◦ Y,

for some measurable φ : E → R. We call φ(y) a conditional expectation of X given Y = y


and write
φ(y) = E(X|Y = y).
This type of conditional expectations is characterized in Theorem 4.3.3 below.
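On a finite space the map φ can be written down explicitly: φ(y) is the P-weighted average of X over the level set (Y = y). A minimal sketch (the space, Y and X are illustrative choices, not from the text):

```python
# Sketch of the factorisation E(X|Y) = phi o Y on a five-point space.
# phi(y) averages X over the level set (Y = y); all values illustrative.
P = {w: 0.2 for w in range(5)}
Y = {0: "u", 1: "u", 2: "v", 3: "v", 4: "v"}   # Y maps into E = {"u", "v"}
X = {0: 1.0, 1: 5.0, 2: 2.0, 3: 2.0, 4: 8.0}

def phi(y):
    """A conditional expectation of X given Y = y."""
    level = [w for w in P if Y[w] == y]
    return sum(X[w] * P[w] for w in level) / sum(P[w] for w in level)

cond = {w: phi(Y[w]) for w in P}   # E(X|Y) = phi o Y, sigma(Y)-measurable

# defining property: integrals of E(X|Y) and X agree on every set (Y in A)
for A in [set(), {"u"}, {"v"}, {"u", "v"}]:
    D = [w for w in P if Y[w] in A]
    lhs = sum(cond[w] * P[w] for w in D)
    rhs = sum(X[w] * P[w] for w in D)
    assert abs(lhs - rhs) < 1e-12
```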

Theorem 4.3.3. A measurable map φ : E → R defines a conditional expectation of X given
Y = y for all y if and only if φ is integrable with respect to the distribution Y(P) of Y and

∫_B φ(y) dY(P)(y) = ∫_{(Y∈B)} X dP   (B ∈ E).

Proof. Firstly, assume that φ defines a conditional expectation of X given Y = y for all y.
Then we have E(X|Y) = φ ◦ Y, so

∫_E |φ(y)| dY(P)(y) = ∫_Ω |φ ◦ Y| dP = ∫_Ω |E(X|Y)| dP = E|E(X|Y)| < ∞ ,

and we have shown that φ is Y(P)–integrable. Above, the first equality is a result of the
Change–of–variable Formula. Similarly we obtain for all B ∈ E

∫_B φ(y) dY(P)(y) = ∫_{(Y∈B)} E(X|Y) dP = ∫_{(Y∈B)} X dP .

Thereby the "only if" claim is obtained.

Conversely, assume that φ is integrable with respect to Y(P) and that

∫_B φ(y) dY(P)(y) = ∫_{(Y∈B)} X dP

for all B ∈ E.

Firstly, we note that φ ◦ Y is σ(Y)–measurable (as a result of the trivial implication in Lemma
4.3.2). Furthermore we have

∫_Ω |φ ◦ Y| dP = ∫_E |φ(y)| dY(P)(y) < ∞ ,

using the change–of–variable formula again (simply the argument from above backwards).
Finally, for D ∈ σ(Y) we have B ∈ E with D = (Y ∈ B), so

∫_D φ ◦ Y dP = ∫_B φ(y) dY(P)(y) = ∫_D X dP ,

where we have used the assumption. This shows that φ ◦ Y is a conditional expectation of
X given Y, so φ ◦ Y = E(X|Y). From that we have by definition that φ(y) is a conditional
expectation of X given Y = y.

4.4 Exercises

Exercise 4.1. Assume that ν1 and ν2 are bounded, signed measures. Show that αν1 + βν2
is a bounded, signed measure as well, when α, β ∈ R are real–valued constants, using the
(obvious) definition
(αν1 + βν2 )(A) = αν1 (A) + βν2 (A) .

Note that the definition ν ≪ µ also makes sense if µ is a positive measure (not necessarily
bounded).

Exercise 4.2. Let τ be the counting measure on N0 = N ∪ {0} (equipped with the σ–algebra
P(N0) that contains all subsets). Let µ be the Poisson distribution with parameter λ:

µ({n}) = (λ^n / n!) e^{−λ}   for n ∈ N0 .

Show that µ ≪ τ. Does µ have a density f with respect to τ? In that case find f.

Now let ν be the binomial distribution with parameters (N, p). Decide whether µ ≪ ν and/or
ν ≪ µ. ◦

Exercise 4.3. Assume that µ is a bounded, positive measure and that ν1, ν2 ≪ µ are
bounded, signed measures. Show

d(ν1 + ν2)/dµ = dν1/dµ + dν2/dµ   µ–a.e.

Exercise 4.4. Assume that π, µ are bounded, positive measures and that ν is a bounded,
signed measure, such that ν ≪ π ≪ µ. Show that

dν/dµ = (dν/dπ)(dπ/dµ)   µ–a.e.

Exercise 4.5. Assume that µ, ν are bounded, positive measures such that ν ≪ µ and µ ≪ ν.
Show

dν/dµ = (dµ/dν)⁻¹

both ν–a.e. and µ–a.e. ◦

Exercise 4.6. Assume that ν is a bounded, signed measure and that µ is a σ–finite measure
with ν ≪ µ. Show that there exists f ∈ L(µ) such that ν = f · µ (meaning ν(F) = ∫_F f dµ).

In the following exercises we assume that all random variables are defined on a probability
space (Ω, F, P ).

Exercise 4.7. Let X and Y be random variables with E|X| < ∞ and E|Y | < ∞ that are
both measurable with respect to some sub σ–algebra D. Assume furthermore that
∫_D X dP = ∫_D Y dP   for all D ∈ D .

Show that X = Y a.s. ◦

Exercise 4.8. Assume that X1 and X2 are independent random variables satisfying X1 ∼
exp(β) and X2 ∼ N (0, 1). Define Y = X1 + X2 and the sub σ–algebra D by D = σ(X1 ).
Show that E(Y |D) = X1 a.s. ◦

Exercise 4.9. Assume that X is a real random variable with EX 2 < ∞ and that D is some
sub σ–algebra. Let Y = E(X|D). Show that

X(P ) = Y (P ) ⇔ X = Y a.s.

Exercise 4.10. Let X and Y be random variables with EX 2 < ∞ and EY 2 < ∞. The
conditional variance of X given the sub σ–algebra D is defined by

V (X|D) = E(X 2 |D) − (E(X|D))2

and the conditional covariance between X and Y given D is

Cov(X, Y |D) = E(XY |D) − E(X|D)E(Y |D)

Show that

V (X) = E(V (X|D)) + V (E(X|D))


Cov(X, Y ) = E(Cov(X, Y |D)) + Cov(E(X|D), E(Y |D))



Exercise 4.11. Let X be a real random variable with E|X| < ∞. Let D be a sub σ–algebra.
Show without referring to (9) in Theorem 4.2.6 that

(E(X|D))+ ≤ E(X + |D) a.s.

Exercise 4.12. Let X be a real random variable with E|X| < ∞. Let D be a sub σ–algebra.
Show without referring to (9) in Theorem 4.2.6 that

|E(X|D)| ≤ E(|X| |D) a.s.

Exercise 4.13. Let (Ω, F, P ) = ((0, 1), B, λ) (where λ is the Lebesgue measure on (0, 1)).
Define the real random variable X by

X(ω) = ω

and
D = {D ⊆ (0, 1) | D or Dc is countable} .

Then D is a sub σ–algebra of B (you can show this if you want...). Find a version of E(X|D).

Exercise 4.14. Let (Ω, F, P ) = ([0, 1], B, λ), where λ is the Lebesgue measure on [0, 1].
Consider the two real valued random variables

X1 (ω) = 1 − ω X2 (ω) = ω 2

Show that for any given real random variable Y it holds that E(Y |X1 ) = E(Y |X2 ).

Show by giving an example that E(Y |X1 = x) and E(Y |X2 = x) may be different on a set
of x’s with positive Lebesgue measure. ◦

Exercise 4.15. Assume that X is a real random variable with E|X| < ∞ and that D is a
sub σ–algebra of F. Assume that Y is a D–measurable real random variable with E|Y | < ∞
that satisfies
E(X) = E(Y )

and

∫_D Y dP = ∫_D X dP

for all D ∈ G, where G is a ∩–stable set of subsets of Ω with σ(G) = D.

Show that Y is a conditional expectation of X given D. ◦

Exercise 4.16. Let X = (X1 , X2 , . . .) be a stochastic process, and assume that Y and Z
are real random variables, such that (Z, Y ) is independent of X. Assume that Y has finite
expectation.

(1) Show that

∫_{(Z∈B,X∈C)} E(Y|Z) dP = ∫_{(Z∈B,X∈C)} Y dP

for all B ∈ B and C ∈ B∞.

(2) Show that


E(Y |Z) = E(Y |Z, X)

Exercise 4.17. Let X1 , X2 , . . . be independent and identically distributed random variables


with E|X1 | < ∞. Define Sn = X1 + · · · + Xn .

(1) Show that E(X1 |Sn ) = E(X1 |Sn , Sn+1 , Sn+2 , . . .) a.s.
(2) Show that (1/n) Sn = E(X1|Sn, Sn+1, Sn+2, . . .) a.s.

Exercise 4.18. Assume that (X, Y) follows the two–dimensional Normal distribution with
mean vector (µ1, µ2) and covariance matrix

( σ11  σ12 )
( σ21  σ22 )

where σ12 = σ21. Then X ∼ N(µ1, σ11), Y ∼ N(µ2, σ22) and Cov(X, Y) = σ12.

Show that
E(X|Y ) = µ1 + β(Y − µ2 ) ,
where β = σ12 /σ22 . ◦
Chapter 5

Martingales

In this chapter we will present the classical theory of martingales. Martingales are sequences
of real random variables, where the index set N (or [0, ∞)) is regarded as a time line, and
where – conditionally on the present level – the level of the sequence at a future time point
is expected to be as the current level. So it is sequences that evolves over time without a
drift in any direction. Similarly, submartingales are expected to have the same or a higher
level at future time points, conditioned on the present level.

In Section 5.1 we will give an introduction to the theory based on a motivating example from
gambling theory. The basic definitions will be presented in Section 5.2 together with results
on the behaviour of martingales observed at random time points. The following Section 5.3
will mainly address the very important martingale theorem, giving conditions under which
martingales and submartingales converge. In Section 5.4 we shall introduce the concept of
uniform integrability and see how this interplays with martingales. Finally, in Section 5.5
we will prove a central limit theorem for martingales. That is a result that relaxes the
independence assumption from Section 3.5.

5.1 Introduction to martingale theory

Let Y1 , Y2 , . . . be mutually independent identically distributed random variables with

P (Yn = 1) = 1 − P (Yn = −1) = p

where 0 < p < 1. We will think of Yn as the result of a game where the probability of winning
is p, and where, if you bet 1 dollar and win, you receive 1 dollar, and if you lose, you lose
the 1 dollar you bet.

If you bet 1 dollar, then the expected winnings in each game are

EYn = p − (1 − p) = 2p − 1,

and the game is called favourable if p > 1/2, fair if p = 1/2 and unfavourable if p < 1/2,
corresponding to whether EYn is > 0, = 0 or < 0, respectively.

If the player in each game makes a bet of 1, his (signed) winnings after n games will be
Sn = Y1 + · · · + Yn. According to the strong law of large numbers,

(1/n) Sn → 2p − 1   a.s.,

so it follows that if the game is favourable, the player is certain to win in the long run (Sn > 0
eventually, almost surely) and if the game is unfavourable, the player is certain to lose in the
long run.
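The long-run behaviour is easy to see in a simulation; the following sketch uses an illustrative p = 0.4 (an unfavourable game) and a fixed seed for reproducibility:

```python
# Simulation of S_n / n -> 2p - 1 (strong law of large numbers): with
# p = 0.4 the game is unfavourable and the average winnings approach -0.2.
import random

random.seed(0)
p = 0.4
n = 200_000
S = sum(1 if random.random() < p else -1 for _ in range(n))
# the empirical mean is close to the a.s. limit 2p - 1 = -0.2
assert abs(S / n - (2 * p - 1)) < 0.01
```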

Undoubtedly, in practice, it is only possible to participate in unfavourable games (unless you


happen to be, e.g, a casino or a state lottery). Nevertheless, it may perhaps be possible for
a player to turn an unfavourable game into a favourable one, by choosing his bets in a clever
fashion. Assume that the player has a starting capital of X0 > 0, where X0 is a constant. Let
Fn = σ(Y1 , . . . , Yn ) for n ≥ 1. A strategy is a sequence (φn )n≥1 of functions

φn : (0, ∞) × {−1, 1}n−1 → [0, ∞),

such that the value of


φn (X0 , y1 , . . . , yn−1 )
is the amount the player will bet in the n’th game, when the starting capital is X0 and
y1 , . . . , yn−1 are the results of the n − 1 first games (the observed values of Y1 , . . . , Yn−1 ).

A strategy thus allows for the player to take the preceding outcomes into account when he
makes his n’th bet.

Note that φ1 is given by X0 alone, making it constant. Further note that it is possible to
let φn = 0, corresponding to the player not making a bet, for instance because he or she has
been winning up to this point and therefore wishes to stop.

Given the strategy (φn ) the (signed) winnings in the n’th game become

Zn = Yn φn (X0 , Y1 , . . . , Yn−1 )

and the capital after the n’th game is


Xn = X0 + ∑_{k=1}^{n} Zk .

It easily follows that for all n, Xn is Fn -measurable, integrable and

E(Xn+1 |Fn ) = E(Xn + Yn+1 φn+1 (X0 , Y1 , . . . , Yn )|Fn )


= Xn + φn+1 (X0 , Y1 , . . . , Yn )E(Yn+1 |Fn )
= Xn + φn+1 (X0 , Y1 , . . . , Yn )(2p − 1).

The most important message here is that


E(Xn+1 |Fn ) ≥ Xn for p > 1/2, E(Xn+1 |Fn ) = Xn for p = 1/2, and E(Xn+1 |Fn ) ≤ Xn for p < 1/2, (5.1)
meaning that (Xn , Fn ) is a submartingale (≥), martingale (=) or a supermartingale (≤) (see
Definition 5.2.3 below).
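To make the setup concrete, the capital process under a strategy can be simulated; the following Python sketch is illustrative and not from the text (all function names and parameters are our own). For the constant strategy φk ≡ 1, the expected capital is EXn = X0 + n(2p − 1), which the Monte Carlo estimate should reproduce.

```python
import random

def play(p, strategy, X0=10.0, n_games=50, rng=None):
    """Simulate the capital X_n = X_0 + Z_1 + ... + Z_n for one player.

    strategy(history) returns the bet phi_k for the next game, where
    history is the list of outcomes observed so far (Y_1, ..., Y_{k-1})."""
    rng = rng or random.Random()
    X, history = X0, []
    for _ in range(n_games):
        bet = strategy(history)            # phi_k depends only on the past
        Y = 1 if rng.random() < p else -1  # P(Y_k = 1) = p
        X += bet * Y                       # winnings Z_k = Y_k * phi_k
        history.append(Y)
    return X

# Monte Carlo estimate of E[X_n] for the constant bet phi_k = 1 with p = 0.4:
# E[X_n] = X_0 + n(2p - 1) = 10 + 50 * (-0.2) = 0, so est should be near 0.
rng = random.Random(0)
est = sum(play(0.4, lambda h: 1.0, rng=rng) for _ in range(20000)) / 20000
```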

For instance, if p < 1/2, the conditional expected value of the capital after the n + 1’st game is
at most Xn , so the game with strategy (φn ) is at best fair. But what if one simply chooses
to focus on the development of the capital at points in time that are advantageous for the
player, and where he or she can just decide to quit?

With 0 < p < 1 infinitely many wins, Yn = 1 and, of course, infinitely many losses, Yn = −1,
will occur with probability 1. Let τk be the time of the k’th win, i.e.

τ1 = inf{n : Yn = 1}

and for k ≥ 1,
τk+1 = inf{n > τk : Yn = 1}.

Each τk can be shown to be a stopping time (see Definition 5.2.6 below) and Theorem 5.2.12
provides conditions for when ’(Xn ) is a supermartingale (fair or unfavourable)’ implies that

’(Xτk ) is a supermartingale’. The conditions of Theorem 5.2.12 are for instance met if (Xn )
is a supermartingale and we require, not unrealistically, that Xn ≥ a always, where a is some
given constant. (The player has limited credit and any bet made must, even if the player
loses, leave a capital of at least a). It is this result we phrase by stating that it is not possible
to turn an unfavourable game into a favourable one.

Even worse, if p ≤ 1/2 and we require that Xn ≥ a, it can be shown that if there is a minimum
amount that one must bet (if one chooses to play) and the player keeps playing, he or she
will eventually be ruined! (If p > 1/2 there will still be a strictly positive probability of ruin,
but it is also possible that the capital will grow beyond any number.)

The result just stated ’only’ holds under the assumption of, for instance, all Xn ≥ a. As we
shall see, it is in fact easy to specify strategies such that Xτk ↑ ∞ for k ↑ ∞. The problem
is that such strategies may well prove costly in the short run.

A classic strategy is to double the amount you bet every game until you win and then start
all over with a bet of, e.g., 1, i.e.
φ1 (X0 ) = 1, and for n ≥ 2,

φn (X0 , y1 , . . . , yn−1 ) = 2φn−1 (X0 , y1 , . . . , yn−2 ) if yn−1 = −1, and φn (X0 , y1 , . . . , yn−1 ) = 1 if yn−1 = 1.

If, say, τ1 = n, the player loses ∑_{k=1}^{n−1} 2^{k−1} in the first n − 1 games and wins 2^{n−1} in the n’th
game, resulting in the total winnings of

−∑_{k=1}^{n−1} 2^{k−1} + 2^{n−1} = 1.

Thus, at the random time τk the total amount won is k and the capital is

Xτk = X0 + k.

But if p is small, one may experience long strings of consecutive losses and Xn can become
very negative.
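The doubling strategy is easy to implement and check; the code below is an illustrative sketch (the helper names are ours, not from the text).

```python
def doubling_bet(history):
    """Doubling strategy: double the stake after each loss, reset to 1 after a win."""
    bet = 1.0
    for y in history:
        bet = bet * 2 if y == -1 else 1.0
    return bet

def capital_after(outcomes, X0=0.0):
    """Capital after playing the given outcome sequence with the doubling strategy."""
    X, history = X0, []
    for y in outcomes:
        X += doubling_bet(history) * y
        history.append(y)
    return X

# n - 1 losses followed by a single win always nets exactly +1:
print(capital_after([-1, -1, -1, -1, 1]))  # bets 1, 2, 4, 8, 16: -15 + 16 = 1.0
```

Note how the intermediate capital just before the win is −15: the short-run cost of the strategy grows exponentially with the length of the losing streak.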

In the next sections we shall - without referring to gambling - discuss sequences (Xn ) of
random variables for which the inequalities (5.1) hold. A main result is the martingale
convergence theorem (Theorem 5.3.2).

The proof presented here is due to the American probabilist J.L. Doob.

5.2 Martingales and stopping times

Let (Ω, F, P ) be a probability space and let (Fn )n≥1 be a sequence of sub σ-algebras of F,
which is increasing: Fn ⊆ Fn+1 for all n. We say that (Ω, F, Fn , P ) is a filtered probability
space with filtration (Fn )n≥1 . The interpretation of a filtration is that we think of n as a
point in time and Fn as consisting of the events that are decided by what happens up to and
including time n.

Now let (Xn )n≥1 be a sequence of random variables defined on (Ω, F).

Definition 5.2.1. The sequence (Xn ) is adapted to (Fn ) if Xn is Fn -measurable for all n.

Instead of writing that (Xn ) is adapted to (Fn ) we often write that (Xn , Fn ) is adapted.

Example 5.2.2. Assume that (Xn ) is a sequence of random variables defined on (Ω, F, P ),
and define for each n ∈ N the σ–algebra Fn by

Fn = σ(X1 , . . . , Xn ) .

Then (Fn ) is a filtration on (Ω, F, P ) and (Xn , Fn ) is adapted. ◦

Definition 5.2.3. An adapted sequence (Xn , Fn ) of real random variables is called a mar-
tingale if for all n ∈ N it holds that E|Xn | < ∞ and

E(Xn+1 |Fn ) = Xn a.s. ,

and a submartingale if for all n ∈ N it holds that E|Xn | < ∞ and

E(Xn+1 |Fn ) ≥ Xn a.s. ,

and a supermartingale if (−Xn , Fn ) is a submartingale.

Note that if (Xn , Fn ) is a submartingale (martingale), then for all m < n ∈ N

E(Xn |Fm ) = E(E(Xn |Fn−1 )|Fm ) ≥ E(Xn−1 |Fm ) = E(E(Xn−1 |Fn−2 )|Fm ) ≥ . . . ≥ Xm

(with equality throughout in the martingale case), and in particular EXn ≥ EXm (respectively EXn = EXm ).

The following lemma will be very useful and has a corollary that gives an equivalent formu-
lation of the submartingale (martingale) property.

Lemma 5.2.4. Suppose that X and Y are real random variables with E|X| < ∞ and E|Y | <
∞, and let D be a sub σ–algebra of F such that X is D–measurable. Then

E(Y |D) ≥ X a.s. (respectively =)

if and only if

∫_D Y dP ≥ ∫_D X dP (respectively =) for all D ∈ D .

Proof. Since both E(Y |D) and X are D–measurable, we have that

E(Y |D) ≥ X (respectively =) (5.2)

if and only if

∫_D E(Y |D) dP ≥ ∫_D X dP (respectively =)

for all D ∈ D. (The ”if” implication should be obvious. For ”only if”, consider the integral
∫_D (E(Y |D) − X) dP , where D = (E(Y |D) < X).) The left integral above equals ∫_D Y dP
because of the definition of conditional expectations, so (5.2) holds if and only if

∫_D Y dP ≥ ∫_D X dP (respectively =)

for all D ∈ D.

Corollary 5.2.5. Assume that (Xn , Fn ) is adapted with E|Xn | < ∞ for all n ∈ N. Then
(Xn , Fn ) is a submartingale (martingale) if and only if for all n ∈ N

∫_F Xn+1 dP ≥ ∫_F Xn dP (respectively =)

for all F ∈ Fn .

When handling martingales and submartingales it is often fruitful to study how they behave
at random time points of a special type called stopping times.

Definition 5.2.6. A stopping time is a random variable τ : Ω → N̄ = N ∪ {∞} such that

(τ = n) ∈ Fn

for all n ∈ N. If τ < ∞ we say that τ is a finite stopping time.



Example 5.2.7. Let (Xn ) be a sequence of real random variables, and define the filtration
(Fn ) by Fn = σ(X1 , . . . , Xn ). Assume that τ is a stopping time with respect to this filtration,
and consider the set (τ = n) that belongs to Fn . Since Fn is generated by the vector
(X1 , . . . , Xn ) there exists a set B ∈ Bn such that

(τ = n) = ((X1 , . . . , Xn ) ∈ B) .

The implication of this is, that we are able to read off from the values of X1 , . . . , Xn whether
τ = n or not. So by observing the sequence (Xn ) for some time, we know if τ has occurred
or not. ◦
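The point of the example can be mirrored in code (an illustrative sketch; the function name is ours, not from the text): whether τ = n is decided by inspecting only the first n observed values.

```python
import math

def hitting_time(xs, b):
    """First n (1-indexed) with xs[n-1] >= b, or math.inf if it never happens.

    The decision "tau = n" inspects only xs[:n], mirroring that (tau = n) is
    determined by (X_1, ..., X_n) when F_n = sigma(X_1, ..., X_n)."""
    for n, x in enumerate(xs, start=1):
        if x >= b:
            return n
    return math.inf

print(hitting_time([0.2, 0.9, 1.4, 0.3], b=1.0))  # -> 3
```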

We have an equivalent definition of a stopping time:

Lemma 5.2.8. A random variable τ : Ω → N̄ is a stopping time if and only if

(τ ≤ n) ∈ Fn

for all n ∈ N.

Proof. Firstly, assume that τ is a stopping time. We can write

(τ ≤ n) = ∪_{k=1}^{n} (τ = k)

which belongs to Fn , since each (τ = k) ∈ Fk and the filtration (Fn ) is increasing, so Fk ⊆ Fn


for k ≤ n.

Assume conversely, that (τ ≤ n) ∈ Fn for all n. Then the stopping time property follows
from
(τ = n) = (τ ≤ n)\(τ ≤ n − 1) ,

since (τ ≤ n) ∈ Fn and (τ ≤ n − 1) ∈ Fn−1 ⊆ Fn .

Example 5.2.9. If n0 ∈ N then the constant function τ = n0 is a stopping time:

(τ = n) ∈ {∅, Ω} ⊆ Fn for all n ∈ N .

If σ and τ are stopping times then σ ∧ τ, σ ∨ τ and σ + τ are also stopping times. E.g. for
σ ∧ τ write
(σ ∧ τ ≤ n) = (σ ≤ n) ∪ (τ ≤ n)

and note that (σ ≤ n), (τ ≤ n) ∈ Fn . ◦



We now define a σ-algebra Fτ , which consists of all the events that are decided by what
happens up to and including the random time τ .

Consider for τ a stopping time the collection of sets

Fτ = {F ∈ F : F ∩ (τ = n) ∈ Fn for all n ∈ N}.

Then we have

Theorem 5.2.10. (1) Fτ is a σ-algebra.

(2) If σ, τ are stopping times and σ ≤ τ , then Fσ ⊆ Fτ .

Proof. (1) We have


Ω ∩ (τ = n) = (τ = n) ∈ Fn for all n ∈ N ,
since τ is a stopping time. Hence Ω ∈ Fτ . Now assume that F ∈ Fτ . Then by definition
F ∩ (τ = n) ∈ Fn for all n ∈ N, so

F c ∩ (τ = n) = (τ = n) \ (F ∩ (τ = n)) ∈ Fn

for all n ∈ N. This shows that F c ∈ Fτ . Finally assume that F1 , F2 , . . . ∈ Fτ . Then

( ∩_{k=1}^{∞} Fk ) ∩ (τ = n) = ∩_{k=1}^{∞} ( Fk ∩ (τ = n) ) ∈ Fn

for all n ∈ N.

Altogether it is shown that Fτ is a σ–algebra.

(2) Assume that F ∈ Fσ . Since σ ≤ τ we can write


F ∩ (τ = n) = ∪_{k=1}^{n} F ∩ (σ = k, τ = n) = ∪_{k=1}^{n} ( F ∩ (σ = k) ) ∩ (τ = n) ∈ Fn

using that for k ≤ n we have Fk ⊆ Fn and F ∩ (σ = k) ∈ Fk . Hence by definition we have


F ∈ Fτ .

With τ a finite stopping time, we consider the process (Xn ) at the random time τ and define

Xτ (ω) = Xτ (ω) (ω).



From now on we only consider real-valued Xn ’s.

Although the definition of Fτ may not seem very obvious, Theorem 5.2.11 below shows that both
Xτ and τ are Fτ –measurable. Hence certain events at time τ are Fτ –measurable, and the
intuitive interpretation of Fτ as consisting of all events up to time τ is still reasonable.

Theorem 5.2.11. If (Xn , Fn ) is adapted and τ is a finite stopping time, then both τ and
Xτ are Fτ -measurable.

Proof. The proof consists of straightforward manipulations. If we consider the event (τ = k) then we
have

(τ = k) ∩ (τ = n) = (τ = n) if k = n, and (τ = k) ∩ (τ = n) = ∅ if k ≠ n,

so in both cases we get that (τ = k) ∩ (τ = n) ∈ Fn , and hence (τ = k) ∈ Fτ . This shows


the measurability of τ .

For the second statement, let B ∈ B and realize that for all n,

(Xτ ∈ B) ∩ (τ = n) = (Xn ∈ B) ∩ (τ = n) ∈ Fn ,

implying that (Xτ ∈ B) ∈ Fτ as desired.
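In discrete form the definition Xτ (ω) = Xτ (ω) (ω) is a per-ω lookup; a minimal illustrative sketch (the names are ours, not from the text):

```python
def evaluate_at_stopping_time(paths, taus):
    """X_tau(omega) = X_{tau(omega)}(omega): for each sample point omega, read
    that omega's path at omega's own (finite, 1-indexed) stopping time."""
    return [path[t - 1] for path, t in zip(paths, taus)]

# Two sample points omega, each with its own path and its own value of tau:
print(evaluate_at_stopping_time([[5, 2, 7], [1, 4, 9]], [2, 3]))  # -> [2, 9]
```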

Theorem 5.2.12 (Optional sampling, first version). Let (Xn , Fn ) be a submartingale (mar-
tingale) and assume that σ and τ are finite stopping times with σ ≤ τ . If E|Xτ | < ∞,
E|Xσ | < ∞ and
lim inf_{N→∞} ∫_{(τ>N)} XN+ dP = 0

(respectively lim inf_{N→∞} ∫_{(τ>N)} |XN | dP = 0),

then

E(Xτ |Fσ ) ≥ Xσ (respectively =).

Proof. According to Lemma 5.2.4 we only need to show that

∫_A Xτ dP ≥ ∫_A Xσ dP (respectively =) (5.3)

for all A ∈ Fσ . So we fix A in the following and define Dj = A ∩ (σ = j). If we can show
∫_{Dj} Xτ dP ≥ ∫_{Dj} Xσ dP (respectively =) (5.4)

for all j = 1, 2, . . ., then (5.3) will follow since then


∫_A Xτ dP = ∑_{j=1}^{∞} ∫_{Dj} Xτ dP ≥ ∑_{j=1}^{∞} ∫_{Dj} Xσ dP = ∫_A Xσ dP (respectively =).

In the two equalities we have used dominated convergence: E.g. for the first equality we have
the integrable upper bound |Xτ |, so
∫_A Xτ dP = ∫ ∑_{j=1}^{∞} 1_{Dj} Xτ dP = ∫ lim_{M→∞} ∑_{j=1}^{M} 1_{Dj} Xτ dP
= lim_{M→∞} ∫ ∑_{j=1}^{M} 1_{Dj} Xτ dP = lim_{M→∞} ∑_{j=1}^{M} ∫ 1_{Dj} Xτ dP = ∑_{j=1}^{∞} ∫_{Dj} Xτ dP

Hence the argument will be complete if we can show (5.4). For this, first define for N ≥ j
IN = ∫_{Dj∩(τ≤N)} Xτ dP + ∫_{Dj∩(τ>N)} XN dP .

We claim that
Ij ≤ Ij+1 ≤ Ij+2 ≤ . . . (respectively =).

For N > j we get


IN = ∫_{Dj∩(τ≤N)} Xτ dP + ∫_{Dj∩(τ>N)} XN dP

= ∫_{Dj∩(τ<N)} Xτ dP + ∫_{Dj∩(τ=N)} Xτ dP + ∫_{Dj∩(τ>N)} XN dP

= ∫_{Dj∩(τ<N)} Xτ dP + ∫_{Dj∩(τ=N)} XN dP + ∫_{Dj∩(τ>N)} XN dP

= ∫_{Dj∩(τ<N)} Xτ dP + ∫_{Dj∩(τ≥N)} XN dP

Note that (τ ≥ N ) = (τ ≤ N − 1)c ∈ FN −1 . Also note that Dj = A ∩ (σ = j) ∈ Fj


from the definition of Fσ (since it is assumed that A ∈ Fσ ). Then (recall that j < N )
Dj ∩ (τ ≥ N ) ∈ FN−1 , so Lemma 5.2.4 yields

∫_{Dj∩(τ≥N)} XN dP ≥ ∫_{Dj∩(τ≥N)} XN−1 dP (respectively =)

So we have
IN = ∫_{Dj∩(τ<N)} Xτ dP + ∫_{Dj∩(τ≥N)} XN dP

≥ ∫_{Dj∩(τ≤N−1)} Xτ dP + ∫_{Dj∩(τ>N−1)} XN−1 dP = IN−1 (respectively =)

and thereby the sequence (IN )N ≥j is shown to be increasing. For the left hand side in (5.4)
this implies that
∫_{Dj} Xτ dP = ∫_{Dj∩(τ≤N)} Xτ dP + ∫_{Dj∩(τ>N)} Xτ dP + ∫_{Dj∩(τ>N)} XN dP − ∫_{Dj∩(τ>N)} XN dP

= IN + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN dP

≥ Ij + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN dP

≥ Ij + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN+ dP (5.5)

(respectively, in the martingale case, = Ij + ∫_{Dj∩(τ>N)} Xτ dP − ∫_{Dj∩(τ>N)} XN dP ).

Recall the assumption σ ≤ τ . Then

Dj ∩ (τ ≤ j) = A ∩ (σ = j) ∩ (τ ≤ j) = Dj ∩ (τ = j) ,

so
Ij = ∫_{Dj∩(τ≤j)} Xτ dP + ∫_{Dj∩(τ>j)} Xj dP

= ∫_{Dj∩(τ=j)} Xj dP + ∫_{Dj∩(τ>j)} Xj dP

= ∫_{Dj} Xj dP = ∫_{A∩(σ=j)} Xj dP = ∫_{Dj} Xσ dP (5.6)

Hence we have shown (5.4) if we can show that the two last terms in (5.5) can be ignored.
Since (τ > N ) ↓ ∅ for N → ∞ and Xτ is integrable, we have from dominated convergence
that
lim_{N→∞} ∫_{Dj∩(τ>N)} Xτ dP = lim_{N→∞} ∫ 1_{Dj∩(τ>N)} Xτ dP = ∫ lim_{N→∞} 1_{Dj∩(τ>N)} Xτ dP = 0 (5.7)

And because of the assumption from the theorem, we must have a subsequence of natural

numbers N1 , N2 , . . . such that

lim_{ℓ→∞} ∫_{(τ>Nℓ)} XNℓ+ dP = 0

(respectively lim_{ℓ→∞} ∫_{(τ>Nℓ)} |XNℓ | dP = 0).
Hence from using (5.7) we have

lim_{ℓ→∞} ( ∫_{Dj∩(τ>Nℓ)} Xτ dP − ∫_{Dj∩(τ>Nℓ)} XNℓ+ dP ) = 0

and combining this with (5.5) and (5.6) yields

∫_{Dj} Xτ dP ≥ Ij = ∫_{Dj} Xσ dP (respectively =),

which is (5.4).

Corollary 5.2.13. Let (Xn , Fn ) be a submartingale (martingale), and let σ ≤ τ be bounded


stopping times. Then E|Xτ | < ∞, E|Xσ | < ∞ and

E(Xτ |Fσ ) ≥ Xσ (respectively =).

Proof. We show that the conditions from Theorem 5.2.12 are fulfilled. There exists K < ∞
such that supω∈Ω τ (ω) ≤ K. Then
E|Xτ | = ∫ | ∑_{k=1}^{K} 1_{(τ=k)} Xk | dP ≤ ∑_{k=1}^{K} ∫ 1_{(τ=k)} |Xk | dP ≤ ∑_{k=1}^{K} ∫ |Xk | dP = ∑_{k=1}^{K} E|Xk | < ∞ .

That E|Xσ | < ∞ follows similarly. Furthermore it must hold that (τ > N ) = ∅ for all
N ≥ K, so obviously

∫_{(τ>N)} XN+ dP = 0

for all N ≥ K.
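Corollary 5.2.13 can be checked by simulation (an illustrative sketch with made-up parameters, not from the text): for the symmetric ±1 random walk, which is a martingale, and a bounded stopping time, E(Xτ ) should equal EX1 = 0.

```python
import random

def stopped_walk(rng, K=200, level=5):
    """One path of a symmetric +-1 random walk S_n, evaluated at the bounded
    stopping time tau = min(first n with |S_n| >= level, K)."""
    S = 0
    for _ in range(K):
        S += 1 if rng.random() < 0.5 else -1
        if abs(S) >= level:
            break                 # tau occurred before the cap K
    return S

rng = random.Random(1)
est = sum(stopped_walk(rng) for _ in range(50000)) / 50000
# est should be close to 0, the common expectation of the stopped martingale
```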

We can also translate Theorem 5.2.12 into a result concerning the process considered at a
sequence of stopping times. Firstly, we need to specify the sequence of stopping times.

Definition 5.2.14. A sequence (τn )n≥1 of positive random variables is a sequence of sam-
pling times if it is increasing and each τn is a finite stopping time.

With (τn ) a sequence of sampling times it holds, according to 1. in Theorem 5.2.10, that
Fτn ⊆ Fτn+1 for all n. If (Xn , Fn ) is adapted then, according to Theorem 5.2.11, Xτn is
Fτn -measurable. Hence, the sampled sequence (Xτn , Fτn ) is adapted.

Theorem 5.2.15. Let (Xn , Fn ) be a submartingale (martingale) and let (τk ) be a sequence
of sampling times. If
E|Xτk | < ∞ for all k, (a)

lim inf_{N→∞} ∫_{(τk>N)} XN+ dP = 0 for all k, (b)

(respectively lim inf_{N→∞} ∫_{(τk>N)} |XN | dP = 0 for all k),

then (Xτk , Fτk ) is a submartingale (martingale).

Proof. Use Theorem 5.2.12 for each k separately.

5.3 The martingale convergence theorem

We shall need the following result on transformations of martingales and submartingales.

Lemma 5.3.1. Assume that (Xn , Fn ) is an adapted sequence.

(1) If (Xn , Fn ) is a martingale, then both (|Xn |, Fn ) and (Xn+ , Fn ) are submartingales.

(2) If (Xn , Fn ) is a submartingale, then (Xn+ , Fn ) is a submartingale.

Proof. (1) Assume that (Xn , Fn ) is a martingale. Then Xn is Fn –measurable, so also


|Xn | and Xn+ are Fn –measurable. Furthermore we have E(Xn+1 |Fn ) = Xn a.s., so that
E(Xn+1 |Fn )+ = Xn+ a.s. and |E(Xn+1 |Fn )| = |Xn | a.s. We also have EXn+ < ∞ (recall
that 0 ≤ Xn+ ≤ |Xn |) and obviously E|Xn | < ∞.

Then
E((Xn+1 )+ |Fn ) ≥ E(Xn+1 |Fn )+ = Xn+ a.s. ,
where the inequality follows from (9) in Theorem 4.2.6, since the function x ↦ x+ is convex.
Similarly, since x ↦ |x| is convex, (9) in Theorem 4.2.6 gives that

E(|Xn+1 ||Fn ) ≥ |E(Xn+1 |Fn )| = |Xn | a.s.



which proves (1).

(2) If (Xn , Fn ) is a submartingale, we similarly have that Xn+ is Fn –measurable with


EXn+ < ∞. Furthermore it holds

E(Xn+1 |Fn ) ≥ Xn a.s. ,

and since x ↦ x+ is increasing, we also have

E(Xn+1 |Fn )+ ≥ Xn+ a.s. .

We obtain
E((Xn+1 )+ |Fn ) ≥ E(Xn+1 |Fn )+ ≥ Xn+ a.s.

We shall now prove the main theorem of classic martingale theory.

Theorem 5.3.2 (The martingale convergence theorem). If (Xn , Fn ) is a submartingale such


that sup_n EXn+ < ∞, then X = lim_{n→∞} Xn exists almost surely and E|X| < ∞.

The proof is given below. Note that, cf. Lemma 5.3.1, the sequence (EXn+ ) is increasing, so
the assumption sup_n EXn+ < ∞ is equivalent to assuming lim_{n→∞} EXn+ < ∞.

The proof is based on a criterion for convergence of a sequence of real numbers, which we
shall now discuss.

Let (xn )n≥1 be a sequence of real numbers. For a < b consider

n1 = inf{n ≥ 1 | xn ≥ b}, m1 = inf{n > n1 | xn ≤ a}

and recursively define

nk = inf{n > mk−1 | xn ≥ b}, mk = inf{n > nk | xn ≤ a},

always using the convention inf ∅ = ∞. We now define the number of down-crossings from
b to a for the sequence (xn ) as +∞ if all mk < ∞ and as k if mk < ∞ and mk+1 = ∞ (in
particular, 0 down-crossings in the case m1 = ∞). Note that n1 ≤ m1 ≤ n2 ≤ m2 . . . with
equality only possible, if the common value is ∞.
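The recursion for nk and mk translates directly into a counting routine; the following Python sketch (names ours, not from the text) counts completed down-crossings of a finite sequence.

```python
def downcrossings(xs, a, b):
    """Count down-crossings from b to a in the finite sequence xs: wait for a
    value >= b (time n_k), then for a later value <= a (time m_k); each
    completed pair n_k < m_k is one down-crossing."""
    count, waiting_for_b = 0, True
    for x in xs:
        if waiting_for_b and x >= b:
            waiting_for_b = False      # level b reached; now wait for level a
        elif not waiting_for_b and x <= a:
            count += 1                 # completed one down-crossing b -> a
            waiting_for_b = True
    return count

print(downcrossings([0, 2, 0, 2, 0, 2], a=0, b=2))  # -> 2
```

A convergent sequence eventually stays on one side of any pair a < b, so its count is finite for every rational pair, in line with Lemma 5.3.3 below.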

Lemma 5.3.3. The limit x = lim_{n→∞} xn exists (as a limit in R̄ = [−∞, ∞]) if and only if for
all rational a < b it holds that the number of down-crossings from b to a for (xn ) is finite.

Proof. Firstly, consider the ”only if” implication. So assume that the limit x = limn→∞ xn
exists in R̄ and let a < b be given. We must have that either x > a or x < b (if not both of
them are true). If for instance x > a then we must have some n0 such that

xn > a for all n ≥ n0 ,

since e.g. there exists n0 with |x − xn | < x − a for all n ≥ n0 . But then we must have that
mk = ∞ for k large enough, which makes the number of down–crossings finite. The case
x < b is treated analogously.

Now we consider the ”if” part of the result, so assume that the limit limn→∞ xn does not
exist. Then lim inf n→∞ xn < lim supn→∞ xn , so in particular we can find rational numbers
a < b such that
lim inf xn < a < b < lim sup xn .
n→∞ n→∞
This implies that xn ≤ a for infinitely many n and xn ≥ b for infinitely many n. Especially
we must have that the number of down–crossings from b to a is infinite.

In the following proofs we shall apply the result to a sequence (Xn ) of real random variables.
For this we will need some notation similar to the definition of nk and mk above. Define the
random times
σ1 = inf{n | Xn ≥ b}, τ1 = inf{n > σ1 | Xn ≤ a},
and recursively

σk = inf{n > τk−1 | Xn ≥ b}, τk = inf{n > σk | Xn ≤ a}.

We have that σ1 ≤ τ1 ≤ σ2 ≤ τ2 ≤ . . . , where equality is only possible if the common value


is ∞.

Furthermore, all τk and σk are stopping times: We can write


(σ1 = n) = ( ∩_{i=1}^{n−1} (Xi < b) ) ∩ (Xn ≥ b)

which is in Fn , since all X1 , . . . , Xn are Fn –measurable, implying that (Xi < b) ∈ Fn (when
i < n) and (Xn ≥ b) ∈ Fn . Furthermore if e.g. σk is a stopping time, then

(τk = n) = ∪_{j=1}^{n−1} (σk = j) ∩ (Xj+1 > a, . . . , Xn−1 > a, Xn ≤ a)

which belongs to Fn , since (σk = j) ∈ Fn and all the X–variables involved are Fn –
measurable. Hence τk is a stopping time as well.

Let βab (ω) be the number of down–crossings from b to a for the sequence (Xn (ω)) for each
ω ∈ Ω. Then we have (we suppress the ω)

βab = ∑_{k=1}^{∞} 1_{(τk<∞)}

so we see that βab is an integer–valued random variable. With this notation, we can also
define the number of down–crossings βab^N from b to a in the time interval {1, . . . , N } as

βab^N = ∑_{k=1}^{∞} 1_{(τk≤N)} = ∑_{k=1}^{N} 1_{(τk≤N)} .

The last equality follows since we necessarily must have τN +1 ≥ N + 1 (either τN +1 = ∞ or


the strict inequality holds, making σ1 < τ1 < · · · < σN +1 < τN +1 ).

Finally note that βab^N ↑ βab as N → ∞.

Proof of Theorem 5.3.2. In order to show that X = lim Xn exists as a limit in R it is,
according to Lemma 5.3.3, sufficient to show that

P (βab is finite for all rational a < b) = 1 .

But it follows directly from the down–crossing lemma (Lemma 5.3.4), that for all rational
pairs a < b we have P (βab < ∞) = 1. Hence also
1 = P ( ∩_{a<b rational} (βab < ∞) ) = P (βab is finite for all rational a < b) .

We still need to show that E|X| < ∞. In the affirmative, we will also know that X is finite
almost surely. Otherwise we would have

P (|X| = ∞) = ε > 0

such that

E|X| ≥ E(1(|X|=∞) · ∞) = ∞ · ε = ∞,

which is a contradiction.

So we prove that E|X| < ∞. Fatou’s Lemma yields

E|X| = E(lim_{n→∞} |Xn |) = E(lim inf_{n→∞} |Xn |) ≤ lim inf_{n→∞} E|Xn |

Note that since (Xn , Fn ) is a submartingale, then

EX1 ≤ EXn = EXn+ − EXn− .

which implies that EXn− ≤ EXn+ − EX1 such that

E|Xn | = EXn+ + EXn− ≤ 2EXn+ − EX1 .

Then we obtain that

E|X| ≤ lim inf_{n→∞} E|Xn | ≤ 2 lim inf_{n→∞} EXn+ − EX1 = 2 lim_{n→∞} EXn+ − EX1 < ∞ ,

due to the assumption in Theorem 5.3.2.

The most significant and most difficult part of the proof of Theorem 5.3.2 is contained in the
next result.

Lemma 5.3.4 (The down–crossing lemma). If (Xn , Fn ) is a submartingale then it holds,


for all N ∈ N, a < b, that

Eβab^N ≤ E(XN − b)+ / (b − a) , (5.8)

Eβab ≤ (1 / (b − a)) lim_{n→∞} E(Xn − b)+ . (5.9)

In particular, βab < ∞ a.s. if limn→∞ EXn+ < ∞.

Proof. The last claim follows directly from (5.9) and the inequality (Xn − b)+ ≤ Xn+ + |b|,
since it is assumed in the theorem that sup EXn+ = lim EXn+ < ∞.

Note that for all N, k ∈ N it holds that 1(σk ≤N ) = 1(τk ≤N ) + 1(σk ≤N <τk ) . Then we can write

∑_{k=1}^{N} (Xτk∧N − Xσk∧N ) 1_{(σk≤N)} = ∑_{k=1}^{N} (Xτk∧N − Xσk∧N ) (1_{(τk≤N)} + 1_{(σk≤N<τk)})

= ∑_{k=1}^{N} (Xτk − Xσk ) 1_{(τk≤N)} + ∑_{k=1}^{N} (XN − Xσk ) 1_{(σk≤N<τk)}

≤ (a − b) ∑_{k=1}^{N} 1_{(τk≤N)} + (XN − b) ∑_{k=1}^{N} 1_{(σk≤N<τk)}

≤ (a − b) βab^N + (XN − b)+ .

In the first inequality we have used, that if τk < ∞, then Xσk ≥ b and Xτk ≤ a, and also if
only σk < ∞ it holds that Xσk ≥ b. By rearranging the terms above, we obtain the inequality
βab^N ≤ (XN − b)+ / (b − a) − (1 / (b − a)) ∑_{k=1}^{N} (Xτk∧N − Xσk∧N ) 1_{(σk≤N)} . (5.10)

Note that for each k we have that σk ∧ N ≤ τk ∧ N are bounded stopping times. Hence
Corollary 5.2.13 yields that E|Xσk ∧N | < ∞, E|Xτk ∧N | < ∞ and E(Xτk ∧N |Fσk ∧N ) ≥ Xσk ∧N .
Then according to Lemma 5.2.4 it holds that
∫_F Xτk∧N dP ≥ ∫_F Xσk∧N dP

for all F ∈ Fσk ∧N . Since


(σk ≤ N ) ∩ (σk ∧ N = n) = (σk = n) if n ≤ N, and = ∅ if n > N, which belongs to Fn

for all n ∈ N, we must have that (σk ≤ N ) ∈ Fσk∧N which implies that

E( (Xτk∧N − Xσk∧N ) 1_{(σk≤N)} ) = ∫_{(σk≤N)} Xτk∧N dP − ∫_{(σk≤N)} Xσk∧N dP ≥ 0 .

Then the sum in (5.10) has positive expectation, so in particular we have

Eβab^N ≤ E(XN − b)+ / (b − a) .
Finally note that (Xn − b, Fn ) is a submartingale, since (Xn , Fn ) is a submartingale. Hence
the sequence ((Xn − b)+ , Fn ) will be a submartingale as well according to 2. in Lemma
5.3.1, such that E(XN − b)+ is increasing and thereby convergent. So applying monotone
convergence to the left hand side (recall βab^N ↑ βab ) leads to

Eβab ≤ (1 / (b − a)) lim_{n→∞} E(Xn − b)+ .

It is useful to highlight the following immediate consequences of Theorem 5.3.2.

Corollary 5.3.5. 1. If (Xn , Fn ) is a submartingale and there exists c ∈ R such that Xn ≤ c


a.s. for all n, then X = lim Xn exists almost surely and E|X| < ∞.

2. If (Xn , Fn ) is a martingale and there exists c ∈ R such that either Xn ≤ c a.s. for all n
or Xn ≥ c a.s. for all n, then X = lim Xn exists almost surely and E|X| < ∞.
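A quick simulation (illustrative, not from the text) of a nonnegative martingale covered by part 2 of Corollary 5.3.5: Mn = Y1 · · · Yn with EYi = 1 is a martingale bounded below by 0, so it converges almost surely; here E log Yi < 0, so the limit is 0 even though EMn = 1 for all n.

```python
import random

def product_martingale(rng, n=2000):
    """M_n = Y_1 * ... * Y_n with Y uniform on {0.5, 1.5}: E[Y] = 1, so (M_n)
    is a nonnegative martingale; E[log Y] = (log 0.5 + log 1.5)/2 < 0, so the
    almost sure limit is 0 although E[M_n] = 1 for every n."""
    M = 1.0
    for _ in range(n):
        M *= 0.5 if rng.random() < 0.5 else 1.5
    return M

rng = random.Random(2)
finals = [product_martingale(rng) for _ in range(200)]
# by time n = 2000 every simulated path is extremely close to its limit 0
```

This also previews Section 5.4: here EMn = 1 for every n while the a.s. limit has mean 0, so expectations need not pass to the limit without further conditions.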

5.4 Martingales and uniform integrability

If (Xn , Fn ) is a martingale then in particular EXn = EX1 for all n, but even though Xn → X
a.s. and E|X| < ∞, it is not necessarily true that EX = EX1 . We shall later obtain some results
where not only EX = EX1 but where in addition, the martingale (Xn , Fn ) has a number of
other attractive properties. Let (Xn )n≥1 , X be real random variables with E|Xn | < ∞ for all n
and E|X| < ∞. Recall that by Xn → X in L1 we mean that

lim_{n→∞} E|Xn − X| = 0,

and if this is the case then

lim_{n→∞} | ∫_F Xn dP − ∫_F X dP | = 0

will hold for all F ∈ F (because | ∫_F Xn dP − ∫_F X dP | ≤ E|Xn − X|). In particular the
L1 –convergence gives that EX1 = EX. We are looking for a property that implies this
L1 –convergence.

Definition 5.4.1. A family (Xi )i∈I of real random variables is uniformly integrable if
E|Xi | < ∞ for all i ∈ I and
lim_{x→∞} sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP = 0.

Example 5.4.2. (1) The family {X} consisting of only one real variable is uniformly inte-
grable if E|X| < ∞:
lim_{x→∞} ∫_{(|X|>x)} |X| dP = lim_{x→∞} ∫ 1_{(|X|>x)} |X| dP = 0

by dominated convergence (since 1(|X|>x) |X| ≤ |X|, and |X| is integrable).

(2) Now consider a finite family (Xi )i=1,...,n of real random variables. This family is uniformly
integrable, if each E|Xi | < ∞:
lim_{x→∞} sup_{i} ∫_{(|Xi|>x)} |Xi | dP = lim_{x→∞} max_{i=1,...,n} ∫_{(|Xi|>x)} |Xi | dP = 0 ,

since each integral has limit 0 because of (1). ◦



Example 5.4.3. Let (Xi )i∈I be a family of real random variables. If supi∈I E|Xi |r < ∞ for
some r > 1, then (Xi ) is uniformly integrable:
∫_{(|Xi|>x)} |Xi | dP ≤ ∫_{(|Xi|>x)} |Xi | (|Xi |/x)^{r−1} dP = (1/x^{r−1}) ∫_{(|Xi|>x)} |Xi |^r dP

≤ (1/x^{r−1}) E|Xi |^r ≤ (1/x^{r−1}) sup_{j∈I} E|Xj |^r

so we obtain

sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP ≤ (1/x^{r−1}) sup_{j∈I} E|Xj |^r

which has limit 0 as x → ∞. ◦
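The chain of inequalities in the example can be verified numerically for one concrete variable (a made-up discrete example, not from the text): X = k with probability proportional to k^{-4} on k = 1, . . . , 10000, and r = 2.

```python
# Check int_{(|X|>x)} |X| dP <= E|X|^r / x^(r-1) for a concrete discrete X.
Z = sum(k**-4 for k in range(1, 10001))
prob = {k: k**-4 / Z for k in range(1, 10001)}

r, x = 2.0, 10.0
lhs = sum(k * pk for k, pk in prob.items() if k > x)         # tail integral
rhs = sum(k**r * pk for k, pk in prob.items()) / x**(r - 1)  # E|X|^r / x^(r-1)
assert lhs <= rhs
```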

We have the following equivalent definition of uniform integrability:

Theorem 5.4.4. The family (Xi )i∈I is uniformly integrable if and only if

(1) sup_{i∈I} E|Xi | < ∞,

(2) for all ε > 0 there exists a δ > 0 such that

sup_{i∈I} ∫_A |Xi | dP < ε

for all A ∈ F with P (A) < δ.

Proof. First we demonstrate the ”only if” statement. So assume that (Xi ) is uniformly
integrable. For all x > 0 we have for all i ∈ I that
E|Xi | = ∫_Ω |Xi | dP = ∫_{(|Xi|≤x)} |Xi | dP + ∫_{(|Xi|>x)} |Xi | dP

≤ xP (|Xi | ≤ x) + ∫_{(|Xi|>x)} |Xi | dP ≤ x + ∫_{(|Xi|>x)} |Xi | dP

so also

sup_{i∈I} E|Xi | ≤ x + sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP .

The last term is assumed to → 0 as x → ∞. In particular it is finite for x sufficiently large,


so we conclude that supi∈I E|Xi | is finite, which is (1).

To show (2) let ε > 0 be given. Then for all A ∈ F we have (with a similar argument to the
one above)

∫_A |Xi | dP = ∫_{A∩(|Xi|≤x)} |Xi | dP + ∫_{A∩(|Xi|>x)} |Xi | dP

≤ xP (A ∩ (|Xi | ≤ x)) + ∫_{A∩(|Xi|>x)} |Xi | dP

≤ xP (A) + ∫_{(|Xi|>x)} |Xi | dP

so

sup_{i∈I} ∫_A |Xi | dP ≤ xP (A) + sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP .

Now choose x = x0 > 0 according to the assumption of uniform integrability such that
sup_{i∈I} ∫_{(|Xi|>x0)} |Xi | dP < ε/2 .

Then for A ∈ F with P (A) < ε/(2x0 ) we must have

sup_{i∈I} ∫_A |Xi | dP < x0 · ε/(2x0 ) + ε/2 = ε ,

so if we choose δ = ε/(2x0 ) we have shown (2).

For the ”if” part of the theorem, assume that both (1) and (2) hold. Assume furthermore
that it is shown that
lim_{x→∞} sup_{i∈I} P (|Xi | > x) = 0 . (5.11)

In order to obtain that the definition of uniform integrability is fulfilled, let ε > 0 be given.
We want to find x0 > 0 such that

sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP ≤ ε

for x > x0 . Find the δ > 0 corresponding to ε according to (2) and then let x0 satisfy that

sup_{i∈I} P (|Xi | > x) < δ

for x > x0 . Now for all x > x0 and i ∈ I we have P (|Xi | > x) < δ such that (because of (2))
∫_{(|Xi|>x)} |Xi | dP ≤ sup_{j∈I} ∫_{(|Xi|>x)} |Xj | dP ≤ ε

so also

sup_{i∈I} ∫_{(|Xi|>x)} |Xi | dP ≤ ε .

Hence the proof is complete, if we can show (5.11). But for x > 0 it is obtained from Markov’s
inequality that
sup_{i∈I} P (|Xi | > x) ≤ (1/x) sup_{i∈I} E|Xi |
and the last term → 0 as x → ∞, since supi∈I E|Xi | is finite by the assumption (1).

The next result demonstrates the importance of uniform integrability if one aims to show L1
convergence.

Theorem 5.4.5. Let (Xn )n≥1 , X be real random variables with E|Xn | < ∞ for all n. Then
E|X| < ∞ and Xn → X in L1 if and only if (Xn )n≥1 is uniformly integrable and Xn → X in
probability.

Proof. Assume that E|X| < ∞ and Xn → X in L1 . Then in particular Xn → X in probability.

We will show that (Xn ) is uniformly integrable by showing (1) and (2) from Theorem 5.4.4
are satisfied.

That (1) is satisfied follows from the fact that

E|Xn | = E|Xn − X + X| ≤ E|Xn − X| + E|X|

so
sup_{n≥1} E|Xn | ≤ E|X| + sup_{n≥1} E|Xn − X| .

Since each E|Xn − X| ≤ E|Xn | + E|X| < ∞ and the sequence (E|Xn − X|)n∈N converges to
0 (according to the L1 –convergence), it must be bounded, so supn≥1 E|Xn − X| < ∞.

To show (2) first note that for A ∈ F


∫_A |Xn | dP = ∫_A |Xn − X + X| dP ≤ ∫_A |Xn − X| dP + ∫_A |X| dP ≤ E|Xn − X| + ∫_A |X| dP .

Now let ε > 0 be given and choose n0 so that E|Xn − X| < ε/2 for n > n0 . Furthermore (since
the one–member family {X} is uniformly integrable) we can choose δ1 > 0 such that
∫_A |X| dP < ε/2 if P (A) < δ1 .
Then we have

∫_A |Xn | dP ≤ E|Xn − X| + ∫_A |X| dP < ε
whenever n > n0 and P (A) < δ1 .

Now choose δ2 > 0 (since the finite family (Xn )1≤n≤n0 is uniformly integrable) such that
max_{1≤n≤n0} ∫_A |Xn | dP < ε
if P (A) < δ2 . We have altogether with δ = δ1 ∧ δ2 that for all n ∈ N it holds that ∫_A |Xn | dP <
ε if P (A) < δ, and this shows (2) since then

sup_{n≥1} ∫_A |Xn | dP ≤ ε if P (A) < δ .

For the converse implication, assume that (Xn )n≥1 is uniformly integrable and Xn → X in
probability. Then we can choose (according to Theorem 1.2.13) a subsequence (nk )k≥1 such that
Xnk → X a.s. From Fatou’s Lemma and the fact that (1) is satisfied we obtain

E|X| = E lim_{k→∞} |Xnk | = E lim inf_{k→∞} |Xnk | ≤ lim inf_{k→∞} E|Xnk | ≤ sup_{k≥1} E|Xnk | ≤ sup_{n≥1} E|Xn | < ∞ .

In order to show that E|Xn − X| → 0 let ε > 0 be given. We obtain for all n ∈ N

E|Xn − X| = ∫_{(|Xn−X|≤ε/2)} |Xn − X| dP + ∫_{(|Xn−X|>ε/2)} |Xn − X| dP

≤ ε/2 + ∫_{(|Xn−X|>ε/2)} |Xn − X| dP

≤ ε/2 + ∫_{(|Xn−X|>ε/2)} |Xn | dP + ∫_{(|Xn−X|>ε/2)} |X| dP

In accordance with (2) applied to the two families (Xn ) and (X) choose δ > 0 such that
sup_{m∈N} ∫_A |Xm | dP < ε/4 and ∫_A |X| dP < ε/4

for all A ∈ F with P (A) < δ. Since Xn → X in probability we can find n0 ∈ N such that
P (|Xn − X| > ε/2) < δ for n ≥ n0 . Then for n ≥ n0 we have
∫_{(|Xn−X|>ε/2)} |Xn | dP ≤ sup_{m≥1} ∫_{(|Xn−X|>ε/2)} |Xm | dP < ε/4

∫_{(|Xn−X|>ε/2)} |X| dP < ε/4

so we have shown that for n ≥ n0 it holds

E|Xn − X| < ε/2 + ε/4 + ε/4 = ε ,
which completes the proof.

After this digression into the standard theory of integration we now return to adapted se-
quences, martingales and submartingales.

Definition 5.4.6. If (Xn , Fn ) is a submartingale (martingale) and Y is a real random


variable with E|Y | < ∞, we say that Y closes the submartingale (martingale) if

E(Y |Fn ) ≥ Xn (respectively E(Y |Fn ) = Xn ) a.s.

for all n ≥ 1.

Theorem 5.4.7. (1) Let (Xn , Fn ) be a submartingale. If (Xn+ )n≥1 is uniformly integrable,
then X = lim_{n→∞} Xn exists almost surely, X closes the submartingale and Xn+ → X + in L1 .

A sufficient condition for uniform integrability of (Xn+ ) is the existence of a random variable
Y that closes the submartingale.

(2) Let (Xn , Fn ) be a martingale. If (Xn ) is uniformly integrable, then X = lim_{n→∞} Xn exists
almost surely, X closes the martingale and Xn → X in L1 .

A sufficient condition for uniform integrability of (Xn ) is the existence of a random variable
Y that closes the martingale.

Proof. (1) We start with the final claim, so assume that there exists Y such that

E(Y |Fn ) ≥ Xn a.s.

for all n ∈ N. Then from (9) in Theorem 4.2.6 we obtain

Xn+ ≤ (E(Y |Fn ))+ ≤ E(Y + |Fn ) a.s. (5.12)

and taking expectations on both sides yields EXn+ ≤ EY + so also supn EXn+ ≤ EY + < ∞.
Then according to the martingale convergence theorem (Theorem 5.3.2) we have that X = lim_{n→∞} Xn exists almost surely. Since (Xn(ω))n≥1 is convergent for almost all ω ∈ Ω, it is in particular bounded (though not by the same constant for each ω), so sup_n Xn+(ω) < ∞ for almost all ω. We obtain that for all x > 0 and n ∈ N
\[
\int_{(X_n^+>x)} X_n^+\,dP \le \int_{(X_n^+>x)} E(Y^+\mid\mathcal F_n)\,dP = \int_{(X_n^+>x)} Y^+\,dP \le \int_{(\sup_k X_k^+>x)} Y^+\,dP\,,
\]

where the first inequality is due to (5.12) and the equality follows from the definition of conditional expectations, since Xn is Fn–measurable so that (Xn+ > x) ∈ Fn. Since this is true for all n ∈ N we have
\[
\sup_{n\in\mathbb N}\int_{(X_n^+>x)} X_n^+\,dP \le \int_{(\sup_n X_n^+>x)} Y^+\,dP\,.
\]

Furthermore Y+ is integrable with 1_(sup_n Xn+ > x) Y+ ≤ Y+ for all x, and since sup_n Xn+ < ∞ almost surely, we have 1_(sup_n Xn+ > x) Y+ → 0 a.s. as x → ∞. By dominated convergence the right-hand integral above then tends to 0 as x → ∞. Hence we have shown that (Xn+) is uniformly integrable.

Now we return to the first statement, so assume that (Xn+) is uniformly integrable. In particular (according to (1) in Theorem 5.4.4) we have sup_n EXn+ < ∞. Then the martingale convergence theorem yields that X = lim_{n→∞} Xn exists almost surely with E|X| < ∞. Obviously we must also have Xn+ → X+ a.s., and then Theorem 5.4.5 implies that Xn+ → X+ in L1.

In order to show that
\[
E(X\mid\mathcal F_n) \ge X_n \quad\text{a.s.}
\]
for all n ∈ N, it is equivalent (according to Lemma 5.2.4) to show that for all n ∈ N it holds that
\[
\int_F X_n\,dP \le \int_F X\,dP \tag{5.13}
\]
for all F ∈ Fn. For F ∈ Fn and n ≤ N we have (according to Corollary 5.2.5, since (Xn, Fn) is a submartingale)
\[
\int_F X_n\,dP \le \int_F X_N\,dP = \int_F X_N^+\,dP - \int_F X_N^-\,dP\,.
\]

Since it has been shown that XN+ → X+ in L1, we have from the remark just before Definition 5.4.1 that
\[
\lim_{N\to\infty}\int_F X_N^+\,dP = \int_F X^+\,dP\,.
\]
Furthermore Fatou's lemma yields (using that XN → X a.s., so XN− → X− a.s. and thereby liminf_{N→∞} XN− = X− a.s.)
\[
\limsup_{N\to\infty}\Big(-\int_F X_N^-\,dP\Big) = -\liminf_{N\to\infty}\int_F X_N^-\,dP \le -\int_F \liminf_{N\to\infty} X_N^-\,dP = -\int_F X^-\,dP\,.
\]

When combining the obtained inequalities we have for all n ∈ N and F ∈ Fn that
\[
\begin{aligned}
\int_F X_n\,dP &\le \limsup_{N\to\infty}\Big(\int_F X_N^+\,dP - \int_F X_N^-\,dP\Big)\\
&= \lim_{N\to\infty}\int_F X_N^+\,dP + \limsup_{N\to\infty}\Big(-\int_F X_N^-\,dP\Big)\\
&\le \int_F X^+\,dP - \int_F X^-\,dP = \int_F X\,dP\,,
\end{aligned}
\]
which is the inequality (5.13) we were supposed to show.

(2) Once again we start by proving the last claim, so assume that Y closes the martingale, i.e. E(Y|Fn) = Xn a.s. for all n ∈ N. From (10) in Theorem 4.2.6 we have
\[
|X_n| = |E(Y\mid\mathcal F_n)| \le E(|Y|\mid\mathcal F_n) \quad\text{a.s.}
\]
Similarly to the argument in (1) we then have sup_{n≥1} E|Xn| ≤ E|Y| < ∞, so in particular sup_{n≥1} EXn+ < ∞. Hence X = lim_{n→∞} Xn exists almost surely, leading to the fact that sup_{n≥1} |Xn| < ∞ almost surely. Then for all n ∈ N and x > 0
\[
\int_{(|X_n|>x)} |X_n|\,dP \le \int_{(|X_n|>x)} E(|Y|\mid\mathcal F_n)\,dP = \int_{(|X_n|>x)} |Y|\,dP \le \int_{(\sup_k |X_k|>x)} |Y|\,dP\,,
\]
where the last integral tends to 0 as x → ∞ by dominated convergence. Hence
\[
0 \le \limsup_{x\to\infty}\sup_{n\in\mathbb N}\int_{(|X_n|>x)} |X_n|\,dP \le \lim_{x\to\infty}\int_{(\sup_k |X_k|>x)} |Y|\,dP = 0\,,
\]
so (Xn) is uniformly integrable.

Finally assume that (Xn) is uniformly integrable. Then sup_n E|Xn| < ∞ (from Theorem 5.4.4) and in particular sup_n EXn+ < ∞. According to the martingale convergence theorem we have Xn → X a.s. with E|X| < ∞. From Theorem 5.4.5 we have Xn → X in L1, which leads to (see the remark just before Definition 5.4.1)
\[
\lim_{N\to\infty}\int_F X_N\,dP = \int_F X\,dP
\]
for all F ∈ F. We also have from the martingale property of (Xn, Fn) that for all n ≤ N and F ∈ Fn
\[
\int_F X_n\,dP = \int_F X_N\,dP\,,
\]
so we must have
\[
\int_F X_n\,dP = \int_F X\,dP
\]
for all n ∈ N and F ∈ Fn. This shows that E(X|Fn) = Xn a.s., so X closes the martingale.

An important warning is the following: Let (Xn, Fn) be a submartingale and assume that (Xn+)n≥1 is uniformly integrable. As we have seen, we then have Xn → X a.s. and Xn+ → X+ in L1, but we do not in general have Xn → X in L1. If, e.g., (Xn−)n≥1 is also uniformly integrable, then it is true that Xn → X in L1, since Xn− → X− a.s. implies that Xn− → X− in L1 by Theorem 5.4.5, and then E|Xn − X| ≤ E|Xn+ − X+| + E|Xn− − X−| → 0.
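The role of uniform integrability can be made concrete by a standard counterexample (an added illustration, not from the text): Mn = 2^n 1_(U < 2^{−n}) with U uniform on (0,1) is a martingale with EMn = 1 for every n, yet Mn → 0 a.s., so (Mn) is not uniformly integrable and cannot converge in L1. A small Monte Carlo sketch:

```python
import random

def simulate(n, trials, seed=0):
    """Estimate E[M_n] and P(M_n = 0) for M_n = 2^n * 1(U < 2^-n), U uniform(0,1).

    (M_n) is a martingale with E[M_n] = 1 for every n, but M_n -> 0 a.s.,
    so (M_n) is not uniformly integrable and does not converge in L^1.
    """
    rng = random.Random(seed)
    zeros = 0
    total = 0.0
    for _ in range(trials):
        u = rng.random()
        m = 2.0 ** n if u < 2.0 ** (-n) else 0.0
        zeros += (m == 0.0)
        total += m
    return total / trials, zeros / trials

mean, p_zero = simulate(n=3, trials=200_000)
print(mean, p_zero)   # expect mean near 1 and p_zero near 1 - 2**-3 = 0.875
```

The mean stays pinned at 1 while the mass of Mn escapes to a vanishing event, which is exactly the failure of uniform integrability.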

Theorem 5.4.8. Let (Fn )n∈N be a filtration on (Ω, F, P ) and let Y be a random variable
with E|Y | < ∞. Define for all n ∈ N

Xn = E(Y |Fn ) .

Then (Xn , Fn ) is a martingale, and Xn → X a.s. and in L1 , where

X = E(Y |F∞ )

with F∞ ⊆ F the smallest σ–algebra containing Fn for all n ∈ N.

Proof. Clearly, (Xn , Fn ) is adapted and E|Xn | < ∞ and since

E(Xn+1 |Fn ) = E(E(Y |Fn+1 )|Fn ) = E(Y |Fn ) = Xn ,

we see that (Xn, Fn) is a martingale. It follows directly from the definition of Xn that Y closes the martingale, so by (2) in Theorem 5.4.7 it holds that X = lim Xn exists a.s. and that Xn → X in L1.

The remaining part of the proof is to show that

X = E(Y |F∞ ) . (5.14)

For this, first note that X = lim_{n→∞} Xn can be chosen F∞–measurable, since
\[
F = \Big(\lim_{n\to\infty} X_n \text{ exists}\Big) = \bigcap_{N=1}^{\infty}\bigcup_{k=1}^{\infty}\bigcap_{m,n=k}^{\infty}\Big(|X_n - X_m| \le \tfrac{1}{N}\Big) \in \mathcal F_\infty\,,
\]
and X can be defined as
\[
X = \lim_{n\to\infty} X_n 1_F\,,
\]
where each Xn 1F is F∞–measurable, making X measurable with respect to F∞ as well.


Hence, in order to show (5.14), we only need to show that
\[
\int_F X\,dP = \int_F Y\,dP
\]
for all F ∈ F∞. Note that F∞ = σ(∪_{n=1}^∞ Fn), where ∪_{n=1}^∞ Fn is ∩–stable. Then according to Exercise 4.15 it is enough to show the equality above for all F ∈ Fn for any given n ∈ N. So let n ∈ N be given and assume that F ∈ Fn. Then we have for all N ≥ n that
\[
\int_F X_N\,dP = \int_F Y\,dP\,,
\]

since XN = E(Y|FN) and F ∈ Fn ⊆ FN. Furthermore we have XN → X in L1, so
\[
\int_F X\,dP = \lim_{N\to\infty}\int_F X_N\,dP\,,
\]
which leads to the conclusion that
\[
\int_F X\,dP = \int_F Y\,dP\,.
\]

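Theorem 5.4.8 can be illustrated numerically (a sketch added here, not part of the text): take Y = f(U) for U uniform on (0,1) and let Fn be generated by the first n binary digits of U. Then Xn = E(Y|Fn) is the average of f over the dyadic interval of length 2^{−n} containing U, and F∞ = σ(U) up to null sets, so the martingale converges to Y itself. The particular f below is an arbitrary choice for the demonstration.

```python
import random

def f(u):
    # an arbitrary integrable target: Y = f(U) (hypothetical choice for the demo)
    return u * u

def cond_exp(u, n, grid=1000):
    """Approximate X_n = E(Y | F_n), where F_n is generated by the first n
    binary digits of U: average f over the dyadic interval containing u."""
    k = int(u * 2 ** n)                       # index of the dyadic interval
    lo, width = k / 2 ** n, 2.0 ** (-n)
    return sum(f(lo + width * (j + 0.5) / grid) for j in range(grid)) / grid

rng = random.Random(1)
u = rng.random()
errs = [abs(cond_exp(u, n) - f(u)) for n in (1, 4, 8, 12)]
print(errs)   # |X_n - Y| shrinks towards 0, since X_n -> Y = E(Y | F_infinity)
```

As n grows the dyadic interval shrinks around U, so the conditional expectation pins down Y, matching the a.s. and L1 convergence in the theorem.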
In conclusion of this chapter, we shall discuss an extension of the optional sampling theorem
(Theorem 5.2.12).

Let (Xn, Fn)n≥1 be a (usual) submartingale or martingale, and let τ be an arbitrary stopping time. If lim Xn = X exists a.s., we can define Xτ even if τ is not finite, as is assumed in Theorem 5.2.12:
\[
X_\tau(\omega) = \begin{cases} X_{\tau(\omega)}(\omega) & \text{if } \tau(\omega) < \infty\,,\\[2pt] X(\omega) & \text{if } \tau(\omega) = \infty\,. \end{cases}
\]
With this, Xτ is an Fτ-measurable random variable, which is only defined if (Xn) converges a.s.

If σ ≤ τ are two stopping times, we are interested in investigating when

E(Xτ |Fσ ) ≥ Xσ , (5.15)

if (Xn , Fn ) is a submartingale, and when

E(Xτ |Fσ ) = Xσ , (5.16)

if (Xn , Fn ) is a martingale.

By choosing τ = ∞ and σ = n for a given, arbitrary n ∈ N, the two conditions amount to the demand that X closes the submartingale, respectively the martingale. So if (5.15) or (5.16) is to hold for all pairs σ ≤ τ of stopping times, we must consider submartingales or martingales which satisfy the conditions from Theorem 5.4.7. For pairs with τ < ∞ we can use Theorem 5.2.12 to obtain weaker conditions.

Theorem 5.4.9 (Optional sampling, full version). Let (Xn, Fn) be a submartingale (martingale) and let σ ≤ τ be stopping times. If one of the three following conditions is satisfied, then E|Xτ| < ∞ and
\[
E(X_\tau\mid\mathcal F_\sigma) \ge X_\sigma \quad \text{(with equality in the martingale case).}
\]

(1) τ is bounded.

(2) τ < ∞, E|Xτ| < ∞ and
\[
\liminf_{N\to\infty}\int_{(\tau>N)} X_N^+\,dP = 0 \qquad \Big(\liminf_{N\to\infty}\int_{(\tau>N)} |X_N|\,dP = 0\Big).
\]

(3) (Xn+)n≥1 is uniformly integrable ((Xn)n≥1 is uniformly integrable).

If (3) holds, then lim Xn = X exists a.s. with E|X| < ∞, and for an arbitrary stopping time τ it holds that
\[
E|X_\tau| \le 2EX^+ - EX_1\,. \tag{5.17}
\]

Proof. That the conclusion is true under assumption (1) is simply Corollary 5.2.13. Comparing condition (2) with the conditions in Theorem 5.2.12 shows that if (2) implies E|Xσ| < ∞, then the argument concerning condition (2) is complete. For this, consider the increasing sequence of bounded stopping times (σ ∧ n)n≥1. For each n the pair of stopping times σ ∧ n ≤ σ ∧ (n+1) fulfils the conditions of Corollary 5.2.13, so E|Xσ∧n| < ∞, E|Xσ∧(n+1)| < ∞, and
\[
E(X_{\sigma\wedge(n+1)}\mid\mathcal F_{\sigma\wedge n}) \ge X_{\sigma\wedge n} \quad\text{a.s. (with equality in the martingale case),}
\]
which shows that the adapted sequence (Xσ∧n, Fσ∧n) is a submartingale (martingale). Similarly, the pair of stopping times σ ∧ n ≤ τ fulfils the conditions from Theorem 5.2.12 (because of assumption (2) and because E|Xσ∧n| < ∞ as argued above). Hence the theorem yields that for each n ∈ N we have
\[
E(X_\tau\mid\mathcal F_{\sigma\wedge n}) \ge X_{\sigma\wedge n} \quad\text{a.s. (with equality in the martingale case),}
\]
which shows that Xτ closes the submartingale (martingale) (Xσ∧n, Fσ∧n). Hence according to Theorem 5.4.7 it converges almost surely with integrable limit. Since obviously Xσ∧n → Xσ a.s., we conclude that E|Xσ| < ∞, as desired.

We finally show that (3) implies (5.15) if (Xn, Fn) is a submartingale. That (3) implies (5.16) if (Xn, Fn) is a martingale is then seen as follows: From the fact that (|Xn|) is uniformly integrable, both (Xn+) and (Xn−) are uniformly integrable. Since both (Xn, Fn) and (−Xn, Fn) are submartingales (with (Xn+) and ((−Xn)+) uniformly integrable), the result for submartingales can be applied to obtain
\[
E(X_\tau\mid\mathcal F_\sigma) \ge X_\sigma \quad\text{and}\quad E(X_\tau\mid\mathcal F_\sigma) \le X_\sigma\,,
\]
from which the desired result follows.

So, assume that (Xn, Fn) is a submartingale and that (Xn+) is uniformly integrable. From (1) in Theorem 5.4.7 we have that X = lim_{n→∞} Xn exists almost surely and that X closes the submartingale (Xn, Fn). Since (Xn+, Fn) is a submartingale as well, we can apply Theorem 5.4.7 again and obtain that X+ closes the submartingale (Xn+, Fn). Using the notation X∞+ = X+ we get
\[
EX_\tau^+ = \int \sum_{1\le n\le\infty} 1_{(\tau=n)} X_n^+\,dP = \sum_{1\le n\le\infty}\int_{(\tau=n)} X_n^+\,dP
\le \sum_{1\le n\le\infty}\int_{(\tau=n)} X^+\,dP = \int \sum_{1\le n\le\infty} 1_{(\tau=n)} X^+\,dP = \int X^+\,dP = EX^+\,,
\]
where at the inequality we have used Lemma 5.2.4 and (τ = n) ∈ Fn, since E(X+|Fn) ≥ Xn+. Let N ∈ N. Then τ ∧ N is a bounded stopping time with τ ∧ N ↑ τ as N → ∞. Applying the inequality above to τ ∧ N yields
\[
EX_{\tau\wedge N}^+ \le EX^+\,. \tag{5.18}
\]
Furthermore we have, according to part (1) of the theorem (since 1 ≤ τ ∧ N are bounded stopping times), that
\[
EX_1 \le EX_{\tau\wedge N}\,. \tag{5.19}
\]
Then combining (5.18) and (5.19) gives
\[
EX_{\tau\wedge N}^- = EX_{\tau\wedge N}^+ - EX_{\tau\wedge N} \le EX^+ - EX_1\,,
\]
such that by Fatou's lemma it holds that
\[
EX_\tau^- = E\liminf_{N\to\infty} X_{\tau\wedge N}^- \le \liminf_{N\to\infty} EX_{\tau\wedge N}^- \le EX^+ - EX_1\,.
\]
Thereby we have
\[
E|X_\tau| = EX_\tau^+ + EX_\tau^- \le 2EX^+ - EX_1\,,
\]
which in particular is finite.

The proof will be complete if we can show
\[
E(X_\tau\mid\mathcal F_\sigma) \ge X_\sigma\,,
\]
which is equivalent to showing
\[
\int_F X_\sigma\,dP \le \int_F X_\tau\,dP
\]
for all F ∈ Fσ. For showing this inequality it will suffice to show
\[
\int_{F\cap(\sigma=k)} X_k\,dP \le \int_{F\cap(\sigma=k)} X_\tau\,dP \tag{5.20}
\]
for all F ∈ Fσ and k ∈ N, since in that case we can obtain (using dominated convergence, since E|Xτ| < ∞ and E|Xσ| < ∞)
\[
\int_F X_\sigma\,dP = \sum_{1\le k\le\infty}\int_{F\cap(\sigma=k)} X_k\,dP \le \sum_{1\le k\le\infty}\int_{F\cap(\sigma=k)} X_\tau\,dP = \int_F X_\tau\,dP\,,
\]
and we obviously have
\[
\int_{F\cap(\sigma=\infty)} X_\infty\,dP = \int_{F\cap(\sigma=\infty)} X_\tau\,dP\,.
\]

So we will show the inequality (5.20). The theorem in the case (1) yields (since σ ∧ N ≤ τ ∧ N are bounded stopping times) that E(Xτ∧N|Fσ∧N) ≥ Xσ∧N a.s. Now let Fk = F ∩ (σ = k) and assume that N ≥ k. Then Fk ∈ Fσ∧N:
\[
F_k\cap(\sigma\wedge N = n) = \begin{cases} F\cap(\sigma=n)\in\mathcal F_n & n = k\,,\\[2pt] \emptyset\in\mathcal F_n & n \ne k\,. \end{cases}
\]
From this we obtain
\[
\int_{F_k} X_k\,dP = \int_{F_k} X_{\sigma\wedge N}\,dP \le \int_{F_k} X_{\tau\wedge N}\,dP = \int_{F_k\cap(\tau\le N)} X_\tau\,dP + \int_{F_k\cap(\tau>N)} X_N\,dP\,.
\]

We will spend the rest of the proof on bounding the two terms from above as N → ∞. For the first term, we have from dominated convergence (since |1_{Fk∩(τ≤N)} Xτ| ≤ |Xτ|, Xτ is integrable, and (τ ≤ N) ↑ (τ < ∞)) that
\[
\lim_{N\to\infty}\int_{F_k\cap(\tau\le N)} X_\tau\,dP = \int_{F_k\cap(\tau<\infty)} X_\tau\,dP\,.
\]
For the second term we will use that Fk ∈ Fσ∧N ⊆ FN and (τ > N) = (τ ≤ N)^c ∈ FN, such that Fk ∩ (τ > N) ∈ FN. Then, since X closes the submartingale (Xn, Fn), we obtain
\[
\int_{F_k\cap(\tau>N)} X_N\,dP \le \int_{F_k\cap(\tau>N)} X\,dP\,,
\]
where the right-hand side converges by dominated convergence:
\[
\lim_{N\to\infty}\int_{F_k\cap(\tau>N)} X\,dP = \int_{F_k\cap(\tau=\infty)} X\,dP = \int_{F_k\cap(\tau=\infty)} X_\tau\,dP\,.
\]
Altogether we have shown
\[
\int_{F_k} X_k\,dP \le \limsup_{N\to\infty}\Big(\int_{F_k\cap(\tau\le N)} X_\tau\,dP + \int_{F_k\cap(\tau>N)} X_N\,dP\Big)
\le \int_{F_k\cap(\tau<\infty)} X_\tau\,dP + \int_{F_k\cap(\tau=\infty)} X_\tau\,dP = \int_{F_k} X_\tau\,dP\,,
\]
which was the desired inequality (5.20).
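Optional sampling can be checked numerically in a simple case (an added illustration, not from the text): for a symmetric ±1 random walk stopped at the first visit to −a or b, the stopped walk is bounded, hence uniformly integrable, and E(Sτ) = E(S1) = 0 forces P(Sτ = b) = a/(a+b).

```python
import random

def hit_prob(a, b, trials, seed=2):
    """Symmetric +/-1 walk; tau = first visit to -a or b (a, b > 0).

    The stopped walk is bounded, hence uniformly integrable, so optional
    sampling gives E[S_tau] = 0 and therefore P(S_tau = b) = a / (a + b).
    """
    rng = random.Random(seed)
    hits_b = 0
    for _ in range(trials):
        s = 0
        while -a < s < b:
            s += 1 if rng.random() < 0.5 else -1
        hits_b += (s == b)
    return hits_b / trials

p = hit_prob(a=3, b=5, trials=50_000)
print(p)   # theory: 3 / 8 = 0.375
```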

5.5 The martingale central limit theorem

The principle that sums of independent random variables approximately follow a normal distribution is sound. But it underestimates the power of central limit theorems: sums of dependent variables are very often approximately normal as well. Many common dependence structures are weak in the sense that each term may be strongly dependent on a few other variables, but almost independent of the major part of the variables. Hence the sum of such variables will have a probabilistic structure resembling the sum of independent variables.

An important theme of probability theory in the 20th century has been the refinement of this loose reasoning. How should ”weak dependence” be understood, and how is it possible to control the difference between the sum of interest and the corresponding sum of independent variables? Typically, smaller and rather specific classes of models have been studied, while a general framework was missing. A huge number of papers exist focusing on

1) U-statistics (a type of sums that are very symmetric).


2) Stationary processes.
3) Markov processes.

The big turning point came around 1970, when a group of mathematicians, more or less independently of each other, succeeded in stating and proving central limit theorems in the framework of martingales. Almost all later work in the area has been based on the martingale results.

In the following we will consider a filtered probability space (Ω, F, (Fn ), P ), and for notational
reasons we will often need a ”time 0” σ–algebra F0 . In the lack of any other suggestions, we
will use the trivial σ–algebra
F0 = {∅, Ω} .

Definition 5.5.1. A real stochastic process (Xn )n≥1 is a martingale difference, relative to
the filtration (Fn )n≥1 , if

(1) (Xn )n≥1 is adapted to (Fn )n≥1 ,

(2) E |Xn | < ∞ for all n ≥ 1 ,

(3) E(Xn | Fn−1 ) = 0 a.s. for all n ≥ 1 .

Note that a martingale difference (Xn )n≥1 satisfies that

E(Xn ) = E(E(Xn | Fn−1 )) = E(0) = 0 for all n ≥ 1 .

If (Xn)n≥1 is a martingale difference, then
\[
S_n = \sum_{i=1}^n X_i \quad\text{for } n \ge 1
\]

is a martingale, relative to the same filtration, and all the variables in this martingale will
have mean 0. Conversely, if (Sn )n≥1 is a martingale with all variables having mean 0, then

X1 = S1 , Xn = Sn − Sn−1 for n = 2, 3, . . .

is a martingale difference. Hence a martingale difference carries essentially the same probabilistic information as a martingale, just viewed from a slightly shifted point of view.

Example 5.5.2. If the variables (Xn )n≥1 are independent and all have mean 0, then the
sequence form a martingale difference with respect to the natural filtration

Fn = σ(X1 , . . . , Xn ) ,

since
\[
E(X_n\mid\mathcal F_{n-1}) = E(X_n\mid X_1,\dots,X_{n-1}) \overset{\text{a.s.}}{=} E(X_n) = 0 \quad\text{for all } n \ge 1\,.
\]
The martingale corresponding to this martingale difference is what we normally interpret as a random walk. ◦

We will typically be interested in square-integrable martingale differences, that is martingale differences (Xn)n≥1 such that
\[
EX_n^2 < \infty \quad\text{for all } n\in\mathbb N\,.
\]



This leads to the introduction of conditional variances, defined by
\[
V_n = V(X_n\mid\mathcal F_{n-1}) = E(X_n^2\mid\mathcal F_{n-1}) \quad\text{a.s. for all } n \ge 1\,.
\]
It may also be useful to define the variables
\[
W_n = \sum_{m=1}^n V_m\,.
\]
In the terminology of martingales the process (Wn)n≥1 is often called the compensator of the martingale (Sn)n≥1. It is easily shown that
\[
S_n^2 - W_n
\]
is a martingale. It should be noted that in the case of a random walk, where the X–variables are independent, the compensator is non-random; more precisely
\[
W_n = \sum_{m=1}^n EX_m^2\,.
\]
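A quick Monte Carlo check of the compensator identity (an added illustration, not from the text): for a ±1 random walk the compensator is Wn = n, and the martingale property of Sn² − Wn implies E(Sn² − n) = 0.

```python
import random

def compensated_square(n_steps, trials, seed=3):
    """For a +/-1 random walk the compensator is W_n = n, so S_n^2 - n is a
    martingale and E[S_n^2 - n] = 0; this returns a Monte Carlo estimate."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(n_steps))
        acc += s * s - n_steps
    return acc / trials

bias = compensated_square(n_steps=20, trials=100_000)
print(bias)   # should be close to 0
```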

We shall study the so–called martingale difference arrays, abbreviated MDA’s. These are
triangular arrays (Xnm ) of real random variables,

X11
X21 X22
X31 X32 X33
.. .. .. ..
. . . .

such that each row in the array forms a martingale difference.

To avoid heavy notation we assume that the same fixed filtration (Fn)n≥1 is used in all rows. In principle, it would have been possible to use an entire triangular array of σ-algebras (Fnm), since nowhere in the arguments will we need that the σ-algebras in different rows are related, but in practice the higher generality will not be useful at all.

Under these notation–dictated conditions, the assumptions for being a MDA will be

1) Xnm is Fm –measurable for all n ∈ N, m = 1, . . . , n ,

2) E |Xnm | < ∞ for all n ≥ 1, m = 1, . . . , n ,

3) E(Xnm | Fm−1 ) = 0 a.s. for all n ≥ 1, m = 1, . . . , n .

Usually we will assume that all the variables in the array have second order moments. Similarly to the notation in Section 3.5 we introduce the cumulated sums within rows, defined by
\[
S_{nm} = \sum_{k=1}^m X_{nk} \quad\text{for } n \ge 1,\ m = 1,\dots,n\,.
\]

A central limit theorem in this framework will be a result concerning the full row sums Snn
converging in distribution towards a normal distribution as n → ∞.

In Section 3.5 we saw that under a condition of independence within rows, a central limit theorem is obtained by demanding that the variance of the row sums converges towards a fixed constant and that the terms in the sums are sufficiently small (Lyapounov's condition or Lindeberg's condition).

When generalizing to martingale difference arrays, it is still important to ensure that the terms are small. But the condition concerning convergence of the variance of the row sums is changed substantially. The new condition will be that the compensators of the rows,
\[
\sum_{m=1}^n E(X_{nm}^2\mid\mathcal F_{m-1})\,, \tag{5.21}
\]
(which are random variables) converge in probability towards a non-zero constant. This constant will serve as the variance in the limiting normal distribution. Without loss of generality, we shall assume that this constant is 1.

In order to ease notation, we introduce the conditional variances of the variables in the array,
\[
V_{nm} = E(X_{nm}^2\mid\mathcal F_{m-1}) \quad\text{for } n \ge 1,\ m = 1,\dots,n\,,
\]
and the corresponding cumulated sums
\[
W_{nm} = \sum_{k=1}^m V_{nk} \quad\text{for } n \ge 1,\ m = 1,\dots,n\,,
\]
representing the compensators within rows. Note that the Vnm's are all non-negative (almost surely), and that Wnm thereby grows as m increases. Furthermore note that Wnm is Fm−1–measurable.

At first, we shall additionally assume that |Wnn| ≤ 2, or equivalently that
\[
W_{nm} \le 2 \quad\text{a.s. for all } n \ge 1,\ m = 1,\dots,n\,.
\]
We will repeatedly use that for a bounded sequence of random variables that converges in probability to a constant, the expectations converge as well:

Lemma 5.5.3. Let (Xn) be a sequence of real random variables such that |Xn| ≤ C for all n ≥ 1 and some constant C. If Xn → x in probability, then EXn → x as well.

Proof. Note that (Xn)n≥1 is uniformly integrable, and thereby it converges in L1. Hence the convergence of expectations follows.

Lemma 5.5.4. Let (Xnm) be a triangular array consisting of real random variables with third order moments. Assume that there exists a filtration (Fn)n≥1 making each row in the array a martingale difference. Assume furthermore that
\[
\sum_{m=1}^n E(X_{nm}^2\mid\mathcal F_{m-1}) \xrightarrow{P} 1 \quad\text{as } n\to\infty\,.
\]
If
\[
\sum_{m=1}^n E(X_{nm}^2\mid\mathcal F_{m-1}) \le 2 \quad\text{a.s. for all } n \ge 1\,, \tag{5.22}
\]
and if the array fulfils Lyapounov's condition
\[
\sum_{m=1}^n E|X_{nm}|^3 \to 0 \quad\text{as } n\to\infty\,, \tag{5.23}
\]
then the row sums S_{nn} = \sum_{m=1}^n X_{nm} will satisfy
\[
S_{nn} \xrightarrow{wk} N(0,1) \quad\text{as } n\to\infty\,.
\]

Note that it is not important which upper bound is used in (5.22) – the number 2 could be replaced by any constant c > 1 without changing the proof and without changing the utility of the lemma.

Proof. It will be enough to show the convergence
\[
\int e^{iS_{nn}t + W_{nn}t^2/2}\,dP \to 1 \quad\text{as } n\to\infty \tag{5.24}
\]
for each t ∈ R. That is seen from the following: Let φn(t) be the characteristic function of Snn. Then we have
\[
\varphi_n(t)\,e^{t^2/2} = \int e^{iS_{nn}t + t^2/2}\,dP = \int e^{iS_{nn}t + W_{nn}t^2/2}\,dP + \int e^{iS_{nn}t}\big(e^{t^2/2} - e^{W_{nn}t^2/2}\big)\,dP\,.
\]
Since we have assumed that Wnn → 1 in probability, and the function x ↦ exp(t²/2) − exp(xt²/2) is continuous at 1 (with the value 0 there), we must have that
\[
e^{t^2/2} - e^{W_{nn}t^2/2} \xrightarrow{P} 0\,.
\]
Then
\[
P\Big(\big|e^{iS_{nn}t}\big(e^{t^2/2} - e^{W_{nn}t^2/2}\big)\big| > \varepsilon\Big) = P\Big(\big|e^{t^2/2} - e^{W_{nn}t^2/2}\big| > \varepsilon\Big) \to 0\,,
\]
so the integrand in the last integral above converges to 0 in probability. Furthermore recall that Wnn is bounded by 2, so the integrand must be bounded by e^{t²}. Then the integral converges to 0 as n → ∞ because of Lemma 5.5.3. So if (5.24) is shown, we will obtain that φn(t)e^{t²/2} → 1 as n → ∞, which is equivalent to having obtained
\[
\varphi_n(t) \to e^{-t^2/2} \quad\text{as } n\to\infty\,,
\]
and according to Theorem 3.4.20 we have thereby shown that S_{nn} \xrightarrow{wk} N(0,1).

In order to show (5.24), we define the variables
\[
Q_{nm} = e^{iS_{nm}t + W_{nm}t^2/2}\,, \qquad \tilde Q_{nm} = e^{iS_{n(m-1)}t + W_{nm}t^2/2}\,,
\]
and we will be done if we can show
\[
E(Q_{nn} - 1) = \int Q_{nn}\,dP - 1 \to 0\,.
\]
Firstly, we can rewrite (defining S_{n0} = 0 and Q_{n0} = 1)
\[
Q_{nn} - 1 = \sum_{m=1}^n \big(Q_{nm} - Q_{n(m-1)}\big)\,,
\]
and we observe that
\[
Q_{nm} = e^{iX_{nm}t}\,\tilde Q_{nm}\,, \qquad Q_{n(m-1)} = e^{-V_{nm}t^2/2}\,\tilde Q_{nm}\,,
\]
such that
\[
Q_{nn} - 1 = \sum_{m=1}^n \big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\big)\tilde Q_{nm}\,.
\]
Recall from the definitions that both S_{n(m−1)} and W_{nm} are Fm−1–measurable, such that Q̃nm is Fm−1–measurable as well. Then we can write
\[
\begin{aligned}
E(Q_{nn} - 1) &= \sum_{m=1}^n E\Big(\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\big)\tilde Q_{nm}\Big)\\
&= \sum_{m=1}^n E\Big(E\big(\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\big)\tilde Q_{nm}\,\big|\,\mathcal F_{m-1}\big)\Big)\\
&= \sum_{m=1}^n E\Big(E\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\,\big|\,\mathcal F_{m-1}\big)\tilde Q_{nm}\Big)\,.
\end{aligned}
\]

Furthermore recall that |Wnm| ≤ |Wnn| ≤ 2 a.s., such that
\[
|\tilde Q_{nm}| = |e^{iS_{n(m-1)}t}|\,|e^{W_{nm}t^2/2}| \le e^{t^2} \quad\text{a.s.,}
\]
so that, using the triangle inequality, we obtain
\[
\begin{aligned}
|E(Q_{nn}-1)| &= \Big|\sum_{m=1}^n E\Big(E\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\,\big|\,\mathcal F_{m-1}\big)\tilde Q_{nm}\Big)\Big|\\
&\le \sum_{m=1}^n E\Big(\big|E\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\,\big|\,\mathcal F_{m-1}\big)\big|\,|\tilde Q_{nm}|\Big)\\
&\le \sum_{m=1}^n E\big|E\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\,\big|\,\mathcal F_{m-1}\big)\big|\,e^{t^2}\,.
\end{aligned}
\]
m=1

We have from Lemma 3.5.2 that
\[
e^{iy} = 1 + iy - \frac{y^2}{2} + r_1(y)\,, \qquad |r_1(y)| \le \frac{|y|^3}{3}\,,
\]
for y ∈ R, and that
\[
e^{-y/2} = 1 - \frac{y}{2} + r_2(y)\,, \qquad |r_2(y)| \le \frac{y^2}{8}\,,
\]
for y ≥ 0. Then
\[
\begin{aligned}
&E\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\,\big|\,\mathcal F_{m-1}\big)\\
&\quad= E\Big(1 + iX_{nm}t - \frac{X_{nm}^2 t^2}{2} + r_1(X_{nm}t) - \Big(1 - \frac{V_{nm}t^2}{2} + r_2(V_{nm}t^2)\Big)\,\Big|\,\mathcal F_{m-1}\Big)\\
&\quad= itE(X_{nm}\mid\mathcal F_{m-1}) - \frac{t^2}{2}E(X_{nm}^2\mid\mathcal F_{m-1}) + E(r_1(X_{nm}t)\mid\mathcal F_{m-1}) + \frac{V_{nm}t^2}{2} - r_2(V_{nm}t^2)\\
&\quad= E(r_1(X_{nm}t)\mid\mathcal F_{m-1}) - r_2(V_{nm}t^2)\,.
\end{aligned}
\]
And if we apply the upper bounds for the remainder terms, we obtain
\[
\begin{aligned}
\big|E\big(e^{iX_{nm}t} - e^{-V_{nm}t^2/2}\,\big|\,\mathcal F_{m-1}\big)\big| &= |E(r_1(X_{nm}t)\mid\mathcal F_{m-1}) - r_2(V_{nm}t^2)|\\
&\le E\big(|r_1(X_{nm}t)|\,\big|\,\mathcal F_{m-1}\big) + |r_2(V_{nm}t^2)|\\
&\le \frac{|t|^3}{3}E\big(|X_{nm}|^3\,\big|\,\mathcal F_{m-1}\big) + \frac{V_{nm}^2 t^4}{8}\,.
\end{aligned}
\]
Collecting all the obtained inequalities yields that
\[
|E(Q_{nn}-1)| \le e^{t^2}\Big(\frac{|t|^3}{3}\sum_{m=1}^n E|X_{nm}|^3 + \frac{t^4}{8}\sum_{m=1}^n EV_{nm}^2\Big)\,.
\]
That the first sum above converges to 0 is simply the Lyapounov condition that is assumed in the lemma. Hence the proof will be complete if we can show that
\[
\sum_{m=1}^n EV_{nm}^2 \to 0
\]
as n → ∞. This is obviously the same as showing that
\[
E\Big(\sum_{m=1}^n V_{nm}^2\Big) \to 0 \quad\text{as } n\to\infty\,. \tag{5.25}
\]

The integrand above has the upper bound
\[
\sum_{m=1}^n V_{nm}^2 \le \sum_{m=1}^n V_{nm}\,\max_{k=1,\dots,n} V_{nk} = \Big(\max_{k=1,\dots,n} V_{nk}\Big)\sum_{m=1}^n V_{nm}\,, \tag{5.26}
\]
and this is bounded by 4, since \sum_{m=1}^n V_{nm} ≤ 2, such that in particular also all V_{nk} ≤ 2. Hence the integrand in (5.25) is bounded, so Lemma 5.5.3 gives that the proof is complete if we can show
\[
\sum_{m=1}^n V_{nm}^2 \xrightarrow{P} 0\,,
\]
which (because of the inequality (5.26)) will be the case if we can show
\[
\Big(\max_{k=1,\dots,n} V_{nk}\Big)\sum_{m=1}^n V_{nm} \xrightarrow{P} 0\,.
\]

Since we have assumed that \sum_{m=1}^n V_{nm} → 1 in probability, it will according to Theorem 3.3.3 be enough to show that
\[
\max_{k=1,\dots,n} V_{nk} \xrightarrow{P} 0\,. \tag{5.27}
\]

In order to show (5.27) we will utilize the fact that for each c > 0 there exists a d > 0 such that
\[
x^2 \le c + d|x|^3 \quad\text{for all } x\in\mathbb R\,.
\]
So let c > 0 be some arbitrary number and find the corresponding d > 0. Then
\[
V_{nm} = E(X_{nm}^2\mid\mathcal F_{m-1}) \le E(c + d|X_{nm}|^3\mid\mathcal F_{m-1}) = c + dE(|X_{nm}|^3\mid\mathcal F_{m-1}) \le c + d\sum_{m=1}^n E(|X_{nm}|^3\mid\mathcal F_{m-1})\,,
\]
and since this upper bound does not depend on m, we have the inequality
\[
\max_{m=1,\dots,n} V_{nm} \le c + d\sum_{m=1}^n E(|X_{nm}|^3\mid\mathcal F_{m-1})\,.
\]
Taking expectations then gives
\[
E\Big(\max_{m=1,\dots,n} V_{nm}\Big) \le c + d\sum_{m=1}^n E\big(E(|X_{nm}|^3\mid\mathcal F_{m-1})\big) = c + d\sum_{m=1}^n E|X_{nm}|^3\,,
\]

and letting n → ∞ yields
\[
\limsup_{n\to\infty} E\Big(\max_{m=1,\dots,n} V_{nm}\Big) \le \limsup_{n\to\infty}\Big(c + d\sum_{m=1}^n E|X_{nm}|^3\Big) = c + d\lim_{n\to\infty}\sum_{m=1}^n E|X_{nm}|^3 = c\,.
\]
Since c > 0 was chosen arbitrarily, and the left-hand side is independent of c, we must have (recall that it is non-negative)
\[
\lim_{n\to\infty} E\Big(\max_{m=1,\dots,n} V_{nm}\Big) = 0\,.
\]
And since E max_{m=1,...,n} V_{nm} = E|max_{m=1,...,n} V_{nm} − 0|, we actually have that
\[
\max_{m=1,\dots,n} V_{nm} \xrightarrow{L^1} 0\,,
\]
which in particular implies (5.27).

Theorem 5.5.5 (Brown). Let (Xnm) be a triangular array of real random variables with third order moments. Assume that there exists a filtration (Fn)n≥1 that makes each row in the array a martingale difference. Assume furthermore that
\[
\sum_{m=1}^n E(X_{nm}^2\mid\mathcal F_{m-1}) \xrightarrow{P} 1 \quad\text{as } n\to\infty\,. \tag{5.28}
\]
If the array fulfils the conditional Lyapounov condition
\[
\sum_{m=1}^n E\big(|X_{nm}|^3\mid\mathcal F_{m-1}\big) \xrightarrow{P} 0 \quad\text{as } n\to\infty\,, \tag{5.29}
\]
then the row sums S_{nn} = \sum_{m=1}^n X_{nm} satisfy
\[
S_{nn} \xrightarrow{\mathcal D} N(0,1) \quad\text{as } n\to\infty\,.
\]

Proof. Most of the work is already done in Lemma 5.5.4 – we only need a little bit of martingale technology in order to reduce the general setting to the situation in the lemma.

Analogous to the Wnm-variables from before, we define the cumulated third order moments within each row,
\[
Z_{nm} = \sum_{k=1}^m E\big(|X_{nk}|^3\mid\mathcal F_{k-1}\big)\,.
\]

Furthermore define the variables
\[
X_{nm}^* = X_{nm}\,1_{(W_{nm}\le 2,\,Z_{nm}\le 1)}\,.
\]
It is not important exactly which upper limit is chosen above for the Z–variables – any strictly positive upper limit would give the same results as 1. The trick will be to see that the triangular array (X*nm) consisting of the star variables fulfils the conditions from Lemma 5.5.4.

Note that since both Wnm and Znm are Fm−1–measurable, the indicator function will be Fm−1–measurable. Hence each X*nm must be Fm–measurable. Furthermore (using that (Xnm) is a martingale difference array)
\[
E|X_{nm}^*| = E|X_{nm}1_{(W_{nm}\le2,\,Z_{nm}\le1)}| \le E|X_{nm}| < \infty
\]
and also
\[
E(X_{nm}^*\mid\mathcal F_{m-1}) = E(X_{nm}\mid\mathcal F_{m-1})\,1_{(W_{nm}\le2,\,Z_{nm}\le1)} = 0\,.
\]
Altogether this shows that (X*nm) is a martingale difference array. We will define the variables V*nm and W*nm for (X*nm) similarly to the variables Vnm and Wnm for (Xnm):
\[
V_{nm}^* = E\big((X_{nm}^*)^2\mid\mathcal F_{m-1}\big)\,, \qquad W_{nm}^* = \sum_{k=1}^m V_{nk}^*\,.
\]
Then (as before)
\[
V_{nm}^* = V_{nm}\,1_{(W_{nm}\le2,\,Z_{nm}\le1)}\,,
\]
so
\[
W_{nm}^* = \sum_{k=1}^m V_{nk}\,1_{(W_{nk}\le2,\,Z_{nk}\le1)}\,.
\]

From this we obtain that W*nn ≤ 2 (we only add V's to the sum as long as the W's are below 2). We also have that
\[
W_{nm}^* = W_{nm} \quad\text{for } m = 1,\dots,n \ \text{ on } (W_{nn}\le2,\,Z_{nn}\le1)\,, \tag{5.30}
\]
because, since both Wnk and Znk are increasing in k, all the indicator functions 1_{(Wnk≤2, Znk≤1)} are 1 on the set (Wnn ≤ 2, Znn ≤ 1). Since Wnn → 1 and Znn → 0 in probability, it holds that P(Wnn ≤ 2) = P(|Wnn − 1| ≤ 1) → 1 and P(Znn ≤ 1) = P(|Znn − 0| ≤ 1) → 1. Hence also P(Wnn ≤ 2, Znn ≤ 1) → 1. Combining this with (5.30) yields
\[
1 \ge P(|W_{nn}^* - W_{nn} - 0| \le \varepsilon) \ge P(W_{nn}^* = W_{nn}) \ge P(W_{nn}\le2,\,Z_{nn}\le1) \to 1\,,
\]
which shows that W*nn − Wnn → 0 in probability. Then
\[
W_{nn}^* = (W_{nn}^* - W_{nn}) + W_{nn} \xrightarrow{P} 0 + 1 = 1\,.
\]

To be able to apply Lemma 5.5.4 to the triangular array we still need to show that the array satisfies the unconditional Lyapounov condition (5.23). For this define
\[
Z_{nn}^* = \sum_{k=1}^n E\big(|X_{nk}^*|^3\mid\mathcal F_{k-1}\big) = \sum_{k=1}^n E\big(|X_{nk}|^3\mid\mathcal F_{k-1}\big)\,1_{(W_{nk}\le2,\,Z_{nk}\le1)}\,.
\]
It is obvious that Z*nn ≤ 1, and using that all terms in Znm are non-negative, such that Znm increases for m = 1,...,n, we also see (as above) that Z*nn ≤ Znn. The assumption Znn → 0 in probability then implies
\[
P(|Z_{nn}^* - 0| > \varepsilon) = P(Z_{nn}^* > \varepsilon) \le P(Z_{nn} > \varepsilon) \to 0\,,
\]
so Z*nn → 0 in probability. The fact that 0 ≤ Z*nn ≤ 1 makes (Z*nn) uniformly integrable; for x > 1:
\[
\sup_{n\in\mathbb N}\int_{(Z_{nn}^*\ge x)} Z_{nn}^*\,dP = \int_\emptyset Z_{nn}^*\,dP = 0\,.
\]
So Theorem 5.4.5 gives that Z*nn → 0 in L1. Hence E(Z*nn) = E|Z*nn − 0| → 0, such that
\[
\sum_{k=1}^n E|X_{nk}^*|^3 = \sum_{k=1}^n E\big(E(|X_{nk}^*|^3\mid\mathcal F_{k-1})\big) = E\Big(\sum_{k=1}^n E(|X_{nk}^*|^3\mid\mathcal F_{k-1})\Big) = E(Z_{nn}^*) \to 0\,.
\]

Summarising the results so far, we have shown that all conditions from Lemma 5.5.4 are satisfied for the martingale difference array (X*nm). Then the lemma gives that
\[
\sum_{m=1}^n X_{nm}^* \xrightarrow{\mathcal D} N(0,1)\,.
\]
We have already argued that on the set (Wnn ≤ 2, Znn ≤ 1) all the indicator functions 1_{(Wnk≤2, Znk≤1)} are 1. Hence
\[
X_{nm}^* = X_{nm} \quad\text{for all } m = 1,\dots,n \ \text{ on } (W_{nn}\le2,\,Z_{nn}\le1)\,,
\]
so also
\[
\sum_{m=1}^n X_{nm}^* = \sum_{m=1}^n X_{nm} \quad\text{on } (W_{nn}\le2,\,Z_{nn}\le1)\,.
\]
Then (using an argument similar to the previous one) we obtain
\[
1 \ge P\Big(\Big|\sum_{m=1}^n X_{nm} - \sum_{m=1}^n X_{nm}^* - 0\Big| \le \varepsilon\Big) \ge P(W_{nn}\le2,\,Z_{nn}\le1) \to 1\,,
\]
so
\[
\sum_{m=1}^n X_{nm} - \sum_{m=1}^n X_{nm}^* \xrightarrow{P} 0\,.
\]

Referring to Slutsky’s lemma completes the proof since then


n n n
X n 
D
X X X
∗ ∗
Xnm = Xnm + Xnm − Xnm −→ N (0, 1)
m=1 m=1 m=1 m=1

By some work it is possible to replace the third order conditions by some Lindeberg–inspired
conditions. It is sufficient that all the X–variables have second order moment, satisfy (5.28),
and fulfil
n
X  P
E Xn2 m 1(|Xn m |>c) | Fm−1 → 0 as n → ∞ , (5.31)
m=1
for all c > 0 in order for the conclusion in Brown’s Theorem to be maintained.
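Brown's theorem can be sketched numerically (an added illustration, not from the text). The array below has genuinely dependent terms — the conditional variance of each term depends on the sign of the previous one — yet the compensator converges to 1 in probability and the conditional Lyapounov condition holds, so the row sums are approximately N(0,1). All parameter choices here are ours.

```python
import math
import random

def row_sum(n, rng):
    """One row sum S_nn of a martingale difference array with dependent terms:
    X_nm = eps_m * sqrt(s2_m / n),  s2_m = 1 + 0.5 * eps_{m-1}  (eps_0 = 0),
    with eps_m i.i.d. +/-1.  The compensator (1/n) * sum s2_m -> 1 in
    probability and the conditional Lyapounov condition holds."""
    prev, total = 0.0, 0.0
    for _ in range(n):
        s2 = 1.0 + 0.5 * prev
        eps = 1.0 if rng.random() < 0.5 else -1.0
        total += eps * math.sqrt(s2 / n)
        prev = eps
    return total

rng = random.Random(4)
samples = [row_sum(300, rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples) - mean ** 2
within = sum(abs(x) <= 1.0 for x in samples) / len(samples)
print(mean, var, within)   # roughly 0, 1 and 0.68, matching N(0, 1)
```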

5.6 Exercises

All random variables in the following exercises are assumed to be real valued.

Exercise 5.1. Characterise the mean function n ↦ E(Xn) if (Xn, Fn) is

(1) a martingale.

(2) a submartingale.

(3) a supermartingale.

Show that a submartingale is a martingale, if and only if the mean function is constant. ◦

Exercise 5.2. Let (Fn ) be a filtration on (Ω, F) and assume that τ and σ are stopping
times. Show that τ ∨ σ and τ + σ are stopping times. ◦

Exercise 5.3. Let X1, X2, ... be independent and identically distributed real random variables such that EX1 = 0 and V X1 = σ². Let (Fn) be the filtration on (Ω, F) defined by Fn = F(X1, ..., Xn). Define
\[
Y_n = \Big(\sum_{k=1}^n X_k\Big)^2\,, \qquad Z_n = Y_n - n\sigma^2\,.
\]
Show that (Yn, Fn) is a submartingale and that (Zn, Fn) is a martingale. ◦

Exercise 5.4. Assume that (Xn , Fn ) is an adapted sequence, where each Xn is a real valued
random variable. Let A ∈ B. Define τ : Ω → N ∪ {∞} by

τ (ω) = inf{n ∈ N : Xn (ω) ∈ A} .

Show that τ is a stopping time. ◦

Exercise 5.5. Let (Fn ) be a filtration on (Ω, F). Let τ : Ω → N∪{∞} be a random variable.
Show that τ is a stopping time if and only if there exists a sequence of sets (Fn ), such that
Fn ∈ Fn for all n ∈ N and
τ (ω) = inf{n ∈ N : ω ∈ Fn }

Exercise 5.6. Let (Fn) be a filtration on (Ω, F) and consider a sequence of sets (Fn) with Fn ∈ Fn for each n. Let σ be a stopping time and define
\[
\tau(\omega) = \inf\{n > \sigma(\omega) : \omega \in F_n\}\,.
\]
Show that τ is a stopping time. ◦

Exercise 5.7.

(1) Assume that (Xn , Fn ) is an adapted sequence. Show that (Xn , Fn ) is a martingale if
and only if
E(Xk |Fτ ) = Xτ a.s.

for all k ∈ N and all stopping times τ , where τ ≤ k.

(2) Show that if (Xn , Fn ) is a martingale and τ ≤ m ∈ N, then

E(Xτ ) = E(X1 )

(3) Show that if (Xn , Fn ) is a submartingale and τ ≤ m ∈ N, then

E(X1 ) ≤ E(Xτ ) ≤ E(Xm )



Exercise 5.8. Let (X1, X2, ...) be a sequence of independent and identically distributed random variables such that X1 ∼ Pois(λ). Define for n ∈ N
\[
S_n = \sum_{k=1}^n X_k \quad\text{and}\quad \mathcal F_n = \mathcal F(X_1,\dots,X_n)\,.
\]

(1) Show that (Sn − nλ, Fn ) is a martingale.

(2) Define
τ = inf{n ∈ N : Sn ≥ 1}

Show that τ is a stopping time.

(3) Show that E(Sτ ∧n ) = λE(τ ∧ n) for all n ∈ N.

(4) Argue that P (τ < ∞) = 1.

(5) Show that E(Sτ ) = λE(τ ).

Exercise 5.9. Let X1, X2, ... be a sequence of independent random variables with EXn = 0 for all n ∈ N. Assume that \sum_{n=1}^\infty EX_n^2 < \infty and define Sn = X1 + ··· + Xn for all n ∈ N. Show that lim_{n→∞} Sn exists almost surely. ◦
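A numerical illustration of the phenomenon in Exercise 5.9 (added here, not part of the exercise): with Xn = ±1/n, \sum EX_n^2 = \sum n^{-2} < \infty, so Sn converges a.s., and the late increments of the partial sums are correspondingly tiny.

```python
import random

def largest_tail_increment(n1, n2, paths, seed=7):
    """With X_n = +/- 1/n independent, sum E[X_n^2] = sum 1/n^2 < infinity,
    so S_n converges a.s.; the increment S_{n2} - S_{n1} has standard
    deviation sqrt(1/n1 - 1/n2).  Returns the largest |S_{n2} - S_{n1}|
    observed over several independent paths."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(paths):
        delta = sum((1.0 if rng.random() < 0.5 else -1.0) / n
                    for n in range(n1 + 1, n2 + 1))
        worst = max(worst, abs(delta))
    return worst

worst = largest_tail_increment(2000, 4000, paths=100)
print(worst)   # small: each increment has standard deviation about 0.016
```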

Exercise 5.10. Assume that X1, X2, ... are independent and identically distributed random variables with
\[
P(X_n = 1) = p\,, \qquad P(X_n = -1) = 1 - p\,,
\]
where 0 < p < 1 with p ≠ 1/2. Define
\[
S_n = X_1 + \dots + X_n
\]
and Fn = F(X1, ..., Xn) for all n ≥ 1.

(1) Let r = (1 − p)/p and Mn = r^{Sn}. Show that E(Mn) = 1 for all n ≥ 1, and show that (Mn, Fn)n∈N is a martingale.

(2) Show that M∞ = limn→∞ Mn exists a.s.



(3) Show that EM∞ ≤ 1.

(4) Show that
\[
\lim_{n\to\infty}\frac{1}{n}S_n = 2p - 1 \quad\text{a.s.}
\]
and conclude that Sn → +∞ a.s. if p > 1/2 and Sn → −∞ a.s. if p < 1/2.

(5) Let a, b ∈ Z with a < 0 < b and define

τ = inf{n ∈ N | Sn = a or Sn = b}

Show that τ is a stopping time.

(6) Show that P (τ < ∞) = 1, and realise that P (Sτ ∈ {a, b}) = 1.

(7) Show that EMτ ∧n = 1 for all n ≥ 1.

(8) Show that for all n ∈ N
\[
|S_{\tau\wedge n}| \le |a| \vee b\,,
\]
and conclude that the sequence (Mτ∧n)n∈N is bounded.

(9) Show that EMτ = 1.

(10) Show that

P (Sτ = b) = (1 − r^a)/(r^b − r^a)  and  P (Sτ = a) = (r^b − 1)/(r^b − r^a) .
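The formula in (10) can be checked by simulation. The sketch below (the function name and the particular values p = 0.4, a = −3, b = 2 are our own choices, not part of the exercise) estimates P (Sτ = b) by Monte Carlo and compares it with (1 − r^a)/(r^b − r^a):

```python
import random

def exit_prob_b(p, a, b, n_paths, seed=0):
    # Estimate P(S_tau = b) for the walk S_n with i.i.d. steps +1 (prob p)
    # and -1 (prob 1 - p), stopped when it first hits a < 0 or b > 0.
    rng = random.Random(seed)
    hits_b = 0
    for _ in range(n_paths):
        s = 0
        while s != a and s != b:
            s += 1 if rng.random() < p else -1
        hits_b += (s == b)
    return hits_b / n_paths

p, a, b = 0.4, -3, 2
r = (1 - p) / p
theory = (1 - r**a) / (r**b - r**a)        # formula from part (10)
estimate = exit_prob_b(p, a, b, n_paths=20000)
```

With these values r = 1.5, and the estimate should agree with the exact probability to within Monte Carlo error.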

Exercise 5.11. The purpose of this exercise is to show that for a random variable X with
EX 2 < ∞ and a sub σ–algebra D of F we have the following version of Jensen’s inequality
for conditional expectations

E(X 2 |D) ≥ E(X|D)2 a.s.

(1) Show that x^2 − y^2 ≥ 2y(x − y) for all x, y ∈ R and show that for all n ∈ N it holds that

1Dn (X^2 − E(X|D)^2) ≥ 1Dn 2E(X|D)(X − E(X|D)) ,

where Dn = (|E(X|D)| ≤ n). Show that both the left hand side and the right hand side are integrable.

(2) Show that

E( 1Dn 2E(X|D)(X − E(X|D)) | D ) = 0 a.s.

(3) Show that

1Dn E(X^2 |D) ≥ 1Dn E(X|D)^2 a.s. for all n ∈ N,

and conclude

E(X^2 |D) ≥ E(X|D)^2 a.s.

Exercise 5.12.(The Chebychev–Kolmogorov inequality) Let (Xn , Fn ) be a martingale where


EXn2 < ∞ for all n ∈ N.

(1) Show that (Xn2 , Fn ) is a submartingale.

(2) Define for some ε > 0

τ = inf{n ∈ N : |Xn | ≥ ε}

Show that τ is a stopping time.

(3) Show that for all n ∈ N it holds that

EX^2_{τ ∧n} ≤ EX^2_n

(4) Show that

EX^2_{τ ∧n} ≥ ε^2 P ( max_{k=1,...,n} |Xk | ≥ ε )

(5) Conclude the Chebychev–Kolmogorov Inequality:

P ( max_{k=1,...,n} |Xk | ≥ ε ) ≤ EX^2_n / ε^2
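The inequality can be sanity-checked numerically. The sketch below (our own setup) uses the simple symmetric random walk, which is a martingale with EX_n^2 = n, and compares the empirical probability that the running maximum exceeds ε with the bound EX_n^2 / ε^2:

```python
import random

def maximal_tail_prob(n, eps, n_paths, seed=1):
    # Estimate P(max_{k<=n} |X_k| >= eps) for the simple symmetric walk.
    rng = random.Random(seed)
    count = 0
    for _ in range(n_paths):
        s, m = 0, 0
        for _ in range(n):
            s += 1 if rng.random() < 0.5 else -1
            m = max(m, abs(s))
        count += (m >= eps)
    return count / n_paths

n, eps = 100, 25
prob = maximal_tail_prob(n, eps, n_paths=5000)
bound = n / eps**2        # E X_n^2 / eps^2 = n / eps^2 for this walk
```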

Exercise 5.13.(Doob’s Inequality) Assume that (Yn , Fn )n∈N is a submartingale. Let t > 0
be a given constant.

(1) Define τ : Ω → N ∪ {∞} by

τ = inf{k ∈ N : Yk ≥ t}

Show that τ is a stopping time.



(2) Let n ∈ N and define

An = ( max_{k=1,...,n} Yk ≥ t )

Use the definition of An and τ to show

tP (An ) ≤ ∫_{An} Yτ ∧n dP

(3) Show "Doob's Inequality":

tP (An ) ≤ ∫_{An} Yn dP

Exercise 5.14.

(1) Assume that X1 , X2 , . . . are random variables with each Xn ≥ 0 and E|Xn | < ∞, such
that Xn → 0 a.s. and EXn → 0. Show that (Xn ) is uniformly integrable.

(2) Find a sequence X1 , X2 , . . . of random variables on (Ω, F, P ) = ([0, 1], B, λ) such that

Xn → 0 a.s. and EXn → 0

but where (Xn ) is not uniformly integrable.

Exercise 5.15. Let X be a random variable with E|X| < ∞. Let G be a collection of sub σ–algebras of F. In this exercise we shall show that the following family of random variables

(E(X|D))D∈G

is uniformly integrable.

(1) Let D ∈ G. Show that for all x > 0 it holds that

∫_{(|E(X|D)|>x)} |E(X|D)| dP ≤ ∫_{(E(|X||D)>x)} |X| dP

(2) Show that for all K ∈ N and x > 0

∫_{(E(|X||D)>x)} |X| dP ≤ ∫_{(|X|>K)} |X| dP + K E|X|/x

(3) Show that (E(X|D))D∈G is uniformly integrable.

Exercise 5.16. Assume that (Ω, F, (Fn ), P ) is a filtered probability space. Let τ be a
stopping time with Eτ < ∞. Assume that (Xn , Fn ) is a martingale.

(1) Argue that Xτ is almost surely well–defined and that Xτ ∧n → Xτ a.s.

(2) Assume that (Xτ ∧n )n∈N is uniformly integrable. Show that E|Xτ | < ∞ and Xτ ∧n → Xτ in L^1.

(3) Assume that (Xτ ∧n )n∈N is uniformly integrable. Show that EXτ = EX1 .

(4) Assume that a random variable Y exists such that E|Y | < ∞ and |Xτ ∧n | ≤ |Y | a.s.
for all n ∈ N. Show that (Xτ ∧n ) is uniformly integrable.

In the rest of the exercise you can use without proof that

|Xτ ∧n | ≤ |X1 | + ∑_{m=1}^{n−1} 1(τ >m) |Xm+1 − Xm | (5.32)

for all n ∈ N.

(5) Show that

Eτ = ∑_{n=0}^∞ P (τ > n)

(6) Assume that there exists a constant B > 0 such that

E(|Xn+1 − Xn | | Fn ) ≤ B a.s.

for all n ∈ N. Show that EXτ = EX1 .

Let Y1 , Y2 , . . . be a sequence of independent and identically distributed random variables satisfying E|Y1 | < ∞. Define Gn = F(Y1 , . . . , Yn ) and

Sn = ∑_{k=1}^n Yk ,  Zn = Sn − nξ ,

where ξ = EY1 .

(7) Show that (Zn , Gn ) is a martingale.

Let σ be a stopping time (with respect to the filtration (Gn )) such that Eσ < ∞.

(8) Show that E(|Zn+1 − Zn | | Gn ) = E(|Y1 − ξ|) a.s. for all n ∈ N.

(9) Show that ESσ = EσEY1 .
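Question (9) is Wald's identity, ESσ = Eσ · EY1 . As an illustration (the concrete setup below, Yi uniform on {1, 2, 3} and σ the first time the sum reaches 10, is our own choice), both sides can be estimated from the same simulated paths:

```python
import random

def wald_check(n_paths, seed=2):
    # Y_i uniform on {1, 2, 3}, so E Y_1 = 2; sigma = inf{n : S_n >= 10}.
    rng = random.Random(seed)
    total_s = total_sigma = 0
    for _ in range(n_paths):
        s = n = 0
        while s < 10:
            s += rng.choice([1, 2, 3])
            n += 1
        total_s += s
        total_sigma += n
    return total_s / n_paths, total_sigma / n_paths

mean_S, mean_sigma = wald_check(20000)
# Wald's identity predicts mean_S to be close to 2 * mean_sigma.
```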

Exercise 5.17. Assume that X1 , X2 , . . . are independent random variables such that for each n it holds Xn ≥ 0 and EXn = 1. Define Fn = F(X1 , . . . , Xn ) and

Yn = ∏_{k=1}^n Xk

(1) Show that (Yn , Fn ) is a martingale.

(2) Show that Y = limn→∞ Yn exists almost surely with E|Y | < ∞.

(3) Show that 0 ≤ EY ≤ 1.

Assume furthermore that all Xn 's are identically distributed satisfying

P (Xn = 1/2) = P (Xn = 3/2) = 1/2 .

(4) Show that Y = 0 a.s.


(5) Conclude that there does not exist a random variable Z such that Yn → Z in L^1.

Exercise 5.18. Let X1 , X2 , . . . be a sequence of real random variables with E|Xn | < ∞ for all n ∈ N. Assume that X is another random variable with E|X| < ∞. The goal of this exercise is to show:

Xn → X in L^1 if and only if E|Xn | → E|X| and Xn → X in probability.

(1) Assume that Xn → X in L^1. Show that E|Xn | → E|X| and Xn → X in probability.

Let U1 , U2 , . . . and V, V1 , V2 , . . . be two sequences of random variables such that E|V | < ∞ and for all n ∈ N

E|Vn | < ∞ and |Un | ≤ Vn ,
Vn → V a.s. as n → ∞ ,
EVn → EV as n → ∞ .

(2) Apply Fatou's lemma on the sequence (Vn − |Un |) to show that

lim sup_{n→∞} E|Un | ≤ E lim sup_{n→∞} |Un |

Hint: You can use that if (an ) is a real sequence, then

lim inf_{n→∞} (−an ) = − lim sup_{n→∞} an

and if (bn ) is another real sequence with bn → b, then

lim inf_{n→∞} bn = b

and

lim inf_{n→∞} (an + bn ) = (lim inf_{n→∞} an ) + b
lim sup_{n→∞} (an + bn ) = (lim sup_{n→∞} an ) + b

(3) Use (2) to show that if E|Xn | → E|X| and Xn → X a.s., then Xn → X in L^1.

(4) Assume that E|Xn | → E|X| and Xn → X in probability. Show that Xn → X in L^1.

Now let (Yn , Fn ) be a martingale. Assume that a random variable Y exists with E|Y | < ∞, such that Yn → Y in probability.

(5) Assume that E|Yn | → E|Y |. Show that Yn → Y a.s.

(6) Show that Y closes the martingale if and only if E|Yn | → E|Y |.

Exercise 5.19. Consider the gambling strategy discussed in Section 5.1: Let Y1 , Y2 , . . . be independent and identically distributed random variables with

P (Y1 = 1) = p  P (Y1 = −1) = 1 − p ,

where 0 < p < 1/2. We think of Yn as the result of a game, where the probability of winning is p, and where if you bet 1 dollar, you will receive 1 dollar if you win, and lose the 1 dollar if you lose the game. We consider the sequence of strategies where the bet is doubled for each lost game, and when a game finally is won, the bet is reset to 1. That is, we define the sequence of strategies (φn ) such that

φ1 = 1

and furthermore recursively for n ≥ 2


φn (y1 , . . . , yn−1 ) = 2φn−1 (y1 , . . . , yn−2 ) if yn−1 = −1, and φn (y1 , . . . , yn−1 ) = 1 if yn−1 = 1.

Then the winnings in the n’th game is

Yn φn (Y1 , . . . , Yn−1 )

and the total winnings in games 1, . . . , n is

Xn = ∑_{k=1}^n Yk φk (Y1 , . . . , Yk−1 )

If e.g. we lose the first three games and win the fourth, then

X1 = −1, X2 = −1 − 2, X3 = −1 − 2 − 2^2 , X4 = −1 − 2 − 2^2 + 2^3 = 1

Define for each n ∈ N the σ–algebra Fn = σ(Y1 , . . . , Yn ).

(1) Show that (Xn , Fn ) is a true supermartingale (meaning a supermartingale that is not
a martingale).

Define the sequence (τk ) by

τ1 = inf{n ∈ N | Yn = 1}
τk+1 = inf{n > τk | Yn = 1}

(2) Show that (τk ) is a sequence of sampling times.

(3) Realise that Xτk = k for all k ∈ N and conclude that (Xτk , Fτk ) is a true submartingale.

Hence we have stopped a true supermartingale and obtained a true submartingale!! In the
next questions we shall compare that result to Theorem 5.4.9.

(4) See that on the set (τ1 > n) we must have Xn = − ∑_{k=1}^n 2^{k−1} = 1 − 2^n and show that

∫_{(τ1 >n)} Xn^− dP = q^n − (2q)^n → −∞ as n → ∞ ,

where q = 1 − p.

(5) Compare the result from 4 with assumption 2 in Theorem 5.4.9.

Now assume that we change the strategy sequence (φn ) in such a way that we limit our betting in order to avoid Xn < −7. Hence we always have Xn ≥ −7. Since all bets are non–negative, we still have that (Xn , Fn ) is a supermartingale.

(6) Let (σk ) be an increasing sequence of stopping times. Show that (Xσk , Fσk ) is a super-
martingale.
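The doubling scheme is easy to simulate, and a simulation makes part (3) concrete: at the time of the k'th won game the accumulated winnings are exactly k. The sketch below (function and variable names are our own) checks this along one simulated path:

```python
import random

def play_doubling(p, n_games, rng):
    # Simulate the double-after-loss strategy: bet 1, double the bet after
    # each lost game, reset the bet to 1 after each won game.  Returns the
    # winnings path (X_1, ..., X_n) and the indices of the won games.
    bet, x = 1, 0
    path, win_times = [], []
    for n in range(1, n_games + 1):
        if rng.random() < p:          # win the n'th game
            x += bet
            bet = 1
            win_times.append(n)
        else:                         # lose: double the next bet
            x -= bet
            bet *= 2
        path.append(x)
    return path, win_times

rng = random.Random(3)
path, win_times = play_doubling(p=0.4, n_games=200, rng=rng)
# At the k'th win the total winnings equal k, as in part (3).
checks = [path[t - 1] == k for k, t in enumerate(win_times, start=1)]
```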

Exercise 5.20. The purpose of this exercise is to show the following theorem:

Let (Xn ) be a martingale and assume that for some p > 1 it holds that

sup_{n≥1} E|Xn |^p < ∞ .

Then a random variable X exists with E|X|^p < ∞ such that

Xn → X a.s. and Xn → X in L^p.

(1) Assume that sup_{n≥1} E|Xn |^p < ∞. Show that there exists X such that Xn → X a.s. and E|X| < ∞.

(2) Assume that both sup_{n≥1} E|Xn |^p < ∞ and E( sup_n |Xn |^p ) < ∞. Show that E|X|^p < ∞ (with X from (1)) and Xn → X in L^p.

In the rest of the exercise we shall show that E( sup_n |Xn |^p ) < ∞ under the assumption that sup_{n≥1} E|Xn |^p < ∞.

(3) Assume that Z ≥ 0 and let r > 0. Show

EZ^r = ∫_0^∞ r t^{r−1} P (Z ≥ t) dt

Define Mn = max1≤k≤n |Xk |.

(4) Show that

EM^p_n ≤ ∫_0^∞ p t^{p−2} ∫_{(Mn ≥t)} |Xn | dP dt

(5) Show that

EM^p_n ≤ (p/(p − 1)) E( M^{p−1}_n |Xn | )

Recall that Hölder's Inequality gives: If p, q > 1 such that 1/p + 1/q = 1 and furthermore Y , Z are random variables with E|Y |^p < ∞ and E|Z|^q < ∞, then

E|Y Z| ≤ (E|Y |^p)^{1/p} (E|Z|^q)^{1/q} .

(6) Show that

EM^p_n ≤ (p/(p − 1))^p E(|Xn |^p)

(7) Conclude that E( sup_n |Xn |^p ) < ∞ under the assumption sup_{n∈N} E|Xn |^p < ∞.
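For p = 2 the constant in (6) is (p/(p − 1))^p = 4, and the resulting L^2 maximal inequality can be checked numerically. The sketch below (our own setup) uses the simple symmetric walk as the martingale and estimates E M_n^2 and E X_n^2:

```python
import random

def doob_l2_check(n, n_paths, seed=6):
    # Estimate E(max_{k<=n} |X_k|)^2 and E X_n^2 for the simple symmetric
    # walk, to compare with Doob's L^2 bound  E M_n^2 <= 4 E X_n^2.
    rng = random.Random(seed)
    sum_m2 = sum_x2 = 0
    for _ in range(n_paths):
        s, m = 0, 0
        for _ in range(n):
            s += 1 if rng.random() < 0.5 else -1
            m = max(m, abs(s))
        sum_m2 += m * m
        sum_x2 += s * s
    return sum_m2 / n_paths, sum_x2 / n_paths

e_m2, e_x2 = doob_l2_check(n=200, n_paths=5000)
# Pathwise M_n >= |X_n|, and here E X_n^2 = n, so the two estimates should
# satisfy e_x2 <= e_m2 <= 4 * e_x2.
```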

Exercise 5.21. (Continuation of Exercise 5.10)

Assume that X1 , X2 , . . . are independent and identically distributed random variables with

P (Xn = 1) = P (Xn = −1) = 1/2 .

Define
Sn = X1 + · · · + Xn

and Fn = σ(X1 , . . . , Xn ) for all n ∈ N.

(1) Show that (Sn , Fn ) and (Sn2 − n, Fn ) are martingales.

Let a, b ∈ Z with a < 0 < b and define

τa,b = inf{n ∈ N : Sn = a or Sn = b}

It was seen in Exercise 5.10, that τa,b is a stopping time.

(2) Show for all n, m ∈ N that

P (τa,b > nm) ≤ ∏_{k=1}^n P (|Skm − S(k−1)m | < b − a) = P (|Sm | < b − a)^n ,

defining S0 = 0.

(3) Show that P (τa,b < ∞) = 1 and conclude that P (Sτa,b ∈ {a, b}) = 1.

(4) Show that ESτa,b ∧n = 0 for all n ∈ N and conclude that ESτa,b = 0.

(5) Show that

P (Sτa,b = a) = b/(b − a)  and  P (Sτa,b = b) = −a/(b − a)

(6) Show that ES^2_{τa,b} = Eτa,b and conclude that Eτa,b = −ab.
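Parts (5) and (6) are the classical gambler's-ruin facts and can be illustrated by simulation. The sketch below (the values a = −4, b = 3 and the function name are our own choices) estimates P (Sτ = b) and Eτ and compares them with −a/(b − a) and −ab:

```python
import random

def simulate_exit(a, b, n_paths, seed=4):
    # Symmetric +-1 walk started at 0, stopped at the first visit to a or b.
    rng = random.Random(seed)
    hit_b = total_time = 0
    for _ in range(n_paths):
        s = n = 0
        while a < s < b:
            s += 1 if rng.random() < 0.5 else -1
            n += 1
        hit_b += (s == b)
        total_time += n
    return hit_b / n_paths, total_time / n_paths

a, b = -4, 3
p_hat, mean_tau = simulate_exit(a, b, n_paths=20000)
# Theory: P(S_tau = b) = -a/(b - a) = 4/7 and E tau = -a*b = 12.
```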

Define the stopping time

τb = inf{n ∈ N : Sn = b}

(7) Show that P (F ) = 1, where F = ∩_{n=1}^∞ (τ−n,b < ∞).

(8) Show P ((τ−n,b = τb ) ∩ F ) → 1 as n → ∞. Conclude that P (G) = 1, where

G = ∪_{n=1}^∞ ( (τ−n,b = τb ) ∩ F )

(9) Show that P (τb < ∞) = 1.



(10) Show that Eτb = ∞.

(11) Argue that

lim inf_{N→∞} ∫_{(τb >N )} |SN | dP ≠ 0

(12) Show that

P ( sup_{n≥1} Sn = ∞ ) = 1

From symmetry it is seen that also

P ( inf_{n≥1} Sn = −∞ ) = 1

Exercise 5.22. Let (Ω, F, Fn , P ) be a filtered probability space, and let (Yn )n≥1 and (Zn )n≥1
be two adapted sequences of real random variables. Define furthermore Z0 ≡ 1. Assume that
Y1 , Y2 , . . . are independent and identically distributed with E|Y1 |3 < ∞ and EY1 = 0. Assume
furthermore that for all n ≥ 2 it holds that Yn is independent of Fn−1 . Finally, assume that
E|Zn |3 < ∞ for all n ∈ N. Define for all n ∈ N
Mn = ∑_{m=1}^n Zm−1 Ym

(1) Show that (Mn , Fn ) is a martingale.

(2) Assume that

(1/n) ∑_{m=0}^{n−1} Z^2_m → α^2 > 0 in probability

and

(1/n^{3/2}) ∑_{m=0}^{n−1} |Zm |^3 → 0 a.s.

Show that

(1/√n) Mn → N (0, α^2 σ^2) in distribution,

where σ^2 = EY1^2.

Define N1 ≡ Y1 and for n ≥ 2

Nn = Y1 + ∑_{m=2}^n (1/m) Ym−1 Ym

(3) Argue that (Nn , Fn ) is a martingale.

(4) Show that for all n ≥ 2

EN^2_n = EN^2_{n−1} + (1/n^2)(σ^2)^2
(5) Show that the sequence (Nn ) is uniformly integrable.

(6) Show that N∞ = limn→∞ Nn exists almost surely and in L1 . Find EN∞ .

(7) Show that for 1 ≤ i < j it holds that

EYi−1 Yi Yj−1 Yj = 0

and use this to conclude that for k, n ∈ N

E(Nn+k − Nn )^2 = ∑_{m=n+1}^{n+k} (1/m^2)(σ^2)^2

(8) Show that Nn → N∞ in L2 .

Define for all n ∈ N

M^∗_n = ∑_{m=1}^n Ym−1 Ym

with the definition Y0 ≡ 1. In the following questions you can use Kronecker's Lemma, that is a mathematical result: If (xn ) is a sequence of real numbers such that limn→∞ ∑_{k=1}^n xk = s exists, and if 0 < b1 ≤ b2 ≤ · · · with bn → ∞, then

lim_{n→∞} (1/bn ) ∑_{k=1}^n bk xk = 0 .

(9) Show that

lim_{n→∞} (1/n) M^∗_n = 0 a.s.

(10) Use the strong law of large numbers to show

(1/n) ∑_{k=1}^n Y^2_k → σ^2 a.s.

and

(1/n^{3/2}) ∑_{k=1}^n |Yk |^3 → 0 a.s.

(11) Show that

(1/√n) M^∗_n → N (0, (σ^2)^2) in distribution


Chapter 6

The Brownian motion

The first attempt to define the stochastic process which is now known as the Brownian motion
was made by the Frenchman Bachelier, who at the end of the 19th century tried to give a
statistical description of the random price fluctuations on the stock exchange in Paris. Some
years later, a variation of the Brownian motion appears in Einstein's 1905 work on the molecular-kinetic theory of heat,
but the first precise mathematical definition is due to Norbert Wiener (1923) (which explains
the name one occasionally sees: the Wiener process). The Frenchman Paul Lévy explored
and discovered some of the fundamental properties of Brownian motion and since that time
thousands of research papers have been written concerning what is unquestionably the most
important of all stochastic processes.

Brown himself has only contributed his name to the theory of the process: he was a botanist
and in 1828 observed the seemingly random motion of flower pollen suspended in water,
where the pollen grains constantly changed direction, a phenomenon he explained as being
caused by the collision of the microscopic pollen grains with water molecules.

So far the largest collection of random variables under study have been sequences indexed by
N. In this chapter we study stochastic processes indexed by [0, ∞). In Section 6.1 we discuss
how to define such processes indexed by [0, ∞), and we furthermore define and show the
existence of the important Brownian motion. In following sections we study the behaviour
of the so–called sample paths of the Brownian motion. In Section 6.2 we prove that there
exists a continuous version, and in the remaining sections we study how well–behaved the
sample paths are – apart from being continuous.

6.1 Definition and existence

We begin with a brief presentation of some definitions and results from the general theory of
stochastic processes.

Definition 6.1.1. A stochastic process in continuous time is a family X = (Xt )t≥0 of real
random variables, defined on a probability space (Ω, F, P ).

In Section 2.3 we regarded a sequence (Xn )n≥1 of real random variables as a random variable
with values in R∞ equipped with the σ–algebra B∞ . Similarly we will regard a stochastic
process X in continuous time as having values in the space R[0,∞) consisting of all functions
x : [0, ∞) → R. The next step is to equip R[0,∞) with a σ–algebra. For this, define the
coordinate projections X̂t by

X̂t (x) = xt for x ∈ R[0,∞)

for all t ≥ 0. Then we define

Definition 6.1.2. Let B[0,∞) denote the smallest σ–algebra that makes X̂t (B[0,∞) − B)
measurable for all t ≥ 0.

Then we have

Lemma 6.1.3. Let X : Ω → R[0,∞) . Then X is F − B[0,∞) measurable if and only if X̂t ◦ X is F − B measurable for all t ≥ 0.

Proof. The proof will be identical to the proof of Lemma 2.3.3: If X is F − B[0,∞) measurable
we can use that X̂t by definition is B[0,∞) − B measurable, so the composition is F − B
measurable. Conversely, assume that X̂t ◦ X is F − B measurable for all t ≥ 0. To show that
X is F − B[0,∞) measurable, it suffices to show that X −1 (A) ∈ F for all A in the generating
system H = {X̂t−1 (B) | t ≥ 0, B ∈ B} for B[0,∞) . But for any t ≥ 0 and B ∈ B we have
X −1 (X̂t−1 (B)) = (X̂t ◦ X)−1 (B) ∈ F by our assumptions.

Lemma 6.1.4. Let X = (Xt )t≥0 be a stochastic process. Then X is F − B[0,∞) measurable.

Proof. Note that X̂t ◦ X = Xt and Xt is F − B measurable by assumption. The result follows
from Lemma 6.1.3.

If X = (Xt )t≥0 is a stochastic process, we can consider the distribution X(P ) on (R[0,∞) , B[0,∞) ). For determining such a distribution, the following lemma will be useful.

Lemma 6.1.5. Define H as the family of sets of the form

{x ∈ R[0,∞) | (xt1 , . . . , xtn ) ∈ Bn } = ( (X̂t1 , . . . , X̂tn ) ∈ Bn ) ,

where n ∈ N, 0 ≤ t1 < · · · < tn and Bn ∈ Bn . Then H is a generating family for B[0,∞) which is stable under finite intersections.

Proof. It is immediate (but notationally heavy) to see that H is stable under finite intersections. Let F = ( (X̂t1 , . . . , X̂tn ) ∈ Bn ) ∈ H and note that the vector (X̂t1 , . . . , X̂tn ) is B[0,∞) − Bn measurable, so F ∈ B[0,∞) . Therefore H ⊆ B[0,∞) , so also σ(H) ⊆ B[0,∞) . For the converse inclusion, note that for all t ≥ 0 and B ∈ B it holds that X̂t−1 (B) = (X̂t ∈ B) ∈ H, so each coordinate projection must be σ(H) − B measurable. As B[0,∞) is the smallest σ–algebra with this property, we conclude that B[0,∞) ⊆ σ(H). All together we have the desired result B[0,∞) = σ(H).

If P̂ is a probability on (R[0,∞) , B[0,∞) ) then

P^{(n)}_{t1 ...tn} (Bn ) = P̂ ((X̂t1 , . . . , X̂tn ) ∈ Bn ) (6.1)

defines a probability on (Rn , Bn ) for all n ∈ N, t1 < · · · < tn . The class of all P^{(n)}_{t1 ...tn} is the class of finite-dimensional distributions for P̂ .

If X is a real stochastic process with distribution P̂ then P^{(n)}_{t1 ...tn} given by (6.1) is the distribution of (Xt1 , . . . , Xtn ) and the class (P^{(n)}_{t1 ...tn}) is called the class or family of finite-dimensional distributions for X.

From Lemma 6.1.5 and Theorem A.2.4, it follows that a probability P̂ on (R[0,∞) , B[0,∞) )
is uniquely determined by the finite-dimensional distributions. The main result concerning
the construction of stochastic processes, Kolmogorov’s consistency theorem, gives a simple
condition for when a given class of finite-dimensional distributions is the class of finite-
dimensional distributions for one (and necessarily only one) probability on (R[0,∞) , B [0,∞) ).

With P̂ a probability on (R[0,∞) , B[0,∞) ), it is clear that the finite-dimensional distributions


for P̂ fit together in the following sense: the P̂ -distribution of (X̂t1 , . . . , X̂tn ) can be obtained
as the marginal distribution of (X̂u1 , . . . , X̂um ) for any choice of m and 0 ≤ u1 < . . . < um
such that {t1 , . . . , tn } ⊆ {u1 , . . . , um }. In particular, a class of finite-dimensional distributions

must always fulfil the following consistency condition: for all n ∈ N, 0 ≤ t1 < . . . < tn+1 and all k, 1 ≤ k ≤ n + 1, we have

P^{(n)}_{t1 ...tk−1 tk+1 ...tn+1} = πk (P^{(n+1)}_{t1 ...tn+1}), (6.2)

where πk : R^{n+1} → R^n is given by

πk (y1 , . . . , yn+1 ) = (y1 , . . . , yk−1 , yk+1 , . . . , yn+1 ).

If X = (Xt )t≥0 has distribution P̂ with finite-dimensional distributions (P^{(n)}_{t1 ...tn}), then (6.2) merely states that the distribution of (Xt1 , . . . , Xtk−1 , Xtk+1 , . . . , Xtn+1 ) is the marginal distribution in the distribution of (Xt1 , . . . , Xtn+1 ) which is obtained by excluding Xtk .

We will without proof use

Theorem 6.1.6 (Kolmogorov's consistency theorem). If P = (P^{(n)}_{t1 ...tn}) is a family of finite-dimensional distributions, defined for n ∈ N, 0 ≤ t1 < · · · < tn , which satisfies the consistency condition (6.2), then there exists exactly one probability P̂ on (R[0,∞) , B[0,∞) ) which has P as its family of finite-dimensional distributions.

We shall use the consistency theorem to prove the existence of a Brownian motion, that is
defined by

Definition 6.1.7. A real stochastic process X = (Xt )t≥0 defined on a probability space
(Ω, F, P ) is a Brownian motion with drift ξ ∈ R and variance σ 2 > 0, if the following three
conditions are satisfied

(1) P (X0 = 0) = 1.

(2) For all 0 ≤ s < t the increment Xt − Xs is normally distributed N ((t − s)ξ, (t − s)σ^2).

(3) The increments Xt1 = Xt1 − X0 , Xt2 − Xt1 , . . . , Xtn − Xtn−1 are for all n ∈ N and
0 ≤ t1 < · · · < tn mutually independent.

Definition 6.1.8. A normalised Brownian motion is a Brownian motion with drift ξ = 0


and variance σ 2 = 1.
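Conditions (1)–(3) translate directly into a recipe for sampling the finite-dimensional distributions of a Brownian motion: over 0 = t0 < t1 < · · · < tn the increments are independent N((ti − ti−1)ξ, (ti − ti−1)σ^2) variables, and their cumulative sums give (Xt1 , . . . , Xtn ). A minimal Python sketch (function names and parameter values are our own, not part of the text):

```python
import math
import random

def brownian_fdd(times, xi, sigma2, rng):
    # Sample (X_{t_1}, ..., X_{t_n}) for a Brownian motion with drift xi and
    # variance sigma2 by summing independent Gaussian increments.
    path, x, prev = [], 0.0, 0.0
    for t in times:
        dt = t - prev
        x += rng.gauss(xi * dt, math.sqrt(sigma2 * dt))
        path.append(x)
        prev = t
    return path

rng = random.Random(5)
times = [0.5, 1.0, 2.0]
samples = [brownian_fdd(times, xi=1.0, sigma2=4.0, rng=rng) for _ in range(20000)]
# By the definition, X_2 ~ N(2 * xi, 2 * sigma2) = N(2, 8).
mean_X2 = sum(s[-1] for s in samples) / len(samples)
var_X2 = sum((s[-1] - mean_X2) ** 2 for s in samples) / len(samples)
```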

Theorem 6.1.9. For any ξ ∈ R and σ 2 > 0 there exists a Brownian motion with drift ξ and
variance σ 2 .

Proof. We shall use Kolmogorov’s consistency theorem. The finite dimensional distributions
for the Brownian motion are determined by (1)–(3):
Let 0 ≤ t1 < · · · < tn+1 . Then we know that

Xt1 , Xt2 − Xt1 , . . . , Xtn+1 − Xtn

are independent and normally distributed. Then the vector

(Xt1 , Xt2 − Xt1 , . . . , Xtn+1 − Xtn ) (6.3)

is n + 1–dimensional normally distributed. Since

(Xt1 , Xt2 , . . . , Xtn+1 ) (6.4)

is a linear transformation of (6.3), then (6.4) is n+1-dimensional normally distributed as well.


The distribution of such a normal vector is determined by the mean vector and covariance
matrix. We have
E(Xt ) = E(Xt − X0 ) = (t − 0)ξ = tξ
and for s ≤ t

Cov(Xs , Xt ) = Cov(Xs , Xs + (Xt − Xs ))


= V(Xs ) + Cov(Xs , Xt − Xs )
= V(Xs ) + 0 = V(Xs − X0 ) = (s − 0)σ 2 = sσ 2 .

We have shown that the finite-dimensional distributions of a Brownian motion with drift ξ and variance σ^2 are given by

P^{(n+1)}_{t1 ,...,tn+1} = N (m, Σ) ,

where the mean vector is m = (t1 ξ, t2 ξ, t3 ξ, . . . , tn+1 ξ) and the covariance matrix Σ has entries

Σij = min(ti , tj ) σ^2 , i, j = 1, . . . , n + 1 .
Finding πk (P^{(n+1)}_{t1 ...tn+1}) (cf. (6.2)) is now simple: the result is an n-dimensional normal distribution, where the mean vector is obtained by deleting the k'th entry in the mean vector for P^{(n+1)}_{t1 ...tn+1}, and the covariance matrix is obtained by deleting the k'th row and the k'th column of the covariance matrix for P^{(n+1)}_{t1 ...tn+1}. It is immediately seen that we thus obtain P^{(n)}_{t1 ...tk−1 tk+1 ...tn+1}, so by the consistency theorem there is exactly one probability P̂ on (R[0,∞) , B[0,∞) ) with finite-dimensional distributions given by the normal distribution above. With this probability measure P̂ , the process consisting of all the coordinate projections X̂ = (X̂t )t≥0 becomes a Brownian motion with drift ξ and variance σ^2.

The following lemma will be useful in Section 6.2:


Lemma 6.1.10. Assume that X = (Xt ) is a Brownian motion with drift ξ and variance σ^2. Let u ≥ 0. Then

(Xs )s≥0 =^D (Xu+s − Xu )s≥0 ,

where =^D denotes equality of distributions.

Proof. We will show that the two processes have the same finite–dimensional distributions. So let 0 ≤ t1 < · · · < tn . Then we show

(Xt1 , . . . , Xtn ) =^D (Xt1 +u − Xu , Xt2 +u − Xu , . . . , Xtn +u − Xu ) . (6.5)

In the proof of Theorem 6.1.9 we obtained that

(Xt1 , . . . , Xtn )

is n–dimensional normally distributed, since it is a linear transformation of

(Xt1 − X0 , Xt2 − Xt1 , . . . , Xtn − Xtn−1 )

where the coordinates are independent and normally distributed. In the exact same way we
can see that
(Xu+t1 − Xu , . . . , Xu+tn − Xu )
is n–dimensional normally distributed, since it is a linear transformation of

(Xu+t1 − Xu , Xu+t2 − Xu+t1 , . . . , Xu+tn − Xu+tn−1 )

that have independent and normally distributed coordinates. So both of the vectors in
(6.5) are normally distributed. To see that the two vectors have the same mean vector and
covariance matrix, it suffices to show that for 0 ≤ s

EXs = E(Xu+s − Xu )

and for 0 ≤ s1 < s2

Cov(Xs1 , Xs2 ) = Cov(Xu+s1 − Xu , Xu+s2 − Xu ) .

We obtain
E(Xu+s − Xu ) = EXu+s − EXu = ξ(u + s) − ξu = ξs = EXs
and

Cov(Xu+s1 − Xu , Xu+s2 − Xu )
= Cov(Xu+s1 − Xu , Xu+s1 − Xu + Xu+s2 − Xu+s1 )
= Cov(Xu+s1 − Xu , Xu+s1 − Xu ) + Cov(Xu+s1 − Xu , Xu+s2 − Xu+s1 )
= V(Xu+s1 − Xu ) = σ^2 s1 = Cov(Xs1 , Xs2 ) .

6.2 Continuity of the Brownian motion

In the previous section we saw how it is possible using Kolmogorov’s consistency theorem
to construct probabilities on the function space (R[0,∞) , B[0,∞) ). Thereby we also obtained a
construction of stochastic processes X = (Xt )t≥0 with given finite-dimensional distributions.
However, if one aims to construct processes (Xt ), which are well-behaved when viewed as
functions of t, the function space (R[0,∞) , B[0,∞) ) is much too large, as we shall presently see.
Let X = (Xt )t≥0 be a real process, defined on (Ω, F, P ). The sample paths of the process
are those elements
t → Xt (ω)

in R[0,∞) which are obtained by letting ω vary in Ω. One might then be interested in
determining whether (almost all) the sample paths are continuous, i.e., whether

P (X ∈ C[0,∞) ) = 1 ,

where C[0,∞) ⊆ R[0,∞) is the set of continuous x : [0, ∞) → R. The problem is that C[0,∞) is not in B[0,∞) !

We will show this by finding two B[0,∞) –measurable processes X and Y defined on the same
(Ω, F, P ) and with the same finite dimensional distributions, but such that all sample paths
for X are continuous, and all sample paths for Y are discontinuous in all t ≥ 0. The processes
X and Y are constructed in Example 6.2.1. The existence of such processes X and Y gives
that
(X ∈ C[0,∞) ) = Ω (Y ∈ C[0,∞) ) = ∅

and if C[0,∞) was measurable the identical distributions would lead to

P (X ∈ C[0,∞) ) = P (Y ∈ C[0,∞) ) ,

which is a contradiction!

Example 6.2.1. Let U be defined on (Ω, F, P ) and assume that U has the uniform distri-
bution on [0, 1].

Define
Xt (ω) = 0 for all ω ∈ Ω, t ≥ 0

and

Yt (ω) = 0 if U (ω) − t is irrational, and Yt (ω) = 1 if U (ω) − t is rational.

The finite dimensional distributions of X are degenerate:

P (Xt1 = · · · = Xtn = 0) = 1

for all n ∈ N and 0 ≤ t1 < · · · < tn . For Y we have

P (Yt = 1) = P (U − t ∈ Q) = 0

so P (Yt = 0) = 1 and thereby also

P (Yt1 = · · · = Ytn = 0) = 1

This shows that X and Y have the same finite dimensional distributions. ◦

Thus constructing a continuous process will take more than distributional arguments. In
the following we discuss a concrete approach that leads to the construction of a continuous
Brownian motion.

Definition 6.2.2. If the processes X = (Xt )t≥0 and Y = (Yt )t≥0 are both defined on
(Ω, F, P ), then we say that Y is a version of X if

P (Yt = Xt ) = 1

for all t ≥ 0.

We see that Definition 6.2.2 is symmetric: If Y is a version of X, then X is also a version of Y .

Example 6.2.3. With X and Y as in Example 6.2.1 from above, we have

(Yt = Xt ) = (Yt = 0)

and since we have seen that P (Yt = 0) = 1, then Y is a version of X. ◦

Theorem 6.2.4. If Y is a version of X, then Y has the same distribution as X.

Proof. The idea is to show that Y and X have the same finite–dimensional distributions: With t1 < · · · < tn we have P (Ytk = Xtk ) = 1 for k = 1, . . . , n. Then also

P ((Yt1 , . . . , Ytn ) = (Xt1 , . . . , Xtn )) = P ( ∩_{k=1}^n (Ytk = Xtk ) ) = 1

The aim is to show that there exists a continuous version of the Brownian motion. Define for n ∈ N

Cn◦ = {x ∈ R[0,∞) : x is uniformly continuous on [0, n] ∩ Q}

and

C∞ = ∩_{n=1}^∞ Cn◦

Lemma 6.2.5. If x ∈ C∞ then there exists a uniquely determined continuous function y ∈ R[0,∞) such that yq = xq for all q ∈ [0, ∞) ∩ Q.


Proof. Let x ∈ C∞ and t ≥ 0. Then choose n such that n > t. We have that x ∈ Cn◦ , so x is
uniformly continuous on [0, n] ∩ Q. That means

∀ε > 0 ∃δ > 0 ∀q1 , q2 ∈ [0, n] ∩ Q : |q1 − q2 | < δ ⇒ |xq1 − xq2 | < ε

Choose a sequence (qk ) ⊆ [0, n] ∩ Q with qk → t. Then in particular (qk ) is a Cauchy sequence. The uniform continuity of x gives that (xqk ) is a Cauchy sequence as well: Let ε > 0 and find the corresponding δ > 0. We can find K ∈ N such that for all m, n ≥ K it holds

|qm − qn | < δ

But then we must have that

|xqm − xqn | < ε

if only m, n ≥ K. This shows that (xqk ) is Cauchy, and therefore the limit yt = limk→∞ xqk
exists in R. We furthermore have, that the limit yt does not depend on the choice of (qk ):
Let (q̃k ) ⊆ [0, n] ∩ Q be another sequence with q̃k → t. Then

|q̃k − qk | → 0 as k → ∞

and this yields (using the uniform continuity again) that

|xq̃k − xqk | → 0 as k → ∞

so lim xq̃k = lim xqk .

For all t ∈ Q we see that yt = xt , since the continuity of x in t gives limk→∞ xqk = xt .

Finally we have that y is continuous in all t ≥ 0: Let t ≥ 0 and ε > 0 be given, and find δ > 0 according to the uniform continuity. Now choose t′ with |t′ − t| < δ/2. Assume that qk → t and q′k → t′ . We can find K ∈ N such that |q′k − qk | < δ for k ≥ K. Then

|xq′k − xqk | < ε

for all k ≥ K, and thereby we obtain that

|yt′ − yt | ≤ ε .

This shows the desired continuity of y in t.

It is a critical assumption that the continuity is uniform. Consider x given by

xt = 1[√2,∞) (t) .

Then x is continuous on [0, n] ∩ Q, but no y with the required properties exists.


We obtain that C∞ ∈ B[0,∞) since

Cn◦ = ∩_{M=1}^∞ ∪_{N=1}^∞ ∩_{q1 ,q2 ∈[0,n]∩Q, |q1 −q2 |≤1/N} { x ∈ R[0,∞) : |xq2 − xq1 | < 1/M } ,

which is a B[0,∞) –measurable set, since

{ x ∈ R[0,∞) : |xq2 − xq1 | < 1/M } = ( |X̂q2 − X̂q1 | < 1/M ) ,

where X̂q1 , X̂q2 : R[0,∞) → R are both B[0,∞) − B–measurable.

Definition 6.2.6. A real process X = (Xt )t≥0 is continuous in probability if for all t ≥ 0 and all sequences (tk )k∈N with tk ≥ 0 and tk → t it holds that Xtk → Xt in probability.

Theorem 6.2.7. Let X = (Xt )t≥0 be a real process which is continuous in probability. If

P (X ∈ C∞ ) = 1 ,

then there exists a version Y of X which is continuous.


Proof. Let F = (X ∈ C∞ ). Assume that ω ∈ F . According to Lemma 6.2.5 there exists a
uniquely determined continuous function t 7→ Yt (ω) such that

Yq (ω) = Xq (ω) for all q ∈ [0, ∞) ∩ Q . (6.6)

Furthermore we must have for each t ≥ 0 that a rational sequence (qk ) can be chosen with
qk → t. Then using the continuity of t → Yt (ω) and the property in (6.6) yields that for all
ω∈F
Yt (ω) = lim Yqk (ω) = lim Xqk (ω) .
k→∞ k→∞

If we furthermore define Yt (ω) = 0 for ω ∈ F c , then we have

Yt = lim 1F Xqk .
k→∞

Since all 1F Xqk are random variables (measurable), Yt is a random variable as well. And since t ≥ 0 was chosen arbitrarily, Y = (Yt )t≥0 is a continuous real process (for ω ∈ F c we chose (Yt (ω)) to be constantly 0 – which is a continuous function) that satisfies

P (Yq = Xq ) = 1 for all q ∈ [0, ∞) ∩ Q ,

since P (F ) = 1 and Yq (ω) = Xq (ω) when ω ∈ F .

We still need to show that Y is a version of X. So let t ≥ 0 and find a rational sequence (qk ) with qk → t. Since X is assumed to be continuous in probability we must have

Xqk → Xt in probability

and since we have Yqk = Xqk a.s. it holds that

Yqk → Xt in probability.

From the (true) continuity we have the convergence (for all ω ∈ Ω)

Yqk → Yt

Then

P (Yt = Xt ) = 1 ,

as desired.

Theorem 6.2.8. Let X = (Xt )t≥0 be a Brownian motion with drift ξ and variance σ 2 > 0.
Then X has a continuous version.

Proof. It is sufficient to consider the normalized case, where ξ = 0 and σ^2 = 1. For a general choice of ξ and σ^2 we have that

X̃t = (Xt − ξt)/σ

is a normalized Brownian motion. And obviously, (Xt )t≥0 is continuous if and only if (X̃t )t≥0 is continuous.

So let X = (Xt )t≥0 be a normalized Brownian motion. Firstly, we show that X is continuous
in probability. For all 0 ≤ s < t we have

Xt − Xs ∼ N (0, t − s)

such that

(Xt − Xs )/√(t − s) ∼ N (0, 1)

Then for ε > 0 we have

P (|Xt − Xs | > ε) = P ( |Xt − Xs |/√(t − s) > ε/√(t − s) )
= ∫_{−∞}^{−ε/√(t−s)} (1/√(2π)) e^{−u^2/2} du + ∫_{ε/√(t−s)}^{∞} (1/√(2π)) e^{−u^2/2} du
= 2 ∫_{ε/√(t−s)}^{∞} (1/√(2π)) e^{−u^2/2} du

For general s, t ≥ 0 with s ≠ t we clearly have

P (|Xt − Xs | > ε) = 2 ∫_{ε/√|t−s|}^{∞} (1/√(2π)) e^{−u^2/2} du

and this decreases to 0 as |t − s| → 0. Hence in particular, we have for tk → t that

P (|Xt − Xtk | > ε) → 0 as k → ∞

which demonstrates the continuity in probability.

The following, that is actually a stronger version of the continuity in probability, will be useful. It holds for all ε > 0 that

lim_{h↓0} (1/h) P (|Xh | > ε) = 0 . (6.7)

This follows from Markov's inequality, since

(1/h) P (|Xh | > ε) = (1/h) P (Xh^4 > ε^4) ≤ (1/(hε^4)) EXh^4 = (h/ε^4) E((1/√h) Xh )^4 = 3h/ε^4 ,

which has limit 0 as h → 0. In the last equality we used that (1/√h) Xh is N (0, 1)–distributed and that the N (0, 1)–distribution has fourth moment = 3.

It is left to show that

P (X ∈ C∞ ) = 1

and for this it suffices to show that

P (X ∈ Cn◦ ) = 1

for all n ∈ N, recalling that a countable intersection of sets with probability 1 has probability
1. We show this for n = 1 (higher values of n would not change the argument, only make
the notation more involved). Define
 
1
VN = sup |Xq0 − Xq | : q, q 0 ∈ Q ∩ [0, 1], |q 0 − q| ≤ N .
2
Then V_N decreases as N → ∞, so

(X ∈ C1◦) = ⋂_{M=1}^{∞} ⋃_{N=1}^{∞} ⋂_{q1,q2 ∈ [0,1]∩Q, |q2−q1| ≤ 1/2^N} ( |X_{q2} − X_{q1}| ≤ 1/M )

= ⋂_{M=1}^{∞} ⋃_{N=1}^{∞} ( V_N ≤ 1/M ) = ( lim_{N→∞} V_N = 0 ) .

Hence we need to show that P(lim_{N→∞} V_N = 0) = 1. Since we already know that V_N is decreasing, it will be enough to show that V_N → 0 in probability. So we need to show that for any ε > 0, P(V_N > ε) → 0 as N → ∞.

For this, define for N ∈ N and k = 1, . . . , 2^N

Y_{k,N} = sup{ |X_q − X_{(k−1)/2^N}| : q ∈ J_{k,N} } ,

where

J_{k,N} = [ (k−1)/2^N , k/2^N ] ∩ Q .

If we can show that

(1) V_N ≤ 3 max{ Y_{k,N} | 1 ≤ k ≤ 2^N }

(2) P(Y_{k,N} > y) = P(Y_{1,N} > y) ≤ 2 P( |X_{1/2^N}| > y )

then we obtain

P(V_N > ε) ≤ P( max_{1≤k≤2^N} Y_{k,N} > ε/3 ) = P( ⋃_{k=1}^{2^N} ( Y_{k,N} > ε/3 ) )
≤ Σ_{k=1}^{2^N} P( Y_{k,N} > ε/3 ) = 2^N P( Y_{1,N} > ε/3 ) ≤ 2^{N+1} P( |X_{1/2^N}| > ε/3 ) ,

which has limit 0 as N → ∞ because of (6.7). The first inequality is by (1), the second inequality follows from Boole's inequality, and the last inequality is due to (2). Hence, the proof is complete if we can show (1) and (2).

For (1): Consider for some fixed N ∈ N the q, q′ that are used in the definition of V_N. Hence q < q′ ∈ Q ∩ [0, 1] with |q′ − q| ≤ 1/2^N. We have two possibilities:

Either q, q′ belong to the same J_{k,N}, in which case

|X_{q′} − X_q| = |X_{q′} − X_{(k−1)/2^N} + X_{(k−1)/2^N} − X_q|
≤ |X_{q′} − X_{(k−1)/2^N}| + |X_q − X_{(k−1)/2^N}|
≤ 2 Y_{k,N}
≤ 2 max{ Y_{k,N} | 1 ≤ k ≤ 2^N } ,

or q ∈ J_{k−1,N} and q′ ∈ J_{k,N}. Then

|X_{q′} − X_q| = |X_{q′} − X_{(k−1)/2^N} + X_{(k−1)/2^N} − X_{(k−2)/2^N} + X_{(k−2)/2^N} − X_q|
≤ |X_{q′} − X_{(k−1)/2^N}| + |X_{(k−1)/2^N} − X_{(k−2)/2^N}| + |X_q − X_{(k−2)/2^N}|
≤ Y_{k,N} + 2 Y_{k−1,N}
≤ 3 max{ Y_{k,N} | 1 ≤ k ≤ 2^N } .

In either case we have

|X_{q′} − X_q| ≤ 3 max{ Y_{k,N} | 1 ≤ k ≤ 2^N } ,

where the right-hand side does not depend on q, q′. Property (1) follows by taking the supremum.

For (2): Note that for all k = 2, . . . , 2^N, the variable Y_{k,N} is calculated from the process

( X_{(k−1)/2^N + s} − X_{(k−1)/2^N} )_{s≥0}

in the exact same way as Y_{1,N} is calculated from the process (X_s)_{s≥0}.

Also note that because of Lemma 6.1.10 the two processes

( X_{(k−1)/2^N + s} − X_{(k−1)/2^N} )_{s≥0} and (X_s)_{s≥0}

have the same distribution. Then also Y_{k,N} =D Y_{1,N} for all k = 2, . . . , 2^N, such that in particular

P(Y_{k,N} > y) = P(Y_{1,N} > y) ≤ 2 P( |X_{1/2^N}| > y )

for all y > 0. The inequality comes from Lemma 6.2.9 below, since J_{1,N} is countable with J_{1,N} ⊆ [0, 1/2^N].

Lemma 6.2.9. Let X = (X_t)_{t≥0} be a normalized Brownian motion and let D ⊆ [0, t0] be an at most countable set. Then it holds for x > 0 that

P( sup_{t∈D} X_t > x ) ≤ 2 P(X_{t0} > x)

P( sup_{t∈D} |X_t| > x ) ≤ 2 P(|X_{t0}| > x)

Proof. First assume that D is finite such that D = {t1 , . . . , tn }, where 0 ≤ t1 < · · · < tn ≤ t0 .
Define
τ = min{k ∈ {1, . . . , n} | Xtk > x}

and let τ = n if Xtk ≤ x for all k = 1, . . . , n. Then

P( sup_{t∈D} X_t > x ) = Σ_{k=1}^{n−1} P(τ = k) + P(τ = n, X_{tn} > x) .

Let k ≤ n − 1. Then

(τ = k) = ⋂_{j=1}^{k−1} (X_{tj} ≤ x) ∩ (X_{tk} > x) ,

and note that (X_{t1}, . . . , X_{tk}) ⊥⊥ (X_{tn} − X_{tk}), so in particular

(τ = k) ⊥⊥ (X_{tn} − X_{tk}) .

Furthermore (X_{tn} − X_{tk}) ∼ N(0, tn − tk), so P(X_{tn} − X_{tk} > 0) = 1/2. Hence

P(τ = k) = 2 P(τ = k) P(X_{tn} − X_{tk} > 0) = 2 P(τ = k, X_{tn} − X_{tk} > 0)
= 2 P(τ = k, X_{tn} > X_{tk}) ≤ 2 P(τ = k, X_{tn} > x) ,

where it is used that X_{tk} > x on (τ = k). Then

P( sup_{t∈D} X_t > x ) = Σ_{k=1}^{n−1} P(τ = k) + P(τ = n, X_{tn} > x)
≤ 2 Σ_{k=1}^{n−1} P(τ = k, X_{tn} > x) + 2 P(τ = n, X_{tn} > x)
= 2 P(X_{tn} > x) ≤ 2 P(X_{t0} > x) .

In the last inequality it is used that tn ≤ t0, so that X_{t0} has a larger variance than X_{tn} (both variables have mean 0).

Thereby we have shown the first result in the case where D is finite. To obtain the second result for a finite D, consider the process −X = (−X_t)_{t≥0}, which is again a normalised Brownian motion. Hence for x > 0 we have

P( inf_{t∈D} X_t < −x ) = P( sup_{t∈D} (−X_t) > x ) ≤ 2 P(−X_{t0} > x) = 2 P(X_{t0} < −x) ,

so we can obtain

P( sup_{t∈D} |X_t| > x ) = P( ( sup_{t∈D} X_t > x ) ∪ ( inf_{t∈D} X_t < −x ) )
≤ P( sup_{t∈D} X_t > x ) + P( inf_{t∈D} X_t < −x )
≤ 2 P(X_{t0} > x) + 2 P(X_{t0} < −x)
= 2 P(|X_{t0}| > x) .

Then we have also shown the second result, when D is finite.

For a general D, find a sequence (D_n) of finite subsets of D with D_n ↑ D. Then the two inequalities hold for each D_n. Since furthermore

( sup_{t∈D_n} X_t > x ) ↑ ( sup_{t∈D} X_t > x )

( sup_{t∈D_n} |X_t| > x ) ↑ ( sup_{t∈D} |X_t| > x ) ,

the continuity of the probability measure P yields that

P( sup_{t∈D} X_t > x ) = lim_{n→∞} P( sup_{t∈D_n} X_t > x ) ≤ 2 P(X_{t0} > x)

P( sup_{t∈D} |X_t| > x ) = lim_{n→∞} P( sup_{t∈D_n} |X_t| > x ) ≤ 2 P(|X_{t0}| > x) ,

which completes the proof of the lemma.
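Lemma 6.2.9 can be illustrated by Monte Carlo simulation over a finite grid D (a sketch with arbitrary grid size, level and seed; since the inequality is exact for every finite D, the empirical frequencies should satisfy it up to sampling noise):

```python
import math
import random

random.seed(1)

def simulate(n_steps, t0):
    # one Brownian path on the grid {t0/n, 2*t0/n, ..., t0}; returns (running max, endpoint)
    dt = t0 / n_steps
    x = m = 0.0
    for _ in range(n_steps):
        x += random.gauss(0.0, math.sqrt(dt))
        m = max(m, x)
    return m, x

trials, level = 20000, 1.0
hit_max = hit_end = 0
for _ in range(trials):
    m, x = simulate(16, 1.0)
    hit_max += m > level
    hit_end += x > level

p_max, p_end = hit_max / trials, hit_end / trials
# P(sup_{t in D} X_t > x) <= 2 P(X_{t0} > x), up to Monte Carlo noise
assert p_end <= p_max <= 2 * p_end + 0.02
```

Here p_end estimates P(X_1 > 1) ≈ 0.159, while p_max stays below twice that value.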

6.3 Variation and quadratic variation

In this and the subsequent section we study the sample paths of a continuous Brownian
motion. In this framework it will be useful to consider the space C[0,∞) consisting of all

functions x ∈ R[0,∞) that are continuous. As for the projections on R[0,∞), we let X̃_t denote the coordinate projections on C[0,∞), that is, X̃_t(x) = x_t for all x ∈ C[0,∞). Let C[0,∞) denote the smallest σ-algebra that makes all X̃_t C[0,∞)−B-measurable:

C[0,∞) = σ( X̃_t | t ≥ 0 ) .

Similarly to what we have seen previously, C[0,∞) is generated by all the finite-dimensional cylinder sets:

C[0,∞) = σ( { ( (X̃_{t1}, . . . , X̃_{tn}) ∈ B_n ) | n ∈ N, 0 < t1 < · · · < tn, B_n ∈ B^n } ) .

We demonstrated in Section 6.2 that there exists a process X defined on (Ω, F, P ) with
values in (R[0,∞) , B[0,∞) ) such that X is a Brownian motion X = (Xt ) and the sample paths
t 7→ Xt (ω) are continuous for all ω ∈ Ω. Equivalently, we have X(ω) ∈ C[0,∞) for all ω ∈ Ω,
so we can regard the process X as having values in C[0,∞) . That X is measurable with values
in (R[0,∞) , B[0,∞) ) means that X̂t (X) is F − B measurable for all t ≥ 0. But X̃t (X) = X̂t (X)
since X is continuous, so X̃t (X) is also F −B measurable for all t ≥ 0. Then X is measurable,
when regarded as a variable with values in (C[0,∞) , C[0,∞) ). The distribution X(P ) of X will
be a distribution on (C[0,∞), C[0,∞)), and this is uniquely determined by its behaviour on the finite-dimensional cylinder sets of the form ( (X̃_{t1}, . . . , X̃_{tn}) ∈ B_n ).

The space (C[0,∞), C[0,∞)) is significantly easier to deal with than (R[0,∞), B[0,∞)), and a number
of interesting functionals become measurable on C[0,∞) , while they are not measurable on
R[0,∞) . For instance, for t > 0 we have that

M̃ = sup_{s∈[0,t]} X̃_s

is a measurable function (a random variable) on (C[0,∞), C[0,∞)), which can be seen by

(M̃ ≤ y) = ⋂_{s∈[0,t]} (X̃_s ≤ y) = ⋂_{q∈[0,t]∩Q} (X̃_q ≤ y) ,

where the last intersection is countable and hence measurable. For the last equality, the inclusion '⊆' is trivial. For the converse inclusion, assume that

x ∈ ⋂_{q∈[0,t]∩Q} (X̃_q ≤ y) .

Then x_q ≤ y for all q ∈ [0, t] ∩ Q. Let s ∈ [0, t] and find a rational sequence q_n → s. Then x_s = lim_{n→∞} x_{q_n} ≤ y, and since s was arbitrarily chosen, it holds that

x ∈ ⋂_{s∈[0,t]} (X̃_s ≤ y) .

We will define various concepts that can be used to describe the behaviour of the sample
paths of a process.

Definition 6.3.1. Let x ∈ C[0,∞). We say that x is nowhere monotone if for all 0 ≤ s < t, x is neither increasing nor decreasing on [s, t]. Let S ⊆ C[0,∞) denote the set of nowhere monotone functions.

Related to the set S we define M_{st} to be the set of functions which are either increasing or decreasing on the interval [s, t]:

M_{st} = ⋂_{N=1}^{∞} { x ∈ C[0,∞) | x_{t_{kN}} ≥ x_{t_{k−1,N}}, 1 ≤ k ≤ 2^N }
∪ ⋂_{N=1}^{∞} { x ∈ C[0,∞) | x_{t_{kN}} ≤ x_{t_{k−1,N}}, 1 ≤ k ≤ 2^N } ,

where t_{kN} = s + (k/2^N)(t − s) for 0 ≤ k ≤ 2^N. We note that M_{st} ∈ C[0,∞), since e.g.

{ x ∈ C[0,∞) | x_{t_{kN}} ≤ x_{t_{k−1,N}}, 1 ≤ k ≤ 2^N } = ( X̃_{t_{kN}} ≤ X̃_{t_{k−1,N}}, 1 ≤ k ≤ 2^N ) .

Since x ∈ S^c if and only if there exist intervals with rational endpoints on which x is monotone, we can write

S^c = ⋃_{0≤q1<q2, q1,q2∈Q} M_{q1 q2} ,

which shows that S ∈ C[0,∞). We shall see later that P(X ∈ S) = 1 for a continuous Brownian motion X.

Definition 6.3.2. Let x ∈ C[0,∞) and 0 ≤ s < t. The variation of x on [s, t] is defined as

V_{st}(x) = sup Σ_{k=1}^{n} |x_{t_k} − x_{t_{k−1}}| ,

where the sup is taken over all finite partitions s ≤ t0 < · · · < tn ≤ t of [s, t].

The variation has some simple properties:

Lemma 6.3.3. Let x, y ∈ C[0,∞), c ∈ R, 0 ≤ s < t and [s, t] ⊆ [s′, t′]. Then it holds that

(1) V_{st}(x) ≤ V_{s′t′}(x) .

(2) V_{st}(cx) = |c| V_{st}(x) .

(3) V_{st}(x + y) ≤ V_{st}(x) + V_{st}(y) .



Proof. The first statement holds because the sup in V_{s′t′}(x) is over more partitions than the sup in V_{st}(x). For the second result we have

V_{st}(cx) = sup Σ_{k=1}^{n} |c x_{t_k} − c x_{t_{k−1}}| = |c| sup Σ_{k=1}^{n} |x_{t_k} − x_{t_{k−1}}| = |c| V_{st}(x) ,

and the third property follows from

V_{st}(x + y) = sup Σ_{k=1}^{n} |x_{t_k} + y_{t_k} − x_{t_{k−1}} − y_{t_{k−1}}|
≤ sup Σ_{k=1}^{n} ( |x_{t_k} − x_{t_{k−1}}| + |y_{t_k} − y_{t_{k−1}}| )
≤ sup Σ_{k=1}^{n} |x_{t_k} − x_{t_{k−1}}| + sup Σ_{k=1}^{n} |y_{t_k} − y_{t_{k−1}}|
= V_{st}(x) + V_{st}(y) .

There are also situations where the variation is particularly simple.

Lemma 6.3.4. If x ∈ C[0,∞) is monotone on [s, t], then

V_{st}(x) = |x_t − x_s| .

Proof. If x is monotone on [s, t], then the sum telescopes:

Σ_{k=1}^{n} |x_{t_k} − x_{t_{k−1}}| = |x_{t_n} − x_{t_0}| ≤ |x_t − x_s|

for any partition s ≤ t0 < · · · < tn ≤ t of [s, t], with equality for the partition t0 = s, tn = t.
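The telescoping in the proof is easy to verify numerically; a small sketch with a hypothetical increasing function x(u) = u² on [0, 1]:

```python
def variation_sum(x, partition):
    # sum of |x(t_k) - x(t_{k-1})| over consecutive partition points
    return sum(abs(x(b) - x(a)) for a, b in zip(partition, partition[1:]))

x = lambda u: u ** 2                      # increasing on [0, 1]
for n in [2, 7, 100]:
    part = [k / n for k in range(n + 1)]  # partition 0 = t0 < ... < tn = 1
    # for a monotone x, every partition containing the endpoints gives |x(1) - x(0)|
    assert abs(variation_sum(x, part) - abs(x(1) - x(0))) < 1e-12
```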

Let x ∈ C[0,∞) and assume that s ≤ t_{k−1} < t_k ≤ t are given. If (q_n), (r_n) ⊆ [s, t] are rational sequences with q_n → t_{k−1} and r_n → t_k, it holds due to the continuity of x that

lim_{n→∞} |x_{r_n} − x_{q_n}| = |x_{t_k} − x_{t_{k−1}}| .

This shows that all partitions can be approximated arbitrarily well by rational partitions, so the sup in the definition of V_{st} need only be taken over all rational partitions. Hence V_{st} is C[0,∞)−B measurable.

Definition 6.3.5. (1) Let x ∈ C[0,∞) and 0 ≤ s < t. Then x is of bounded variation on [s, t] if V_{st}(x) < ∞. The set of functions of bounded variation on [s, t] is denoted

F_{st} = { x ∈ C[0,∞) | V_{st}(x) < ∞ } .

(2) Let x ∈ C[0,∞). Then x is everywhere of unbounded variation if x ∈ F_{st}^c for all 0 ≤ s < t. Let G = ⋂_{0≤s<t} F_{st}^c denote the set of continuous functions which are everywhere of unbounded variation.

Since V_{st} is C[0,∞)−B measurable, we observe that F_{st} ∈ C[0,∞). Furthermore we can rewrite G as

G = ⋂_{0≤q1<q2, q1,q2∈Q} F_{q1,q2}^c ,

which shows that G ∈ C[0,∞). The equality above is a direct consequence of (1) in Lemma 6.3.3.

The following lemmas show which types of continuous functions have bounded variation.

Lemma 6.3.6. Let x ∈ C[0,∞). Then x ∈ F_{st} if and only if x on [s, t] has the form

x = y − ỹ ,

where both y and ỹ are increasing.

Proof. If x has the form x = y − ỹ on [s, t], where both y and ỹ are increasing, then using Lemma 6.3.3 yields

V_{st}(x) = V_{st}(y − ỹ) ≤ V_{st}(y) + V_{st}(−ỹ) = V_{st}(y) + V_{st}(ỹ) = |y_t − y_s| + |ỹ_t − ỹ_s| ,

which is finite. Conversely, assume that V_{st}(x) < ∞ and define for u ∈ [s, t]

y_u = (1/2)(x_u + V_{su}(x)) and ỹ_u = (1/2)(−x_u + V_{su}(x)) .

Then x = y − ỹ, and furthermore, e.g., u ↦ y_u is increasing: If x_{u+h} ≥ x_u, then y_{u+h} ≥ y_u, since always V_{s,u+h}(x) ≥ V_{s,u}(x). If x_{u+h} < x_u, then

V_{s,u}(x) + |x_{u+h} − x_u| = sup˜ Σ_{j=1}^{n′} |x_{s_j} − x_{s_{j−1}}| + |x_{u+h} − x_u|
≤ sup Σ_{k=1}^{n} |x_{t_k} − x_{t_{k−1}}| = V_{s,u+h}(x) ,

where sup˜ is over all partitions s ≤ s0 < · · · < s_{n′} ≤ u and sup is over all partitions s ≤ t0 < · · · < tn ≤ u + h. Hence we have seen that

y_u = (1/2)( x_{u+h} + |x_{u+h} − x_u| + V_{su}(x) ) ≤ (1/2)( x_{u+h} + V_{s,u+h}(x) ) = y_{u+h} .

Corollary 6.3.7. Let x ∈ C[0,∞) . If x ∈ G then x ∈ S.

Proof. Assume that x ∈ S^c. Then there exist s < t such that x is monotone on [s, t]. If x is increasing there, it has the form x − 0 on [s, t], and if it is decreasing, the form 0 − (−x); in either case both functions are increasing, so x ∈ F_{st} by Lemma 6.3.6, and thus x ∈ G^c.

Lemma 6.3.8. If x ∈ C[0,∞) is continuously differentiable on [s, t], then x ∈ Fst .

Proof. The derivative x′ of x is continuous on [s, t], so x′ must be bounded on [s, t]. Let K = sup_{u∈[s,t]} |x′(u)| and consider an arbitrary partition s ≤ t0 < · · · < tn ≤ t. Then

Σ_{k=1}^{n} |x_{t_k} − x_{t_{k−1}}| = Σ_{k=1}^{n} (t_k − t_{k−1}) |x′(u_k)| ≤ (t − s) K < ∞ ,

where each u_k ∈ [t_{k−1}, t_k] is chosen according to the mean value theorem.
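As a numerical sanity check of Lemma 6.3.8 (a sketch; the choice x = sin on [0, 3] is hypothetical and convenient because sup|x′| = 1 there):

```python
import math

def variation_sum(x, partition):
    return sum(abs(x(b) - x(a)) for a, b in zip(partition, partition[1:]))

# sin is continuously differentiable with sup |x'| = 1 on [0, 3],
# so every partition sum is bounded by (t - s) * K = 3 * 1
for n in [10, 1000]:
    part = [3.0 * k / n for k in range(n + 1)]
    assert variation_sum(math.sin, part) <= 3.0
```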

Definition 6.3.9. Let x ∈ C[0,∞). The quadratic variation of x ∈ C[0,∞) on [s, t] is defined as

Q_{st}(x) = limsup_{N→∞} Σ_{k: s ≤ (k−1)/2^N < k/2^N ≤ t} ( x_{k/2^N} − x_{(k−1)/2^N} )² .

We observe that Q_{st} is C[0,∞)−B measurable and that

Q_{st}(x) ≤ Q_{s′t′}(x) ,   (6.8)

if [s, t] ⊆ [s′, t′].

Lemma 6.3.10. For x ∈ C[0,∞) we have

Vst (x) < ∞ ⇒ Qst (x) = 0 ,

or equivalently
Qst (x) > 0 ⇒ Vst (x) = ∞ .

Proof. For N ∈ N define

K_N = max{ |x_{k/2^N} − x_{(k−1)/2^N}| : k ∈ N, s ≤ (k−1)/2^N < k/2^N ≤ t } .

Since x is uniformly continuous on the compact interval [s, t], we will have K_N → 0 as N → ∞. Furthermore

Q_{st}(x) = limsup_{N→∞} Σ_{k: s ≤ (k−1)/2^N < k/2^N ≤ t} ( x_{k/2^N} − x_{(k−1)/2^N} )²
≤ limsup_{N→∞} K_N Σ_{k: s ≤ (k−1)/2^N < k/2^N ≤ t} |x_{k/2^N} − x_{(k−1)/2^N}|
≤ limsup_{N→∞} V_{st}(x) K_N ,

from which the lemma follows.

The main result of the section is the following theorem, which describes exactly how "wild" the sample paths of the Brownian motion behave.

Theorem 6.3.11. If X = (X_t)_{t≥0} is a continuous Brownian motion with drift ξ and variance σ², then

P( ⋂_{0≤s<t} ( Q_{st}(X) = (t − s)σ² ) ) = 1 .

Before turning to the proof, we observe:

Corollary 6.3.12. If X = (Xt )t≥0 is a continuous Brownian motion with drift ξ and vari-
ance σ 2 , then X is everywhere of unbounded variation,

P (X ∈ G) = 1 .

Proof. Follows by combining Theorem 6.3.11 and Lemma 6.3.10.

Corollary 6.3.13. If X = (Xt )t≥0 is a continuous Brownian motion with drift ξ and vari-
ance σ 2 , then X is nowhere monotone,

P (X ∈ S) = 1 .

Proof. Follows from Corollary 6.3.12 and Corollary 6.3.7.



Proof of Theorem 6.3.11. Firstly we note that

⋂_{0≤s<t} { x ∈ C[0,∞) : Q_{st}(x) = (t − s)σ² } = ⋂_{0≤q1<q2, q1,q2∈Q} { x ∈ C[0,∞) : Q_{q1,q2}(x) = (q2 − q1)σ² } .

The inclusion ⊆ is trivial. The converse inclusion ⊇ is argued as follows: Assume that x is an element of the right hand side and let 0 ≤ s < t be given. We must show that Q_{st}(x) = (t − s)σ². Let (q_n^1), (q_n^2), (r_n^1), (r_n^2) be rational sequences such that q_n^1 ↑ s, q_n^2 ↓ t, r_n^1 ↓ s, and r_n^2 ↑ t. Then for all n ∈ N we must have [r_n^1, r_n^2] ⊆ [s, t] ⊆ [q_n^1, q_n^2], such that because of (6.8) it holds that

Q_{r_n^1, r_n^2}(x) ≤ Q_{st}(x) ≤ Q_{q_n^1, q_n^2}(x)

for all n ∈ N. By assumption

Q_{r_n^1, r_n^2}(x) = (r_n^2 − r_n^1)σ² and Q_{q_n^1, q_n^2}(x) = (q_n^2 − q_n^1)σ²

for all n ∈ N, leading to

lim_{n→∞} Q_{r_n^1, r_n^2}(x) = lim_{n→∞} (r_n^2 − r_n^1)σ² = (t − s)σ²

lim_{n→∞} Q_{q_n^1, q_n^2}(x) = lim_{n→∞} (q_n^2 − q_n^1)σ² = (t − s)σ² ,

which combined with the inequality above gives the desired result that Qst (x) = (t − s)σ 2 .

Since the intersection above is countable, we only need to conclude that each of the sets in the intersection has probability 1 in order to conclude the result. Hence it will suffice to show that

P( Q_{st}(X) = (t − s)σ² ) = 1

for given 0 ≤ s < t (we only need to show it for s, t ∈ Q, but that makes no difference in the rest of the proof). Furthermore, we show the result for s = 0. The general result can be seen in the exact same way, using even more notation.

Define for each n ∈ N

U_{k,n} = √(2^n/σ²) ( X_{k/2^n} − X_{(k−1)/2^n} − ξ/2^n ) .

For the increment in U_{k,n} we have

X_{k/2^n} − X_{(k−1)/2^n} ∼ N( ξ/2^n , σ²/2^n ) ,

so each U_{k,n} ∼ N(0, 1). Furthermore, for fixed n and varying k, the increments are independent, such that also U_{1,n}, U_{2,n}, . . . are independent. We can write

Q_{0t}(X) = limsup_{n→∞} Σ_{k=1}^{[2^n t]} ( X_{k/2^n} − X_{(k−1)/2^n} )²
= limsup_{n→∞} Σ_{k=1}^{[2^n t]} ( √(σ²/2^n) U_{k,n} + ξ/2^n )²
= limsup_{n→∞} Σ_{k=1}^{[2^n t]} ( (σ²/2^n) U_{k,n}² + 2 √(σ²/2^n) (ξ/2^n) U_{k,n} + ξ²/4^n )
= limsup_{n→∞} ( Σ_{k=1}^{[2^n t]} (σ²/2^n) ( U_{k,n}² + (2/√(σ² 2^n)) ξ U_{k,n} ) + ([2^n t]/4^n) ξ² ) .

This gives

Q_{0t}(X) − tσ²
= limsup_{n→∞} ( Σ_{k=1}^{[2^n t]} (σ²/2^n) ( U_{k,n}² + (2/√(σ² 2^n)) ξ U_{k,n} ) − tσ² + ([2^n t]/4^n) ξ² )
= limsup_{n→∞} ( σ² Σ_{k=1}^{[2^n t]} (1/2^n) ( U_{k,n}² + (2/√(σ² 2^n)) ξ U_{k,n} − 1 ) + σ² ( [2^n t]/2^n − t ) + ([2^n t]/4^n) ξ² ) .

We have that

t − 1/2^n = (2^n t − 1)/2^n < [2^n t]/2^n ≤ 2^n t/2^n = t ,

which shows that

[2^n t]/2^n → t and [2^n t]/4^n = (1/2^n) · ([2^n t]/2^n) → 0

as n → ∞. So for the deterministic part of Q_{0t}(X) − tσ² it holds that

σ² ( [2^n t]/2^n − t ) + ([2^n t]/4^n) ξ² → 0
as n → ∞. Then the proof will be complete if we can show that S_n → 0 a.s., where

S_n = (1/2^n) Σ_{k=1}^{[2^n t]} ( U_{k,n}² + (2/√(σ² 2^n)) ξ U_{k,n} − 1 ) .

Note that

E S_n = (1/2^n) Σ_{k=1}^{[2^n t]} ( E U_{k,n}² + (2/√(σ² 2^n)) ξ E U_{k,n} − 1 ) = 0 ,

since E U_{k,n}² = 1. According to Lemma 1.2.12 the convergence of S_n is obtained if we can show

Σ_{n=1}^{∞} P(|S_n| > ε) < ∞   (6.9)

for all ε > 0. Each term in this sum satisfies (using Chebyshev's inequality)

P(|S_n| > ε) = P(|S_n − E S_n| > ε) ≤ (1/ε²) V(S_n) ,

such that the sum in (6.9) converges if we can show

Σ_{n=1}^{∞} V(S_n) < ∞ .   (6.10)

Using that U_{1,n}, U_{2,n}, . . . are independent and identically distributed gives

V(S_n) = (1/4^n) Σ_{k=1}^{[2^n t]} V( U_{k,n}² + (2/√(σ² 2^n)) ξ U_{k,n} − 1 )
= [2^n t] (1/4^n) V( U_{1,n}² + (2/√(σ² 2^n)) ξ U_{1,n} − 1 )
= ([2^n t]/4^n) E( U_{1,n}² + (2/√(σ² 2^n)) ξ U_{1,n} − 1 )²
= ([2^n t]/4^n) ( E U_{1,n}⁴ + (4ξ²/(σ² 2^n)) E U_{1,n}² + 1 + (4ξ/√(σ² 2^n)) E U_{1,n}³ − 2 E U_{1,n}² − (4ξ/√(σ² 2^n)) E U_{1,n} ) ,

and since E U_{1,n} = 0, E U_{1,n}² = 1, E U_{1,n}³ = 0, and E U_{1,n}⁴ = 3, we have

V(S_n) = ([2^n t]/4^n) ( 3 + 4ξ²/(σ² 2^n) + 1 − 2 ) = ([2^n t]/4^n) ( 2 + 4ξ²/(σ² 2^n) )
≤ (2^n t/4^n) ( 2 + 4ξ²/(σ² 2^n) ) = (t/2^n) ( 2 + 4ξ²/(σ² 2^n) ) ,

from which it is seen that the sum in (6.10) is finite.
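Theorem 6.3.11 can be illustrated by simulation: the dyadic quadratic-variation sum of a simulated normalized Brownian path on [0, 1] should be close to (t − 0)σ² = 1 (a sketch with an arbitrary seed and dyadic level; the tolerance is generous relative to the Monte Carlo noise):

```python
import math
import random

random.seed(0)
t, sigma2 = 1.0, 1.0
N = 2 ** 14                        # number of dyadic increments on [0, t]
dt = t / N

qv = 0.0                           # quadratic-variation sum over the dyadic grid
for _ in range(N):
    inc = random.gauss(0.0, math.sqrt(sigma2 * dt))
    qv += inc * inc

# Var of the sum is 2 * t^2 * sigma2^2 / N, i.e. std ~ 0.011 here
assert abs(qv - t * sigma2) < 0.1
```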

6.4 The law of the iterated logarithm

In this section we shall show a classical result concerning Brownian motion which in more
detail describes the behaviour of the sample paths immediately after the start of the process.

If X = (X_t)_{t≥0} is a continuous Brownian motion we in particular have that lim_{t→0} X_t = X_0 = 0 a.s. For a more precise description of the behaviour near 0, we seek a function h, increasing and continuous on an interval of the form [0, t0) with h(0) = 0, such that

limsup_{t→0} X_t/h(t) , liminf_{t→0} X_t/h(t)   (6.11)

are both something interesting, i.e., finite and different from 0. A good guess for a sensible h can be obtained by considering a Brownian motion without drift (ξ = 0), and using that then

(1/√t) X_t

has the same distribution for all t > 0, which could be taken as an indication that (1/√t) X_t behaves sensibly for t → 0. But alas, h(t) = √t is too small, although not much of an adjustment is needed before (6.11) yields something interesting.

In the sequel we denote by φ the function

φ(t) = √( 2t log log (1/t) ) ,   (6.12)

which is defined and finite for 0 < t < 1/e. Since

lim_{t→0} t log log (1/t) = lim_{t→∞} (log log t)/t = 0 ,

we have lim_{t→0} φ(t) = 0, so it makes sense to define φ(0) = 0. Then φ is defined, non-negative and continuous on [0, 1/e). We shall also need the useful limit
lim_{x→∞} ( ∫_x^∞ e^{−u²/2} du ) / ( (1/x) e^{−x²/2} ) = 1 ,   (6.13)

which follows from the following inequalities, which all hold for x > 0: since u/x ≥ 1 for u ≥ x we have

∫_x^∞ e^{−u²/2} du ≤ ∫_x^∞ (u/x) e^{−u²/2} du = (1/x) [ −e^{−u²/2} ]_x^∞ = (1/x) e^{−x²/2} ,

and since u/(x+1) ≤ 1 for x ≤ u ≤ x + 1 we have

∫_x^∞ e^{−u²/2} du ≥ ∫_x^{x+1} (u/(x+1)) e^{−u²/2} du
= (1/(x+1)) ( e^{−x²/2} − e^{−(x+1)²/2} )
= (1/x) e^{−x²/2} (x/(x+1)) ( 1 − e^{−x−1/2} ) ,

and then (6.13) follows once we note that

lim_{x→∞} (x/(x+1)) ( 1 − e^{−x−1/2} ) = 1 .
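The limit (6.13) and the upper inequality can be checked numerically by writing the Gaussian tail integral via the complementary error function, ∫_x^∞ e^{−u²/2} du = √(π/2)·erfc(x/√2) (a sketch; the tolerances are illustrative):

```python
import math

def upper_tail(x):
    # ∫_x^∞ e^{-u^2/2} du = sqrt(pi/2) * erfc(x / sqrt(2))
    return math.sqrt(math.pi / 2.0) * math.erfc(x / math.sqrt(2.0))

for x in [2.0, 5.0, 10.0]:
    ratio = upper_tail(x) / ((1.0 / x) * math.exp(-x * x / 2.0))
    assert ratio <= 1.0 + 1e-12    # the upper inequality in the text

# the ratio approaches 1, as claimed in (6.13)
assert abs(upper_tail(10.0) / ((1.0 / 10.0) * math.exp(-50.0)) - 1.0) < 0.02
```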

Theorem 6.4.1 (The law of the iterated logarithm). For a continuous Brownian motion X = (X_t)_{t≥0} with drift ξ and variance σ² > 0, it holds that

P( limsup_{t→0} X_t/(σ φ(t)) = 1 ) = P( liminf_{t→0} X_t/(σ φ(t)) = −1 ) = 1 ,

where φ is given by (6.12).

Proof. We show the theorem for X a continuous, normalised Brownian motion. Since

lim_{t→0} ξt/φ(t) = 0 ,

it then holds that with X normalised,

limsup_{t→0} (σX_t + ξt)/(σ φ(t)) = limsup_{t→0} X_t/φ(t) = 1 a.s.

and similarly for liminf. Since (σX_t + ξt)_{t≥0} has drift ξ and variance σ², the theorem follows for an arbitrary Brownian motion.

In the following it is therefore assumed that X is a continuous, normalised Brownian motion.


We show the theorem by showing the two following claims:
Xt
lim sup ≤ 1 +  a.s. for all  > 0 , (6.14)
t→0 φ(t)
Xt
lim sup ≥ 1 −  a.s. for all  > 0 . (6.15)
t→0 φ(t)
From (6.14) and (6.15) it immediately follows that
 Xt 
P lim sup =1 =1
t→0 φ(t)

and, applying this result to the normalised Brownian motion −X,


 Xt 
P lim inf = −1 = 1.
t→0 φ(t)

To show (6.14), let 0 < u < 1, put t_n = u^n and

C_{n,ε,u} = ⋃_{t∈[t_{n+1}, t_n]} ( X_t > (1 + ε) φ(t) ) .

Since X is continuous, the union can be replaced by a countable union, so C_{n,ε,u} is measurable. For a given ε > 0 and 0 < u < 1 it is seen that

(C_{n,ε,u} i.o.) = ( ∀ n0 ≥ 1 ∃ n ≥ n0 ∃ t ∈ [t_{n+1}, t_n] : X_t > (1 + ε) φ(t) )
= ( ∀ n ≥ 1 ∃ t ≤ t_n : X_t > (1 + ε) φ(t) ) = ( limsup_{t→0} X_t/φ(t) > 1 + ε ) ,

so it is thus clear that (6.14) follows if for every ε > 0 there exists a u, 0 < u < 1, such that

P(C_{n,ε,u} i.o.) = 0 ,

and to deduce this, it is by the Borel–Cantelli lemma (Lemma 1.2.11) sufficient that

Σ_n P(C_{n,ε,u}) < ∞ .   (6.16)

(Note that C_{n,ε,u} is only defined for n so large that t_n = u^n ∈ [0, 1/e), the interval where φ is defined. In all computations we of course only consider such n.)

Since the function φ is continuous on [0, 1/e) with φ(0) = 0 and φ(t) > 0 for t > 0, there exists 0 < δ0 < 1/e such that φ is increasing on the interval [0, δ0]. Therefore it holds for n large (so large that t_n ≤ δ0) that

P(C_{n,ε,u}) ≤ P( ⋃_{t_{n+1}≤t≤t_n} ( X_t > (1 + ε) φ(t_{n+1}) ) ) = P( sup_{t_{n+1}≤t≤t_n} X_t > (1 + ε) φ(t_{n+1}) )
≤ P( sup_{t≤t_n} X_t > (1 + ε) φ(t_{n+1}) ) ≤ 2 P( X_{t_n} > (1 + ε) φ(t_{n+1}) ) ,

where we in the last inequality have used Lemma 6.2.9 and the continuity of X (which implies that sup_{t≤t_n} X_t = sup_{q∈Q∩[0,t_n]} X_q). Since (1/√t_n) X_{t_n} is N(0,1)-distributed it follows that

P(C_{n,ε,u}) ≤ √(2/π) ∫_{x_n}^∞ e^{−s²/2} ds ,

where

x_n = (1 + ε) (1/√t_n) φ(t_{n+1}) = (1 + ε) √( 2u log( (n+1) log(1/u) ) ) .
We see that x_n → ∞ for n → ∞ and hence it holds by (6.13) that

( ∫_{x_n}^∞ e^{−s²/2} ds ) / ( (1/x_n) e^{−x_n²/2} ) → 1 .

In particular we have

( ∫_{x_n}^∞ e^{−s²/2} ds ) / e^{−x_n²/2} → 0 .

For n large, we thus have

P(C_{n,ε,u}) ≤ √(2/π) e^{−x_n²/2} = K (n + 1)^{−(1+ε)² u} ,

where K = √(2/π) ( log(1/u) )^{−(1+ε)² u}. If we for a given ε > 0 choose u < 1 sufficiently close to 1, we obtain that (1 + ε)² u > 1, making

Σ_{n=1}^{∞} K (n + 1)^{−(1+ε)² u} < ∞ .

Hence (6.16) and thereby also (6.14) follow.
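The algebraic identity behind the bound P(C_{n,ε,u}) ≤ K(n+1)^{−(1+ε)²u} can be checked numerically (a sketch with illustrative values ε = 0.5, u = 0.9; note that n must be large enough that (n+1)log(1/u) > 1 so that x_n is defined):

```python
import math

eps, u = 0.5, 0.9
a = (1 + eps) ** 2 * u                     # the exponent (1 + eps)^2 * u; here a > 1
for n in [10, 50, 200]:
    x_n = (1 + eps) * math.sqrt(2 * u * math.log((n + 1) * math.log(1 / u)))
    lhs = math.exp(-x_n ** 2 / 2)
    rhs = math.log(1 / u) ** (-a) * (n + 1) ** (-a)
    assert abs(lhs - rhs) < 1e-12 * rhs    # exp(-x_n^2/2) = (log(1/u))^(-a) (n+1)^(-a)
assert a > 1                               # so the dominating series converges
```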

To show (6.15), let t_n = v^n, where 0 < v < 1, put Z_n = X_{t_n} − X_{t_{n+1}} and define

D_{n,ε,v} = ( Z_n > (1 − ε/2) φ(t_n) ) .

Note that the events D_{n,ε,v} for fixed ε and v and varying n are mutually independent.

We shall show that, given ε > 0, there exists a v, 0 < v < 1, such that

P(D_{n,ε,v} i.o.) = 1   (6.17)

and we claim that this implies (6.15): if we apply (6.14) to −X, we get

P( liminf_{t→0} X_t/φ(t) ≥ −1 ) = 1 ,

so with (6.17) satisfied, we have for almost all ω ∈ Ω that

liminf_{t→0} X_t(ω)/φ(t) ≥ −1

and that there exists a subsequence (n′), n′ → ∞, of natural numbers (depending on ω) such that for all n′

ω ∈ D_{n′,ε,v} .

But then, for all n′,

X_{t_{n′}}(ω)/φ(t_{n′}) = Z_{n′}(ω)/φ(t_{n′}) + X_{t_{n′+1}}(ω)/φ(t_{n′}) > (1 − ε/2) + ( X_{t_{n′+1}}(ω)/φ(t_{n′+1}) ) · ( φ(t_{n′+1})/φ(t_{n′}) ) ,

and since

φ(t_{n+1})/φ(t_n) = √( 2v^{n+1} log( (n+1) log(1/v) ) ) / √( 2v^n log( n log(1/v) ) ) = √v · √( log( (n+1) log(1/v) ) / log( n log(1/v) ) ) → √v

and furthermore

liminf_{n′→∞} X_{t_{n′+1}}(ω)/φ(t_{n′+1}) ≥ liminf_{t→0} X_t(ω)/φ(t) ,

we see that

limsup_{t→0} X_t(ω)/φ(t) ≥ limsup_{n′→∞} X_{t_{n′}}(ω)/φ(t_{n′}) ≥ (1 − ε/2) + √v · liminf_{n′→∞} X_{t_{n′+1}}(ω)/φ(t_{n′+1})
≥ 1 − ε/2 − √v ≥ 1 − ε ,

if v for given ε > 0 is chosen so small that √v < ε/2. Hence we have shown that (6.17) implies (6.15).

We still need to show (6.17). Since the D_{n,ε,v}'s for varying n are independent, (6.17) follows from the second version of the Borel–Cantelli lemma (Lemma 1.3.12) by showing that

Σ_n P(D_{n,ε,v}) = ∞ .   (6.18)

We conclude the proof by showing that this may be achieved by choosing v > 0 sufficiently small for any given ε > 0 (this was already needed to conclude that (6.17) implies (6.15)).
But since Z_n is N(0, t_n − t_{n+1})-distributed, we obtain

P(D_{n,ε,v}) = (1/√(2π)) ∫_{y_n}^∞ e^{−s²/2} ds ,

where

y_n = (1 − ε/2) φ(t_n)/√(t_n − t_{n+1}) = (1 − ε/2) √( (2/(1−v)) log( n log(1/v) ) ) .

Since y_n → ∞, (6.13) implies that

( ∫_{y_n}^∞ e^{−s²/2} ds ) / ( (1/y_n) e^{−y_n²/2} ) → 1 ,

and since

y_n/√(log n) = const. · √( log( n log(1/v) ) )/√(log n) = const. · √( ( log n + log log(1/v) )/log n ) → const. > 0 ,

the proof is now finished by realising that for given ε > 0 we have

Σ_n (1/√(log n)) e^{−y_n²/2} = ∞   (6.19)

if only v is sufficiently close to 0. But

e^{−y_n²/2} = exp( −((1 − ε/2)²/(1 − v)) log( n log(1/v) ) ) = K n^{−α} ,

where α = (1 − ε/2)²/(1 − v) and K = ( log(1/v) )^{−α}, so α < 1 if v is sufficiently small, and it is then a simple matter to obtain (6.19): if, e.g., β > 0 is so small that α + β < 1, then the n'th term of (6.19) becomes

K (1/√(log n)) n^{−α} = K ( n^β/√(log n) ) n^{−(α+β)} > K n^{−(α+β)}

for n sufficiently large, and since Σ n^{−(α+β)} = ∞ the desired conclusion follows.

As an immediate consequence of Theorem 6.4.1 we obtain the following result concerning the number of points where the Brownian motion is zero.

Corollary 6.4.2. If X is a continuous Brownian motion, it holds for almost all ω that for all ε > 0, X_t(ω) = 0 for infinitely many values of t ∈ [0, ε].

Note that it trivially holds that

P( ⋂_{q∈Q∩(0,∞)} (X_q ≠ 0) ) = 1 ,
q∈Q∩(0,∞)

so for almost all ω there exists, for each rational q > 0, an open interval around q where
t → Xt (ω) does not take the value 0. In some sense therefore, Xt (ω) is only rarely 0 but it
still happens an infinite number of times close to 0.

Proof. Theorem 6.4.1 implies that for almost every ω there exist sequences 0 < s_n ↓ 0 and 0 < t_n ↓ 0 such that

X_{s_n}(ω) > (σ/2) φ(s_n) > 0 , X_{t_n}(ω) < −(σ/2) φ(t_n) < 0

for all n. Since X is continuous, the corollary follows.

Our final result is an analogue of Theorem 6.4.1, with t → ∞ instead of t → 0. However, this result only holds for Brownian motions with drift ξ = 0.

Theorem 6.4.3. If X is a continuous standard Brownian motion, then

P( limsup_{t→∞} X_t/√(2t log log t) = 1 ) = 1 ,

P( liminf_{t→∞} X_t/√(2t log log t) = −1 ) = 1 .

Proof. Define a new process Y by

Y_t = t X_{1/t} for t > 0 , Y_0 = 0 .

Then t ↦ Y_t is continuous on the open interval (0, ∞) and for arbitrary n and 0 < t1 < · · · < tn it is clear that (Y_{t1}, . . . , Y_{tn}) follows an n-dimensional normal distribution. Since Y_0 = X_0 = 0 and we for 0 < s < t have E Y_t = 0 while (recall the finite-dimensional distributions of the Brownian motion)

Cov(Y_s, Y_t) = st Cov(X_{1/s}, X_{1/t}) = st · (1/t) = s ,

it follows that Y and X have the same finite-dimensional distributions. In particular we therefore have

P( lim_{q→0, q∈Q} Y_q = 0 ) = P( lim_{q→0, q∈Q} X_q = 0 ) = 1 ,

and with Y continuous on (0, ∞) we see that Y becomes continuous on [0, ∞). But then the continuous process Y has the same distribution as X, and thus, Y is a continuous, normalized Brownian motion. Theorem 6.4.1 applied to Y then shows us that for instance

limsup_{s→0} s X_{1/s} / √( 2s log log(1/s) ) = 1 a.s.

If the s here is replaced by 1/t we obtain

limsup_{t→∞} X_t/√( 2t log log t ) = 1 a.s.

as we wanted.
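The covariance computation behind the time-inversion argument is simple arithmetic and can be checked directly (a sketch; the function names are hypothetical, and cov_X encodes Cov(X_a, X_b) = min(a, b) for a normalized Brownian motion):

```python
def cov_X(a, b):
    # covariance function of a normalized Brownian motion
    return min(a, b)

def cov_Y(s, t):
    # Cov(Y_s, Y_t) for Y_t = t * X_{1/t}, by bilinearity of covariance
    return s * t * cov_X(1.0 / s, 1.0 / t)

# the inverted process has the same covariance function min(s, t)
for s, t in [(0.5, 2.0), (1.0, 3.0), (0.25, 0.75)]:
    assert abs(cov_Y(s, t) - min(s, t)) < 1e-12
```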

From Theorem 6.4.3 it easily follows, by an argument similar to the one we used in the proof
of Corollary 6.4.2, that for t → ∞ a standard Brownian motion will cross any given level
x ∈ R infinitely many times.

Corollary 6.4.4. If X is a continuous, normalized Brownian motion it holds for almost


every ω that for all T > 0 and all x ∈ R, Xt (ω) = x for infinitely many values of t ∈ [T, ∞).

6.5 Exercises

Exercise 6.1. Let X = (X_t)_{t≥0} be a Brownian motion with drift ξ ∈ R and variance σ² > 0. Define for each t ≥ 0

X̃_t = (X_t − ξt)/σ .

Show that X̃ = (X̃_t)_{t≥0} is a normalised Brownian motion. ◦

Exercise 6.2. Assume that (Ω, F, P ) is a probability space, and assume that D1 , D2 ⊆ F
both are ∩–stable collections of sets. Assume that D1 and D2 are independent, that is

P (D1 ∩ D2 ) = P (D1 )P (D2 ) for all D1 ∈ D1 , D2 ∈ D2

Show that σ(D1 ) and σ(D2 ) are independent:

P (D1 ∩ D2 ) = P (D1 )P (D2 ) for all D1 ∈ σ(D1 ), D2 ∈ σ(D2 ) (6.20)

Exercise 6.3. Let X = (Xt )t≥0 be a Brownian motion with drift ξ and variance σ 2 > 0.
Define for each t > 0 the σ–algebra

Ft = F(Xs : 0 ≤ s ≤ t)

Show that Ft is independent of σ(Xu − Xt ), where u > t.

You can use without argument that (similarly to the arguments in the beginning of Section 6.1) (X_s)_{0≤s≤t} has values in (R[0,t], B[0,t]), where the σ-algebra B[0,t] is generated by

G = { ( (X̂_{t1}, . . . , X̂_{tn}) ∈ B_n ) : n ∈ N, 0 ≤ t1 < · · · < tn ≤ t, B_n ∈ B^n } .

Then F_t must be generated by the preimages of these sets:

D = { ((X_s)_{0≤s≤t})^{−1}(G) : G ∈ G }
= { ( (X_{t1}, . . . , X_{tn}) ∈ B_n ) : n ∈ N, 0 ≤ t1 < · · · < tn ≤ t, B_n ∈ B^n } .


Exercise 6.4. Assume that X = (X_t)_{t≥0} is a Brownian motion with drift ξ ∈ R and variance σ² > 0. Define F_t as in Exercise 6.3. Show that X has the following Markov property for t ≥ s:

E(X_t | F_s) = E(X_t | X_s) a.s.

Exercise 6.5. Assume that X = (Xt )t≥0 is a normalised Brownian motion and define Ft as
in Exercise 6.3 for each t ≥ 0. Show that

(1) Xt is Ft measurable for all t ≥ 0.

(2) E|Xt | < ∞ for all t ≥ 0

(3) E(Xt |Fs ) = Xs a.s. for all 0 ≤ s < t.

We say that (Xt , Ft )t≥0 is a martingale in continuous time. ◦

Exercise 6.6. Let X = (X_t)_{t≥0} be a normalised Brownian motion. Let T > 0 be fixed and define the process B^T = (B^T_t)_{0≤t≤T} by

B^T_t = X_t − (t/T) X_T .
The process B T is called a Brownian bridge on [0, T ].

1) Show that for all 0 < t1 < · · · < tn < T,

(B^T_{t1}, . . . , B^T_{tn})

is n-dimensionally normally distributed, and find, for 0 < s < t < T,

E B^T_s and Cov(B^T_s, B^T_t) .

2) Show that for all T > 0,

(B^T_{Tt})_{0≤t≤1} =D ( √T B^1_t )_{0≤t≤1} .

Exercise 6.7. A stochastic process X = (X_t)_{t≥0} is self-similar if there exists H > 0 such that

(X_{γt})_{t≥0} =D (γ^H X_t)_{t≥0} for all γ > 0 .

Intuitively, this means that if we "zoom in" on the process, then it looks like a scaled version of the original process.

(1) Assume that X = (Xt )t≥0 is a Brownian motion with drift 0 and variance σ 2 . Show
that X is self–similar and find the parameter H.

Now assume that X = (X_t)_{t≥0} is a stochastic process (defined on (Ω, F, P)) that is self-similar with parameter 0 < H < 1. Assume that X has stationary increments:

X_t − X_s =D X_{t−s} for all 0 ≤ s ≤ t .

Assume furthermore that P(X_0 = 0) = 1 and P(X_1 = 0) = 0.

(2) Show that for all 0 ≤ s < t

(X_t − X_s)/(t − s) =D (t − s)^{H−1} X_1 .

(3) Show that P (Xt = 0) = 0 for all t > 0.

(4) Show that X is continuous in probability.

Exercise 6.8. Let Y be a normalised Brownian motion with continuous sample paths.
Hence Y can be considered as a random variable with values in (C[0,∞) , C[0,∞) ). Define for
all n, M ∈ N the set

C_{n,M} = { x ∈ C[0,∞) : sup_{t∈[n,n+1]} |x_t|/t > 1/M } .

(1) Show that Cn,M ∈ C[0,∞) .

(2) Show that

P(Y ∈ C_{n,M}) ≤ 2 P( |Y_{n+1}| > n/M ) .

(3) Show that

P( |Y_{n+1}| > n/M ) ≤ 3(n+1)² M⁴ / n⁴

and conclude that

Σ_{n=1}^{∞} P(Y ∈ C_{n,M}) < ∞ .

(4) Show that for all M ∈ N

P( sup_{t∈[n,n+1]} |Y_t|/t > 1/M i.o. ) = 0 .

(5) Show that

P( ⋂_{M=1}^{∞} ( sup_{t∈[n,n+1]} |Y_t|/t ≤ 1/M evt. ) ) = 1 .

(6) Show that

Y_t/t → 0 a.s. as t → ∞ .
t


Chapter 7

Further reading

In this final chapter, we reflect on the theory presented in the previous chapters and give
recommendations for further reading.

The material covered in Chapter 1 can be found scattered in many textbooks on probability
theory, such as Breiman (1968), Loève (1977a), Kallenberg (2002) and Rogers & Williams
(2000a).

Abstract results in ergodic theory as presented in Chapter 2 can be found in Breiman (1968)
and Loève (1977b). A major application of ergodic theory is to the theory of the class
of stochastic processes known as Markov processes. In Markov process theory, stationary
processes are frequently encountered, and thus the theory presents an opportunity for utilizing
the ergodic theorem for stationary processes. Basic introductions to Markov processes in both
discrete and continuous time can be found in Norris (1999) and Brémaud (1999). In Meyn
& Tweedie (2009), a more general theory is presented, which includes a series of results on
ergodic Markov processes.

In Chapter 3, we dealt with weak convergence. In its most general form, weak convergence
of probability measures can be cast in the context of probability measures on complete,
separable metric spaces, where the metric space considered is endowed with the Borel-σ-
algebra generated by the open sets. A classical exposition of this theory is found in Billingsley
(1999), with Parthasarathy (1967) also being a useful resource.

Being one of the cornerstones of modern probability, the discrete-time martingale theory of
Chapter 5 can be found in many textbooks. A classical source is Rogers & Williams (2000a).

The results on Brownian motion in Chapter 6 represent an introduction to the problems
and results of the theory of continuous-time stochastic processes. This very large subject
encompasses many branches, prominent among them continuous-time martingale theory,
stochastic integration theory and continuous-time Markov process theory.
A good introduction to several of the major themes can be found in Rogers & Williams
(2000a) and Rogers & Williams (2000b). Karatzas & Shreve (1988) focuses on martingales
with continuous paths and the theory of stochastic integration. A solid introduction to
continuous-time Markov processes is Ethier & Kurtz (1986).
Appendix A

Supplementary material

In this chapter, we outline results which are either assumed to be well-known, or which are
of such auxiliary nature as to merit separation from the main text.

A.1 Limes superior and limes inferior

In this section, we recall some basic results on the supremum and infimum of a set in the
extended real numbers, as well as the limes superior and limes inferior of a sequence in R.
By R∗ , we denote the set R ∪ {−∞, ∞}, and endow R∗ with its natural ordering, in the sense
that −∞ < x < ∞ for all x ∈ R. We refer to R∗ as the extended real numbers. In general,
working with R∗ instead of merely R is useful, although somewhat technically inconvenient
from a formal point of view.

Definition A.1.1. Let A ⊆ R∗ . We say that y ∈ R∗ is an upper bound for A if it holds for
all x ∈ A that x ≤ y. Likewise, we say that y ∈ R∗ is a lower bound for A if it holds for all
x ∈ A that y ≤ x.

Theorem A.1.2. Let A ⊆ R∗ . There exists a unique element sup A ∈ R∗ characterized by


that sup A is an upper bound for A, and for any upper bound y for A, sup A ≤ y. Likewise,
there exists a unique element inf A ∈ R∗ characterized by that inf A is a lower bound for A,
and for any lower bound y for A, y ≤ inf A.

Proof. See Theorem C.3 of Hansen (2009).

The elements sup A and inf A whose existence and uniqueness are stated in Theorem A.1.2
are known as the supremum and infimum of A, respectively, or as the least upper bound and
greatest lower bound of A, respectively.

In general, the formalities regarding the distinction between R and R∗ are necessary to keep
in mind when concerned with formal proofs, however, in practice, the supremum and infimum
of a set in R∗ is what one expects it to be: For example, the supremum of A ⊆ R∗ is infinity
precisely if A contains “arbitrarily large elements”, otherwise it is the “upper endpoint” of
the set, and similarly for the infimum.

The following yields useful characterisations of the supremum and infimum of a set when the
supremum and infimum is finite.

Lemma A.1.3. Let A ⊆ R∗ and let y ∈ R. Then y is the supremum of A if and only if the
following two properties hold:

(1). y is an upper bound for A.

(2). For each ε > 0, there exists x ∈ A such that y − ε < x.

Likewise, y is the infimum of A if and only if the following two properties hold:

(1). y is a lower bound for A.

(2). For each ε > 0, there exists x ∈ A such that x < y + ε.

Proof. We just prove the result on the supremum. Assume that y is the supremum of A. By
definition, y is then an upper bound for A. Let ε > 0. If y − ε were an upper bound for A,
we would have y ≤ y − ε, a contradiction. Therefore, y − ε is not an upper bound for A, and
so there exists x ∈ A such that y − ε < x. This proves that the two properties are necessary
for y to be the supremum of A.

To prove the converse, assume that the two properties hold, we wish to show that y is the
supremum of A. By our assumptions, y is an upper bound for A, so it suffices to show that
for any upper bound z ∈ R∗ , we have y ≤ z. To obtain this, note that by the second of our

assumptions, A is nonempty. Therefore, −∞ is not an upper bound for A. Thus, it suffices


to consider an upper bound z ∈ R and prove that y ≤ z. Letting z be such an upper bound,
assume that z < y and put ε = y − z. There then exists x ∈ A such that z = y − ε < x. This
shows that z is not an upper bound for A, a contradiction. We conclude that for any upper
bound z of A, it must hold that y ≤ z. Therefore, y is the supremum of A, as desired.

We also have the following useful results.

Lemma A.1.4. Let A, B ⊆ R∗ . If A ⊆ B, then sup A ≤ sup B and inf B ≤ inf A.

Proof. See Lemma C.4 of Hansen (2009).

Lemma A.1.5. Let A ⊆ R∗ and assume that A is nonempty. Then inf A ≤ sup A.

Proof. See Lemma C.5 of Hansen (2009).

Lemma A.1.6. Let A ⊆ R∗ . Put −A = {−x | x ∈ A}. Then − sup A = inf(−A) and
− inf A = sup(−A).

Proof. See p. 4 of Carothers (2000).

A particular result which will be of occasional use to us is the following.

Lemma A.1.7. Let A ⊆ R∗ , and let y ∈ R. Then sup A > y if and only if there exists x ∈ A
with x > y. Analogously, inf A < y if and only if there exists x ∈ A with x < y.

Proof. We prove the result on the supremum. Assume that sup A > y. If sup A is infinite, A
is not bounded from above, and so there exist arbitrarily large elements in A, in particular
there exists x ∈ A with x > y. If sup A is finite, Lemma A.1.3 shows that with ε = sup A − y,
there exists x ∈ A such that y = sup A − ε < x. This proves that if sup A > y, there exists
x ∈ A with x > y. Conversely, if there is x ∈ A with x > y, we also obtain y < x ≤ sup A,
since sup A is an upper bound for A. This proves the other implication.

Note that the result of Lemma A.1.7 is false if the strict inequalities are exchanged with
weak inequalities. For example, sup[0, 1) ≥ 1, but there is no x ∈ [0, 1) with x ≥ 1. Next, we turn
our attention to sequences.

Definition A.1.8. Let (xn ) be a sequence in R. We define

lim sup_{n→∞} xn = inf_{n≥1} sup_{k≥n} xk ,

lim inf_{n→∞} xn = sup_{n≥1} inf_{k≥n} xk ,

and refer to lim supn→∞ xn and lim inf n→∞ xn as the limes superior and limes inferior of
(xn ), respectively.

The limes superior and limes inferior are useful tools for working with sequences and in
particular for proving convergence.

Lemma A.1.9. Let (xn ) be a sequence in R. Then lim inf n→∞ xn ≤ lim supn→∞ xn .

Proof. See Lemma C.11 of Hansen (2009).

Theorem A.1.10. Let (xn ) be a sequence in R, and let c ∈ R∗ . Then xn converges to c if and only
if lim inf n→∞ xn = lim supn→∞ xn = c. In particular, (xn ) is convergent to a finite limit if
and only if the limes inferior and limes superior are finite and equal, and in the affirmative,
the limit is equal to the common value of the limes inferior and the limes superior.

Proof. See Theorem C.15 and Theorem C.16 of Hansen (2009).

Corollary A.1.11. Let (xn ) be a sequence of nonnegative numbers. Then xn converges to


zero if and only if lim supn→∞ xn = 0.

Proof. By Theorem A.1.10, it holds that lim supn→∞ xn = 0 if xn converges to zero. Con-
versely, assume that lim supn→∞ xn = 0. As zero is a lower bound for (xn ), we find
0 ≤ lim inf n→∞ xn ≤ lim supn→∞ xn = 0, so Theorem A.1.10 shows that xn converges
to zero.
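The definitions above can be made concrete numerically. The following Python sketch, an illustration not belonging to the text (the sequence and the truncation point are arbitrary choices), approximates the limes superior and limes inferior of xn = (−1)^n (1 + 1/n) by tail suprema and infima:

```python
# x_n = (-1)^n (1 + 1/n) for n = 1, 2, ...: oscillates between roughly -1 and 1.
x = [(-1) ** n * (1 + 1 / n) for n in range(1, 10001)]

# sup_{k >= n} x_k and inf_{k >= n} x_k for a few tail indices n.
tail_sups = [max(x[n:]) for n in (0, 10, 100, 1000)]
tail_infs = [min(x[n:]) for n in (0, 10, 100, 1000)]

# The tail suprema decrease towards 1 and the tail infima increase towards -1,
# so lim sup x_n = 1 and lim inf x_n = -1; as these differ, (x_n) diverges.
print(tail_sups, tail_infs)
```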

We will often use Corollary A.1.11 to show various kinds of convergence results. We also
have the following useful results for the practical manipulation of expressions involving the
limes superior and limes inferior.

Lemma A.1.12. Let (xn ) and (yn ) be sequences in R. Given that all the sums are well-
defined, the following holds.

lim inf_{n→∞} xn + lim inf_{n→∞} yn ≤ lim inf_{n→∞} (xn + yn ).    (A.1)

lim sup_{n→∞} (xn + yn ) ≤ lim sup_{n→∞} xn + lim sup_{n→∞} yn .    (A.2)

Furthermore, if (yn ) is convergent with limit in R, it holds that

lim inf_{n→∞} (xn + yn ) = lim inf_{n→∞} xn + lim_{n→∞} yn .    (A.3)

lim sup_{n→∞} (xn + yn ) = lim sup_{n→∞} xn + lim_{n→∞} yn .    (A.4)

Also, we always have

− lim sup_{n→∞} xn = lim inf_{n→∞} (−xn ).    (A.5)

− lim inf_{n→∞} xn = lim sup_{n→∞} (−xn ).    (A.6)

If xn ≤ yn , it holds that

lim inf_{n→∞} xn ≤ lim inf_{n→∞} yn .    (A.7)

lim sup_{n→∞} xn ≤ lim sup_{n→∞} yn .    (A.8)

Proof. The relationships in (A.1) and (A.2) are proved in Lemma C.14 of Hansen (2009).
Considering (A.3), let y be the limit of (yn ), let ε > 0, and let m ≥ 1 be so large that
y − ε ≤ yn ≤ y + ε for n ≥ m. For n ≥ m, we then have

(inf_{k≥n} xk ) + y − ε = inf_{k≥n} (xk + y − ε) ≤ inf_{k≥n} (xk + yk ) ≤ inf_{k≥n} (xk + y + ε) = (inf_{k≥n} xk ) + y + ε,

yielding lim inf n→∞ xn + y − ε ≤ lim inf n→∞ (xn + yn ) ≤ lim inf n→∞ xn + y + ε. As ε > 0
was arbitrary, this yields (A.3). By a similar argument, we obtain (A.4). Furthermore, (A.5)
and (A.6) follow from Lemma A.1.6. The relationships (A.7) and (A.8) are proved in Lemma
C.12 of Hansen (2009).
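A small numerical illustration, not part of the text, of why the inequality (A.2) can be strict: with xn = (−1)^n and yn = −xn , the sum vanishes identically while both limes superiors equal 1.

```python
# x_n = (-1)^n and y_n = -x_n: lim sup x_n = lim sup y_n = 1, but x_n + y_n = 0.
N = 1000
x = [(-1) ** n for n in range(1, N + 1)]
y = [-v for v in x]
s = [a + b for a, b in zip(x, y)]

# Tail maxima over the second half approximate the limes superior.
limsup_x = max(x[N // 2:])
limsup_y = max(y[N // 2:])
limsup_s = max(s[N // 2:])

# lim sup (x_n + y_n) = 0 < 2 = lim sup x_n + lim sup y_n.
print(limsup_s, limsup_x + limsup_y)
```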

A.2 Measure theory and real analysis

In this section, we recall some of the main results from measure theory and real analysis
which will be needed in the following. We first recall some results from basic measure theory,
see Hansen (2009) for a general exposition.

Definition A.2.1. Let E be a set. Let E be a collection of subsets of E. We say that E is a


σ-algebra on E if it holds that E ∈ E, that if A ∈ E, then Ac ∈ E, and that if (An )n≥1 ⊆ E,
then ∪_{n=1}^∞ An ∈ E.

We say that a pair (E, E), where E is some set and E is a σ-algebra on E, is a measurable
space. Also, if H is some set of subsets of E, we define σ(H) to be the smallest σ-algebra
containing H, meaning that σ(H) is the intersection of all σ-algebras on E containing H. For a
σ-algebra E on E and a family H of subsets of E, we say that H is a generating family for
E if E = σ(H). One particular example of this is the Borel σ-algebra BA on A ⊆ Rn , which
is the smallest σ-algebra on A containing all open sets in A. In particular, we denote by Bn
the Borel σ-algebra on Rn .

If it holds for all A, B ∈ H that A ∩ B ∈ H, we say that H is stable under finite intersections.
Also, if D is a family of subsets of E, we say that D is a Dynkin class if it satisfies the
following requirements: E ∈ D, if A, B ∈ D with A ⊆ B then B \ A ∈ D, and if (An ) ⊆ D
with An ⊆ An+1 for all n ≥ 1, then ∪_{n=1}^∞ An ∈ D. We have the following useful result.

Lemma A.2.2 (Dynkin’s lemma). Let D be a Dynkin class on E, and let H be a set of
subsets of E which is stable under finite intersections. If H ⊆ D, then σ(H) ⊆ D.

Proof. See Theorem 3.6 of Hansen (2009), or Theorem 4.1.2 of Ash (1972).

Definition A.2.3. Let (E, E) be a measurable space. We say that a function µ : E → [0, ∞]
is a measure, if it holds that µ(∅) = 0 and that whenever (An ) ⊆ E is a sequence of pairwise
disjoint sets, µ(∪_{n=1}^∞ An ) = ∑_{n=1}^∞ µ(An ).

We say that a triple (E, E, µ) is a measure space. Also, if there exists an increasing sequence
of sets (En ) ⊆ E with E = ∪_{n=1}^∞ En and such that µ(En ) is finite for each n, we say that µ is σ-finite
and refer to (E, E, µ) as a σ-finite measure space. If µ(E) is finite, we say that µ is finite, and
if µ(E) = 1, we say that µ is a probability measure. In the latter case, we refer to (E, E, µ)
as a probability space. An important application of Lemma A.2.2 is the following.

Theorem A.2.4 (Uniqueness theorem for probability measures). Let P and Q be two prob-
ability measures on (E, E). Let H be a generating family for E which is stable under finite
intersections. If P (A) = Q(A) for all A ∈ H, then P (A) = Q(A) for all A ∈ E.

Proof. See Theorem 3.7 in Hansen (2009).



Next, we consider measurable mappings.

Definition A.2.5. Let (E, E) and (F, F) be two measurable spaces. Let f : E → F be some
mapping. We say that f is E-F measurable if f −1 (A) ∈ E whenever A ∈ F.

For a family of mappings (fi )i∈I from E to Fi , where (Fi , Fi ) is some measurable space,
we may introduce σ((fi )i∈I ) as the smallest σ-algebra E on E such that all the fi are E-Fi
measurable. Formally, E is the σ-algebra generated by {(fi ∈ A) | i ∈ I, A ∈ Fi }. For
measurability with respect to such σ-algebras, we have the following very useful lemma.

Lemma A.2.6. Let E be a set, let (fi )i∈I be a family of mappings from E to Fi , where
(Fi , Fi ) is some measurable space, and let E = σ((fi )i∈I ). Let (H, H) be some other mea-
surable space, and let g : H → E. Then g is H-E measurable if and only if fi ◦ g is H-Fi
measurable for all i ∈ I.

Proof. See Lemma 4.14 of Hansen (2009) for a proof in the case of a single variable.

If f : E → R is E-B measurable, we say that f is Borel measurable. In the context of


probability spaces, we refer to Borel measurable mappings as random variables. For any
measure space (E, E, µ) and any Borel measurable mapping f : E → [0, ∞], the integral
∫ f dµ is well-defined as the supremum of the explicitly constructed integrals of an appropriate
class of simpler mappings. If instead we consider some f : E → R, the integral ∫ f dµ is
well-defined as the difference between the integrals of the positive and negative parts of f
whenever ∫ |f | dµ is finite. The integral has the following important properties.

Theorem A.2.7 (The monotone convergence theorem). Let (E, E, µ) be a measure space,
and let (fn ) be a sequence of measurable mappings fn : E → [0, ∞]. Assume that the sequence
(fn ) is increasing µ-almost everywhere. Then
lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.

Proof. See Theorem 6.12 in Hansen (2009).

Lemma A.2.8 (Fatou’s lemma). Let (E, E, µ) be a measure space, and let (fn ) be a sequence
of measurable mappings fn : E → [0, ∞]. It holds that
∫ lim inf_{n→∞} fn dµ ≤ lim inf_{n→∞} ∫ fn dµ.

Proof. See Lemma 6.25 in Hansen (2009).

Theorem A.2.9 (The dominated convergence theorem). Let (E, E, µ) be a measure space,
and let (fn ) be a sequence of measurable mappings from E to R. Assume that the sequence
(fn ) converges µ-almost everywhere to some mapping f . Assume that there exists a measur-
able, integrable mapping g : E → [0, ∞) such that |fn | ≤ g µ-almost everywhere for all n.
Then fn is integrable for all n ≥ 1, f is measurable and integrable, and
lim_{n→∞} ∫ fn dµ = ∫ lim_{n→∞} fn dµ.

Proof. See Theorem 7.6 in Hansen (2009).
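As a numerical illustration (not part of the text), consider fn (x) = x^n on [0, 1] with Lebesgue measure: the constant function 1 is an integrable dominating function and fn → 0 almost everywhere, so the integrals must tend to zero. The midpoint rule below is an arbitrary choice of numerical quadrature.

```python
import numpy as np

dx = 1.0 / 100000
x = (np.arange(100000) + 0.5) * dx  # midpoints of a uniform partition of [0, 1]

def integral(n):
    # Midpoint-rule approximation of the integral of x^n over [0, 1].
    return float(np.sum(x ** n) * dx)

values = [integral(n) for n in (1, 10, 100, 1000)]
# Exact values are 1/(n + 1), so the sequence of integrals tends to zero.
print(values)
```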

For the next result, recall that for two σ-finite measure spaces (E, E, µ) and (F, F, ν), E ⊗ F
denotes the σ-algebra on E × F generated by {A × B | A ∈ E, B ∈ F}, and µ ⊗ ν denotes
the unique σ-finite measure such that (µ ⊗ ν)(A × B) = µ(A)ν(B) for A ∈ E and B ∈ F, see
Chapter 9 of Hansen (2009).

Theorem A.2.10 (Tonelli’s theorem). Let (E, E, µ) and (F, F, ν) be two σ-finite measure
spaces, and assume that f is nonnegative and E ⊗ F measurable. Then
∫ f (x, y) d(µ ⊗ ν)(x, y) = ∫ ∫ f (x, y) dν(y) dµ(x).

Proof. See Theorem 9.4 of Hansen (2009).

Theorem A.2.11 (Fubini’s theorem). Let (E, E, µ) and (F, F, ν) be two σ-finite measure
spaces, and assume that f is E ⊗ F measurable and µ ⊗ ν integrable. Then y 7→ f (x, y) is
integrable with respect to ν for µ-almost all x, the set where this is the case is measurable,
and it holds that
∫ f (x, y) d(µ ⊗ ν)(x, y) = ∫ ∫ f (x, y) dν(y) dµ(x).

Proof. See Theorem 9.10 of Hansen (2009).
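For counting measures on finite sets, Tonelli's and Fubini's theorems reduce to the statement that a double sum of nonnegative, or absolutely summable, terms may be computed in either order. A quick numerical check, included as an illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)
f = rng.normal(size=(50, 80))  # f(x, y) on a product of two finite sets

# Integral against the product of the counting measures, in three equal ways.
total = float(f.sum())
rows_first = float(f.sum(axis=1).sum())   # integrate over y first, then x
cols_first = float(f.sum(axis=0).sum())   # integrate over x first, then y

print(total, rows_first, cols_first)
```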

Theorem A.2.12 (Jensen’s inequality). Let (E, E, µ) be a probability space. Let X : E → R


be a Borel mapping. Let f : R → R be another Borel mapping. Assume that X and f (X)
are integrable and that f is convex. Then f (∫ X dµ) ≤ ∫ f (X) dµ.

Proof. See Theorem 16.31 in Hansen (2009).

Theorem A.2.7, Lemma A.2.8 and Theorem A.2.9 are the three main tools for working with
integrals. Theorem A.2.12 is frequently useful as well, and can in purely probabilistic terms
be stated as the result that f (EX) ≤ Ef (X) when f is convex.

Also, for measurable spaces (E, E) and (F, F), µ a measure on (E, E) and t : E → F an E-F
measurable mapping, we define the image measure t(µ) as the measure on (F, F) given by
putting t(µ)(A) = µ(t−1 (A)) for A ∈ F. We then have the following theorem on successive
transformations.

Theorem A.2.13. Let (E, E), (F, F) and (G, G) be measurable spaces. Let µ be a measure
on (E, E). Let t : E → F and s : F → G be measurable. Then s(t(µ)) = (s ◦ t)(µ).

Proof. See Theorem 10.2 of Hansen (2009).

The following abstract change-of-variable formula also holds.

Theorem A.2.14. Let (E, E, µ) be a measure space and let (F, F) be some measurable
space. Let t : E → F be measurable, and let f : F → R be Borel measurable. Then f
is t(µ)-integrable if and only if f ◦ t is µ-integrable, and in the affirmative, it holds that
∫ f dt(µ) = ∫ f ◦ t dµ.

Proof. See Corollary 10.9 of Hansen (2009).

Next, we recall some results on Lp spaces.

Definition A.2.15. Let (E, E, µ) be a measure space, and let p ≥ 1. By Lp (E, E, µ), we
denote the set of measurable mappings f : E → R such that ∫ |f |^p dµ is finite.

We endow Lp (E, E, µ) with the norm k · kp given by kf kp = (∫ |f |^p dµ)^(1/p) . That Lp (E, E, µ)
is a vector space and that k · kp is a seminorm on this space is a consequence of the Minkowski
inequality, see Theorem 2.4.7 of Ash (1972). We refer to Lp (E, E, µ) as an Lp -space. For
Lp -spaces, the following two main results hold.

Theorem A.2.16 (Hölder’s inequality). Let p > 1 and let q be the dual exponent to p,
meaning that q > 1 is uniquely determined as the solution to the equation 1/p + 1/q = 1. If
f ∈ Lp (E, E, µ) and g ∈ Lq (E, E, µ), it holds that f g ∈ L1 (E, E, µ), and kf gk1 ≤ kf kp kgkq .

Proof. See Theorem 2.4.5 of Ash (1972).
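As an illustration, not part of the text, Hölder's inequality can be checked numerically for the counting measure on a finite set, where the integrals are finite sums; the exponents p = 3 and q = 3/2 are an arbitrary dual pair.

```python
import numpy as np

rng = np.random.default_rng(7)
f = rng.normal(size=1000)
g = rng.normal(size=1000)

p, q = 3.0, 1.5  # dual exponents: 1/3 + 1/1.5 = 1

norm_fg = float(np.sum(np.abs(f * g)))                 # ||fg||_1
norm_f = float(np.sum(np.abs(f) ** p) ** (1 / p))      # ||f||_p
norm_g = float(np.sum(np.abs(g) ** q) ** (1 / q))      # ||g||_q

# Hoelder's inequality: ||fg||_1 <= ||f||_p ||g||_q.
print(norm_fg, norm_f * norm_g)
```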

Theorem A.2.17 (The Riesz-Fischer completeness theorem). Let p ≥ 1. The seminormed


vector space Lp (E, E, µ) is complete.

Proof. See Theorem 2.4.11 of Ash (1972).

Following these results, we recall a simple lemma which we make use of in the proof of the
law of large numbers.

Lemma A.2.18. Let (xn ) be some sequence in R, and let x be some element of R. If
limn→∞ xn = x, then limn→∞ (1/n) ∑_{k=1}^n xk = x as well.

Proof. See Lemma 15.5 of Carothers (2000).
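Numerically, and purely as an illustration, the Cesàro means of a convergent sequence track its limit, if typically more slowly; here xn = 1 + 1/n is an arbitrary choice.

```python
# x_n = 1 + 1/n converges to 1; its Cesaro means converge to 1 as well.
N = 100000
x = [1 + 1 / n for n in range(1, N + 1)]

partial = 0.0
cesaro = []
for n, v in enumerate(x, start=1):
    partial += v
    cesaro.append(partial / n)

# Both the sequence and its running averages end up close to the limit 1.
print(x[-1], cesaro[-1])
```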

Also, we recall some properties of the integer part function. For any x ∈ R, we define
[x] = sup{n ∈ Z | n ≤ x}.

Lemma A.2.19. It holds that [x] is the unique integer such that [x] ≤ x < [x] + 1, or
equivalently, the unique integer such that x − 1 < [x] ≤ x.

Proof. We first show that [x] satisfies the bounds [x] ≤ x < [x] + 1. As x is an upper bound
for the set {n ∈ Z | n ≤ x}, and [x] is the least upper bound, we obtain [x] ≤ x. On the
other hand, as [x] is an upper bound for {n ∈ Z | n ≤ x}, [x] + 1 cannot be an element of
this set, yielding x < [x] + 1.

This shows that [x] satisfies the bounds given. Now assume that m is an integer satisfying
m ≤ x < m + 1, we claim that m = [x]. As m ≤ x, we obtain m ≤ [x]. And as x < m + 1,
m + 1 is not in {n ∈ Z | n ≤ x}. In particular, for all n ≤ x, n < m + 1. As [x] ≤ x, this
yields [x] < m + 1, and so [x] ≤ m. We conclude that m = [x], as desired.

Lemma A.2.20. Let x ∈ R and let n ∈ Z. Then [x + n] = [x] + n.

Proof. As [x] ≤ x < [x] + 1 is equivalent to [x] + n ≤ x + n < [x] + n + 1, the characterization
of Lemma A.2.19 yields the result.
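The integer part function of the two preceding lemmas is the floor function; a quick check, included as an illustration only, of the characterization [x] ≤ x < [x] + 1 and the translation property [x + n] = [x] + n:

```python
import math

xs = [2.7, -2.7, 0.0, 5.0, -0.3]
for x in xs:
    fx = math.floor(x)  # the integer part [x]
    # Lemma A.2.19: [x] is the unique integer with [x] <= x < [x] + 1.
    assert fx <= x < fx + 1
    # Lemma A.2.20: [x + n] = [x] + n for every integer n.
    for n in (-3, 0, 4):
        assert math.floor(x + n) == fx + n

print(math.floor(2.7), math.floor(-2.7))
```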

Finally, we state Taylor’s theorem with Lagrange form of the remainder.

Theorem A.2.21. Let n ≥ 1 and assume that f is n times differentiable, and let x, y ∈ R.
It then holds that

f (y) = ∑_{k=0}^{n−1} (f^(k) (x)/k!) (y − x)^k + (f^(n) (ξ(x, y))/n!) (y − x)^n ,

where f (k) denotes the k’th derivative of f , with the convention that f (0) = f , and ξ(x, y) is
some element on the line segment between x and y.

Proof. See Apostol (1964) Theorem 7.6.
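For f = exp the remainder in Taylor's theorem can be bounded explicitly: since f^(n) = exp, the Lagrange form gives |f (y) − P_{n−1}(y)| ≤ e^max(x,y) |y − x|^n /n!. The following numerical check is an illustration only, with x = 0, y = 1 and n = 8 as arbitrary choices.

```python
import math

x, y, n = 0.0, 1.0, 8

# Taylor polynomial with n terms (degree n - 1) for exp around x, at y.
poly = sum(math.exp(x) * (y - x) ** k / math.factorial(k) for k in range(n))

error = abs(math.exp(y) - poly)
# Lagrange bound: |f^(n)(xi)| <= e^max(x, y) on the segment between x and y.
bound = math.exp(max(x, y)) * abs(y - x) ** n / math.factorial(n)

print(error, bound)
```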

A.3 Existence of sequences of random variables

In this section, we state a result which yield the existence of particular types of sequences of
random variables.

Theorem A.3.1 (Kolmogorov’s consistency theorem). Let (Qn )n≥1 be a sequence of prob-
ability measures such that Qn is a probability measure on (Rn , Bn ). For each n ≥ 2, let
πn : Rn → Rn−1 denote the projection onto the first n − 1 coordinates. Assume that
πn (Qn ) = Qn−1 for all n ≥ 2. There exists a probability space (Ω, F, P ) and a sequence
of random variables (Xn )n≥1 on (Ω, F, P ) such that for all n ≥ 1, (X1 , . . . , Xn ) has distribution Qn .

Proof. This follows from Theorem II.30.1 of Rogers & Williams (2000a).

Corollary A.3.2. Let (Qn )n≥1 be a sequence of probability measures on (R, B). There exists
a probability space (Ω, F, P ) and a sequence of random variables (Xn )n≥1 on (Ω, F, P ) such
that for all n ≥ 1, (X1 , . . . , Xn ) are independent, and Xn has distribution Qn .

Proof. This follows from applying Theorem A.3.1 with the sequence of probability measures
(Q1 ⊗ · · · ⊗ Qn )n≥1 .

From Corollary A.3.2, it follows for example that there exists a probability space (Ω, F, P )
and a sequence of independent random variables (Xn )n≥1 on (Ω, F, P ) such that Xn is
distributed on {0, 1} with P (Xn = 1) = pn , where (pn ) is some sequence in [0, 1]. Such sequences
are occasionally used as examples or counterexamples regarding certain propositions.
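Such a sequence is straightforward to realise in simulation; the Python sketch below is an illustration only, with pn = 1/n as an arbitrary choice of the sequence (pn ).

```python
import numpy as np

rng = np.random.default_rng(1)

N = 10000
p = 1.0 / np.arange(1, N + 1)          # p_n = 1/n, an arbitrary choice
X = (rng.random(N) < p).astype(int)    # independent, X_n = 1 with prob. p_n

# X_1 = 1 always (since p_1 = 1), and ones become rare as p_n decreases.
print(X[:10], X.sum())
```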

A.4 Exercises

Exercise A.1. Let A = {1 − n1 |n ≥ 1}. Find sup A and inf A. ◦

Exercise A.2. Let y ∈ R. Define A = {x ∈ Q | x < y}. Find sup A and inf A. ◦

Exercise A.3. Let (E, E, µ) be a measure space, and let (fn ) be a sequence of measurable
mappings fn : E → [0, ∞). Assume that there is g : E → [0, ∞) such that fn ≤ g for all n ≥ 1,
R R
where g is integrable with respect to µ. Show that lim supn→∞ fn dµ ≤ lim supn→∞ fn dµ.

Appendix B

Hints for exercises

B.1 Hints for Chapter 1

Hints for exercise 1.2. Consider the probability space (Ω, F, P ) = ([0, 1], B[0,1] , λ), where λ
denotes the Lebesgue measure on [0, 1]. As a counterexample, consider variables defined as
Xn = nλ(An )^(−1) 1_{An} for an appropriate sequence of intervals (An ). ◦

Hints for exercise 1.4. Show that the sequence (EXn )n≥1 diverges and use this to obtain the
result. ◦

Hints for exercise 1.5. Show that for any ω with P ({ω}) > 0, Xn (ω) converges to X(ω).
Obtain the desired result by noting that {ω | P ({ω}) > 0} is an almost sure set. ◦

Hints for exercise 1.6. Show that P (|Xn − X| ≥ ε) ≤ P (|Xn 1Fk − X1Fk | ≥ ε) + P (Fkc ), and
use this to obtain the result. ◦

Hints for exercise 1.7. To prove that limn→∞ P (|Xn − X| ≥ εk ) = 0 for all k ≥ 1 implies
Xn −→ X in probability, take ε > 0 and pick k such that 0 ≤ εk ≤ ε. ◦

Hints for exercise 1.8. Using that the sequence (supk≥n |Xk − X| > ε)n≥1 is decreasing, show
that limn→∞ P (supk≥n |Xk − X| > ε) = P (∩_{n=1}^∞ ∪_{k=n}^∞ (|Xk − X| > ε)). Use this to
prove the result. ◦

Hints for exercise 1.9. To obtain that d is a pseudometric, use that x 7→ x(1 + x)^(−1) is
increasing on [0, ∞). To show that Xn −→ X in probability implies convergence in d, prove
that for all ε > 0, it holds that d(Xn , X) ≤ P (|Xn − X| > ε) + ε/(1 + ε). In order to obtain
the converse, apply Lemma 1.2.7. ◦

Hints for exercise 1.10. Choose cn as a positive number such that P (|Xn | ≥ cn /n) ≤ 1/2^n . Use
Lemma 1.2.11 to show that this choice yields the desired result. ◦

Hints for exercise 1.11. To prove the first claim, apply Lemma 1.2.7 with p = 4. To prove
the second claim, apply Lemma 1.2.12. ◦

Hints for exercise 1.12. Use Lemma 1.2.13 and Fatou’s lemma to show that E|X|p is finite.
Apply Hölder’s inequality to obtain convergence in Lq for 1 ≤ q < p. ◦

Hints for exercise 1.13. Define S(n, m) = ∩_{k=n}^∞ (|Xk − Xn | ≤ 1/m) and argue that it suffices
for each ε > 0 to show that there exists F ∈ F with P (F c ) ≤ ε such that for all m ≥ 1, there
is n ≥ 1 with F ⊆ S(n, m). To obtain such a set F , consider a sequence (εm )m≥1 of positive
numbers with ∑_{m=1}^∞ εm ≤ ε and choose for each m an nm with P (S(nm , m)) ≥ 1 − εm . ◦

Hints for exercise 1.14. Apply Lemma 1.2.12 and Lemma 1.2.7. ◦

Hints for exercise 1.15. Use Lemma 1.2.13, also recalling that all sequences in R which are
monotone and bounded are convergent. ◦

Hints for exercise 1.16. First argue that (|Xn+1 − Xn | ≤ εn evt.) is an almost sure set. Use
this to show that almost surely, for n > m large enough, |Xn − Xm | ≤ ∑_{k=m}^∞ εk . Using that
∑_{k=m}^∞ εk tends to zero as m tends to infinity, conclude that (Xn ) is almost surely Cauchy. ◦

Hints for exercise 1.17. To prove almost sure convergence, calculate an explicit expression for
P (|Xn − 1| ≥ ε) and apply Lemma 1.2.12. To prove convergence in Lp , apply the dominated
convergence theorem. ◦

Hints for exercise 1.18. Apply Lemma 1.2.11. ◦

Hints for exercise 1.19. Use Lemma 1.3.12 to prove the contrapositive of the desired impli-
cation. ◦

Hints for exercise 1.20. To calculate P (Xn / log n > c i.o.), use the properties of the exponential
distribution to obtain an explicit expression for P (Xn / log n > c), then apply Lemma
1.3.12. To prove lim supn→∞ Xn / log n = 1 almost surely, note that for all c > 0, it holds
that lim supn→∞ Xn / log n ≤ c when Xn / log n ≤ c eventually and lim supn→∞ Xn / log n ≥ c
when Xn / log n > c infinitely often. ◦

Hints for exercise 1.21. Use that the sequence (∪_{k=n}^∞ (Xk ∈ B))n≥1 is decreasing to obtain
that (Xn ∈ B i.o.) is in J . Take complements to obtain the result on (Xn ∈ B evt.). ◦

Hints for exercise 1.22. Show that

n
! n
!
X X
lim an−k+1 Xk ∈ B = lim an−k+1 Xk ∈ B ,
n→∞ n→∞
k=1 k=m

and use this to obtain the result. ◦

Hints for exercise 1.23. For the result on convergence in probability, work directly from the
definition of convergence in probability and consider 0 < ε < 1 in this definition. For the
result on almost sure convergence, note that Xn converges to zero if and only if Xn is zero
eventually, and apply Lemma 1.3.12. ◦

Hints for exercise 1.24. Use the monotone convergence theorem. ◦

Hints for exercise 1.25. Use Theorem 1.3.10 to show that ∑_{k=1}^n ak Xk either is almost surely
divergent or almost surely convergent. To obtain the sufficient criterion for convergence,
apply Theorem 1.4.2. ◦

Hints for exercise 1.26. Let (Xn ) be a sequence of independent random variables concen-
trated on {0, n} with P (Xn = n) = pn . Use Lemma 1.3.12 to choose (pn ) so as to obtain the
result. ◦

Hints for exercise 1.27. Apply Theorem 1.4.3. ◦

Hints for exercise 1.28. Use Lemma 1.3.12 to conclude that P (|Xn | > n i.o.) = 0 if and
only if ∑_{n=1}^∞ P (|X1 | > n) is finite. Apply the monotone convergence theorem and Tonelli's
theorem to conclude that the latter is the case if and only if E|X1 | is finite. ◦

Hints for exercise 1.29. Apply Exercise 1.28 to show that E|X1 | is finite. Apply Theorem
1.5.3 to show that EX1 = c. ◦

B.2 Hints for Chapter 2

Hints for exercise 2.1. To show that T is measure preserving, find simple explicit expressions
for T (x) for 0 ≤ x < 1/2 and 1/2 ≤ x < 1, respectively, and use this to show the relationship
P (T −1 ([0, α))) = P ([0, α)) for 0 ≤ α ≤ 1. Apply Lemma 2.2.1 to obtain that T is P -measure
preserving. To show that S is measure preserving, first show that it suffices to consider the
case where 0 ≤ λ < 1. Fix 0 ≤ α ≤ 1. Prove that for α ≥ µ, S −1 ([0, α)) = [0, α−µ)∪[1−µ, 1),
and for α < µ, S −1 ([0, α)) = [1 − µ, 1 − µ + α). Use this and Lemma 2.2.1 to obtain the
result. ◦

Hints for exercise 2.2. Apply Lemma 2.2.1. To do so, find a simple explicit expression
for T (x) when 1/(n + 1) < x ≤ 1/n, and use this to calculate, for 0 ≤ α < 1, T −1 ([0, α)) and
subsequently P (T −1 ([0, α))). ◦

Hints for exercise 2.3. Assume, expecting a contradiction, that P is a probability measure
such that T is measure preserving for P . Show that this implies P ({0}) = 0 and that
P ((1/2^n , 2/2^n ]) = 0 for all n ≥ 1. Use this to obtain the desired contradiction. ◦

Hints for exercise 2.4. Let λ = n/m for n ∈ Z and m ∈ N. Show that T^m (x) = x in this
case. Fix 0 ≤ α ≤ 1 and put Fα = ∪_{k=0}^{m−1} T −k ([0, α]) and show that for α small and positive,
Fα is a set in the T -invariant σ-algebra which has a measure not equal to zero or one. ◦

Hints for exercise 2.5. Prove that ∫ X − X ◦ T dP = 0 and use this to obtain the result. ◦

Hints for exercise 2.6. Show that IT ⊆ IT 2 and use this to prove the result. ◦

Hints for exercise 2.7. Consider a space Ω containing only two points. ◦

Hints for exercise 2.8. For part two, note that ∪_{k=n}^∞ T −k (F ) ⊆ ∪_{k=0}^∞ T −k (F ) and use that T
is measure preserving. For part three, use that F ⊆ ∪_{k=0}^∞ T −k (F ). For part four, use that
F = (F ∩ (T^k ∈ F c evt.)) ∪ (F ∩ (T^k ∈ F c evt.)^c ). ◦

Hints for exercise 2.9. To show that the criterion is sufficient for T to be ergodic, use
Theorem 2.2.3. For the converse implication, assume that T is ergodic and use Theorem
2.2.3 to argue that the result holds when X and Y are indicators for sets in F. Consider
X = 1G and Y nonnegative and bounded and use linearity and approximation with simple
functions to obtain that the criterion also holds in this case. Use a similar argument to

obtain the criterion for general X and Y such that X is nonnegative and integrable and Y
is nonnegative and bounded. Use linearity to obtain the final extension to X integrable and
Y bounded. ◦

Hints for exercise 2.10. First use Lemma 2.2.6 to argue that it suffices to show that for
α, β ∈ [0, 1), limn→∞ P ([0, β) ∩ T −n ([0, α))) = P ([0, β))P ([0, α)). To do so, first show that
T^n (x) = 2^n x − [2^n x] and use this to obtain a simple explicit expression for T −n ([0, α)). Use
this to prove the desired result. ◦

Hints for exercise 2.11. For part one, use that the family {F1 × F2 | F1 ∈ F1 , F2 ∈ F2 } is
a generating family for F1 ⊗ F2 which is stable under finite intersections and apply Lemma
2.2.1. For part two, show that whenever F1 is T1 -invariant and F2 is T2 -invariant, F1 × F2 is
T -invariant, and use this to obtain the desired result. For part three, use that for F1 ∈ F1
and F2 ∈ F2 , it holds that P1 (F1 ) = P (F1 × Ω2 ) and P2 (F2 ) = P (Ω1 × F2 ). For part four,
use Lemma 2.2.6. ◦

Hints for exercise 2.12. Let A = (X̂n ∈ B i.o.) and note that (Xn ∈ B i.o.) = X −1 (A).
Show that A is θ-invariant to obtain the result. ◦

Hints for exercise 2.13. For B ∈ B∞ , express Z(P )(B) in terms of X(P )(B), Y (P )(B) and
p. Use this to obtain that θ is measure preserving for Z(P ). ◦

Hints for exercise 2.14. Assume that (Xn ) is stationary. Using that θ is X(P )-measure
preserving, argue that all Xn have the same distribution and conclude that EXn = EXk for
all n, k ≥ 1. Using a similar argument, argue that for all 1 ≤ n ≤ k, (Xn , Xk ) has the same
distribution as (X1 , Xk−(n−1) ) and conclude that Cov(Xn , Xk ) = Cov(X1 , Xk−(n−1) ). Use
this to conclude that (Xn ) is weakly stationary. ◦

Hints for exercise 2.15. Use Exercise 2.14 to argue that if (Xn ) is stationary, it is also weakly
stationary. To obtain the converse implication, assume that (Xn ) is stationary and argue that
for all n ≥ 1, (X2 , . . . , Xn+1 ) has the same distribution as (X1 , . . . , Xn ). Combine this with
the assumption that (Xn ) has Gaussian finite-dimensional distributions in order to obtain
stationarity. ◦

B.3 Hints for Chapter 3

Hints for exercise 3.1. First assume that (θn ) converges with limit θ. In the case where
θ > 0, apply Lemma 3.1.9 to obtain weak convergence. In the case where θ = 0, prove weak
convergence directly by proving convergence of ∫ f dµn for f ∈ Cb (R).

Next, assume that (µn ) is weakly convergent. Use Lemma 3.1.6 to argue that (θn ) is bounded.
Assume that (θn ) is not convergent, and argue that there must exist two subsequences (θnk )
and (θmk ) with different limits θ and θ∗ . Use what was already shown and Lemma 3.1.5 to
obtain a contradiction. ◦

Hints for exercise 3.2. To obtain weak convergence when the probabilities converge, apply
Lemma 3.1.9. To obtain the converse implication, use Lemma 3.1.3 to construct for each k a
mapping in Cb(R) which takes the value 1 at k and the value zero on, say, (k − 1, k + 1)c.
Use this mapping to obtain convergence of the probabilities. ◦

Hints for exercise 3.3. Apply Stirling's formula and the fact that limn→∞ (1 + x/n)^n = e^x
for all x ∈ R to prove that the densities fn converge pointwise to the density of the normal
distribution. Invoke Lemma 3.1.9 to obtain the desired result. ◦

Hints for exercise 3.4. Using Stirling's formula as well as the result that if (xn) is a sequence
converging to x, then limn→∞ (1 + xn/n)^n = e^x, prove that the probability functions converge
pointwise. Apply Lemma 3.1.9 to obtain the result. ◦

Hints for exercise 3.5. Apply Stirling’s formula and Lemma 3.1.9. ◦

Hints for exercise 3.6. Let Fn be the cumulative distribution function for µn . Using the
properties of cumulative distribution functions, show that for x ∈ R satisfying the inequalities
q(k/(n + 1)) < x < q((k + 1)/(n + 1)), with k ≤ n, it holds that |Fn (x) − F (x)| ≤ 2/(n + 1).
Also show that for x < q(1/(n + 1)) and x > q(n/(n + 1)), |Fn (x) − F (x)| ≤ 1/(n + 1). Then
apply Theorem 3.2.3 to obtain the result. ◦

Hints for exercise 3.7. First assume that (ξn ) and (σn ) converge to limits ξ and σ. In the
case where σ > 0, apply Lemma 3.1.9 to obtain weak convergence of µn . In the case where
σ = 0, use Theorem 3.2.3 to obtain weak convergence.

Next, assume that µn converges weakly. Use Lemma 3.1.6 to show that both (ξn ) and (σn )
are bounded. Then apply a proof by contradiction to show that ξn and σn both must be
convergent. ◦

Hints for exercise 3.8. Let ε > 0, and take n so large that |xn − x| ≤ ε. Use Lemma 3.2.1
and the monotonicity properties of cumulative distribution functions to obtain the set of
inequalities F (x − ε) ≤ lim inf n→∞ Fn (xn ) ≤ lim supn→∞ Fn (xn ) ≤ F (x + ε). Use this to
prove the desired result. ◦

Hints for exercise 3.9. Argue that with Fn denoting the cumulative distribution function for
µn , it holds that Fn (x) = 1 − (1 − 1/n)[nx] , where [nx] denotes the integer part of nx. Use
l’Hôpital’s rule to prove pointwise convergence of Fn (x) as n tends to infinity, and invoke
Theorem 3.2.3 to conclude that the desired result holds. ◦

Hints for exercise 3.10. Use the binomial theorem. ◦

Hints for exercise 3.11. Use the Taylor expansion of the exponential function. ◦

Hints for exercise 3.12. Use independence of X and Y to express the characteristic function
of XY as an integral with respect to µ ⊗ ν. Apply Fubini’s theorem to obtain the result. ◦
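
The Fubini step can be written out: with X ∼ µ and Y ∼ ν independent, the joint law of (X, Y) is µ ⊗ ν, so

```latex
\varphi_{XY}(t) = E\, e^{itXY}
  = \int\!\!\int e^{itxy} \, d\mu(x)\, d\nu(y)
  = \int \varphi_X(ty) \, d\nu(y),
```

where the inner integral is the characteristic function of X evaluated at the point ty.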

Hints for exercise 3.13. Argue that XY and −ZW are independent and follow the same
distribution. Use Lemma 3.4.15 to express the characteristic function of XY − ZW
in terms of the characteristic function of XY . Apply Exercise 3.12 and Example 3.4.10 to
obtain a closed expression for this characteristic function. Recognizing this expression as the
characteristic function for the Laplace distribution, use Theorem 3.4.19 to obtain the desired
distributional result. ◦

Hints for exercise 3.14. Define a triangular array by putting Xnk = (Xk − EXk)/√(V X1 + · · · + V Xn).
Apply Theorem 3.5.6 to show that Σ_{k=1}^n Xnk converges to the standard normal distribution,
and conclude the desired result from this. ◦

Hints for exercise 3.15. Fix ε > 0. Use independence and Lemma 1.3.12 to conclude that
Σ_{n=1}^∞ P(|Xn| > ε) converges. To argue that Σ_{n=1}^∞ V(Xn 1(|Xn|≤ε)) converges, assume that the series is
divergent. Put Yn = Xn 1(|Xn|≤ε) and Sn = Σ_{k=1}^n Yk. Use Exercise 3.15 to argue that Sn
converges almost surely, while (Sn − ESn)/√(V Sn) converges in distribution to the standard
normal distribution. Use Lemma 3.3.2 to conclude that ESn/√(V Sn) converges in distribution
to the standard normal distribution. Obtain a contradiction from this. For the convergence
of the final series, apply Theorem 1.4.2. ◦

Hints for exercise 3.16. Use Theorem 3.5.3 to obtain that, using the probability space
(Ω, F, Pλ), X̄n is asymptotically normal. Fix a differentiable mapping f : R → R, and
use Theorem 3.6.3 to show that f(X̄n) is asymptotically normal as well. Use the form of the
asymptotic parameters to obtain a requirement on f′ for the result of the exercise to hold.
Identify a function f satisfying the requirements from this. ◦

Hints for exercise 3.17. In the case α > 1/2, use Lemma 1.2.7 to obtain the desired con-
vergence in probability. In the case α ≤ 1/2, note by Theorem 3.5.3 that (Sn − nξ)/n^(1/2)
converges in distribution to the standard normal distribution. Fix ε > 0 and use Lemma 3.1.3
to obtain a mapping g ∈ Cb (R) such that 1(ξ−2ε,ξ+2ε)c (x) ≤ g(x) ≤ 1[ξ−ε,ξ+ε]c (x). Use this
to prove that lim inf n→∞ P (|Sn − nξ|/nα ≥ ε) is positive, and conclude that (Sn − nξ)/nα
does not converge in probability to zero. ◦

Hints for exercise 3.18. First calculate EXn² and EXn⁴. Use Theorem 3.5.3 to obtain that X̄n
is asymptotically normal. Apply Theorem 3.6.3 to obtain that θ̂n is asymptotically normal
as well. ◦

Hints for exercise 3.19. Use Theorem 3.5.3 and Theorem 3.6.3 to obtain the results on X̄n
and X̄n⁻¹. In order to show Yn →P 1/µ, calculate EYn and V Yn and use Lemma 1.2.7 to
prove the convergence. ◦

Hints for exercise 3.20. Use Theorem 3.5.3 to argue that X̄n is asymptotically normal. To
obtain the result on (Yn − θ)/√(4θ²/(9n)), define a triangular array (Xnk)n≥k≥1 by putting
Xnk = (6/(θn^(3/2)))(kUk − kθ/2) and apply Theorem 3.5.7. ◦

B.4 Hints for Chapter 4

Hints for exercise 4.1. Use that ν1 and ν2 are signed measures to show that αν1 +βν2 satisfies
(i) and (ii) in the definition. ◦

Hints for exercise 4.2. For µ ≪ τ: Use that τ(A) = 0 if and only if A = ∅.

For the density f: Recall that any function g : N0 → R is measurable and

    ∫_A g dτ = Σ_{a∈A} g(a)   for A ∈ P(N0),

and write µ(A) = Σ_{a∈A} f(a) for a suitable function f.

Finally show that ν ≪ µ and find a counterexample to µ ≪ ν. ◦

Hints for exercise 4.3. Define ν = ν1 + ν2 and argue that ν ≪ µ. Argue that there exist
measurable and µ-integrable functions f, f1, f2 with ν(F) = ∫_F f dµ, ν1(F) = ∫_F f1 dµ, and
ν2(F) = ∫_F f2 dµ for all F ∈ F. Show that

    ∫_F f dµ = ∫_F f1 dµ + ∫_F f2 dµ   for all F ∈ F

and conclude the desired result. ◦

Hints for exercise 4.4. Argue that ν ≪ µ and let f = dν/dµ, h = dν/dπ, and g = dπ/dµ. Note
that π and g are a non-negative measure and density as known from Sand1. Show that
ν(F) = ∫_F hg dµ and ν(F) = ∫_F f dµ. ◦


Hints for exercise 4.5. Let f = dν/dµ and g = dµ/dν. Show that ν(F) = ∫_F fg dν and
ν(F) = ∫_F 1 dν. Conclude the desired result ν-a.e. Obtain the result µ-a.e. by symmetry. ◦

Hints for exercise 4.6. Find a disjoint sequence of sets (Fn) such that µ(Fn) < ∞ and
∪ Fn = Ω. Define the measures µn(F) = µ(F ∩ Fn) and νn(F) = ν(F ∩ Fn) and show that
νn ≪ µn for all n. Let fn = dνn/dµn and fn = 0 on Fn^c (why is that OK?). Define f = Σ_{n=1}^∞ fn.
Show that

    ∫ |f| dµ = · · · = Σ_{n=1}^∞ ( ∫_(fn>0) fn dµ − ∫_(fn<0) fn dµ ) ≤ · · · ≤ 2 sup_{F∈F} |ν(F)| < ∞

and that ν(F) = ∫_F f dµ. ◦

Hints for exercise 4.7. Show that X is a conditional expectation of Y given D, and that Y
is a conditional expectation of X given D. ◦

Hints for exercise 4.8. Straightforward application of Theorem 4.2.6 (2), (5) and (7). ◦

Hints for exercise 4.9. ”⇐” is trivial. For ”⇒” show that EX² = EY² and use that X = Y
a.s. if and only if E[(X − Y)²] = 0. Apply Theorem 4.2.6. ◦

Hints for exercise 4.10. Straightforward calculations! ◦

Hints for exercise 4.11. Recall that x+ = max{x, 0}. Show and use P (0 ≤ E(X + |D)) = 1
and P (E(X|D) ≤ E(X + |D)) = 1. ◦

Hints for exercise 4.12. Use that |x| = x+ + x− and (−x)+ = x− and Exercise 4.11. ◦

Hints for exercise 4.13. Show that E(X|D) = 1/2. Show and use that if D is countable, then
1D = 0 a.s. and 1D^c = 1 a.s. ◦

Hints for exercise 4.14. Compare σ(X1 ) and σ(X2 ). ◦

Hints for exercise 4.15. Define

    H = {D ∈ F | ∫_D Y dP = ∫_D X dP}

and use Dynkin's lemma (Lemma A.2.2) to show that σ(G) ⊆ H (it is assumed that G ⊆ H).

Hints for exercise 4.16.

(1) Write E(Y|Z) = φ(Z) (!) so e.g. the left hand side equals E(φ(Z)1(Z∈B)1(X∈C)). Use
that Z ⊥⊥ X and (Y, Z) ⊥⊥ X.

(2) Use Exercise 4.15 and 1 to show that E(Y |Z) is a conditional expectation of Y given
σ(Z, X).

Hints for exercise 4.17.

(1) Show that σ(Sn , Sn+1 , Sn+2 , . . .) = σ(Sn , Xn+1 , Xn+2 , . . .). Use Exercise 4.16 and that
the Xn –variables are independent.

(2) First show that (1/n)Sn = E(X1|Sn) by checking the conditions for being a conditional
expectation of X1 given Sn. For the proof of

    ∫_(Sn∈B) (1/n)Sn dP = ∫_(Sn∈B) X1 dP   for all B ∈ B,

use (and show) that 1(Sn∈B)Xk =D 1(Sn∈B)X1 (the distributions are equal) for all k =
1, . . . , n.



Hints for exercise 4.18. Write X = Z + µ1 + cY, where Z = (X − µ1) − cY and c is chosen
such that Cov(Z, Y) = 0. (Recall that in that case Z and Y will be independent!) ◦

B.5 Hints for Chapter 5

Hints for exercise 5.5. For ⇒, let Fn = (τ = n). For ⇐, show and use that

    (τ = m) = ∩_{n=1}^{m−1} Fn^c ∩ Fm. ◦

Hints for exercise 5.6. Use the partition (τ = m) = ∪_{k=1}^{m−1} (τ = m, σ = k). ◦

Hints for exercise 5.7. (1): For ⇐, let τ = n and k = n + 1. For ⇒, use Corollary 5.2.13. ◦

Hints for exercise 5.8. For (2): Use Exercise 5.4. For (3): Show E(S1) = 0 and use (2) in
Exercise 5.7. For (4): Use the Strong Law of Large Numbers to show that Sn → +∞ a.s. For (5):
Use monotone convergence of both sides in (3). ◦

Hints for exercise 5.9. First show that (Sn, Fn) is a martingale with a suitable choice of (Fn).
Then use the independence to show that ESn² ≤ Σ_{n=1}^∞ EXn² < ∞ for all n ∈ N. Finally
use the martingale convergence theorem (for the argument of sup_n ESn⁺ < ∞, recall that
x⁺ ≤ |x| ≤ x² + 1). ◦

Hints for exercise 5.10. For (2): See that Mn ≥ 0 and use Theorem 5.3.2. For (3): Use
Fatou’s lemma. For (5): Use Exercise 5.4. For (7): Use Corollary 5.2.13 and the fact that
τ ∧ n and n are bounded stopping times. For (9): Use (7)+(8)+ dominated convergence. For
(10): Let q = P(Sτ = b) and write EMτ = q·r^b + (1 − q)·r^a. ◦

Hints for exercise 5.11. For (1): Exploit the inequality (x − y)² ≥ 0. For the integrability, use
that 1Dn E(X|D) is bounded by n. For (2): use that both 1Dn and E(X|D) are D-measurable.
For (3): Use (1) and (2) to obtain that

    E(1Dn (X² − E(X|D)²) | D) ≥ 0   a.s.

Hints for exercise 5.12.



(1) Use Exercise 5.11 and that E(Xn+1 |Fn )2 = Xn2 a.s. by the martingale property.

(2) Use Exercise 5.4.

(3) Use Corollary 5.2.13, since τ ∧ n and n are bounded stopping times.

(4) Write

    EX²τ∧n = ∫_(max_{k=1,...,n}|Xk|≥ε) X²τ∧n dP + ∫_(max_{k=1,...,n}|Xk|<ε) X²τ∧n dP

and use (and prove) that |Xτ∧n| ≥ ε on the set (max_{k=1,...,n} |Xk| ≥ ε).

Hints for exercise 5.13. For (3): Show that An ∈ Fτ∧n (recall the definition of Fτ∧n) and
use this to show

    ∫_An Yτ∧n dP ≤ ∫_An E(Yn|Fτ∧n) dP = ∫_An Yn dP. ◦

Hints for exercise 5.14. (1): Use Theorem 5.4.5 and the fact that E|Xn − 0| = EXn.
(2): According to (1), the variables should have both positive and negative values. Use linear
combinations of indicator functions like 1[0,1/n) and 1[1/n,2/n). ◦

Hints for exercise 5.15. For (1): Use (10) in Theorem 4.2.6 and the definition of conditional
expectations. For (2): First divide into the two situations |X| ≤ K and |X| > K, and
secondly use Markov's inequality. For (3): Obtain that for all K ∈ N

    lim sup_{x→∞} sup_{D∈G} ∫_(|E(X|D)|>x) |E(X|D)| dP ≤ ∫_(|X|>K) |X| dP.

Let K → ∞ and use dominated convergence. ◦

Hints for exercise 5.16.

(1) This IS very easy.


(2) Use Theorem 5.4.5 and that Xτ∧n →P Xτ.

(3) Show that EXτ ∧n → EXτ (use e.g. the remark before Definition 5.4.1) and that
EXτ ∧n = EX1 .

(4) Show and use that

    ∫_(|Xτ∧n|>x) |Xτ∧n| dP ≤ ∫_(|Y|>x) |Y| dP

for all x > 0 and n ∈ N. Use dominated convergence to see that the right hand side
→ 0 as x → ∞.

(5) Write

    Σ_{n=0}^∞ P(τ > n) = Σ_{n=0}^∞ Σ_{k=0}^∞ 1(k>n) P(τ = k)

and interchange the sums.

(6) Define Y as the right hand side of (5.32) and expand E|Y|:

    E|Y| = . . . = E|X1| + Σ_{m=1}^∞ ∫_(τ>m) E(|Xm+1 − Xm| | Fm) dP

Use E(|Xn+1 − Xn| | Fn) ≤ B a.s. and (5) to obtain that E|Y| < ∞. Use (1)-(4).

(9) Use (6)-(8) to show that EZσ = 0. Furthermore realise that Zσ = Sσ − σξ.
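
The interchange of sums in step (5) is the standard tail-sum formula for the mean of a nonnegative integer-valued variable; written out (for fixed k, the inner count over n is exactly k):

```latex
\sum_{n=0}^{\infty} P(\tau > n)
  = \sum_{n=0}^{\infty} \sum_{k=0}^{\infty} 1_{(k > n)}\, P(\tau = k)
  = \sum_{k=0}^{\infty} P(\tau = k) \sum_{n=0}^{k-1} 1
  = \sum_{k=0}^{\infty} k\, P(\tau = k)
  = E\tau .
```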

Hints for exercise 5.17. For (2): Use that all Yn ≥ 0. For (3): Write Y = lim inf_{n→∞} Yn
and use Fatou's Lemma. For (4): Write Yn = exp(Σ_{k=1}^n log(Xk)) and use the Strong Law
of Large Numbers to show (1/n) Σ_{k=1}^n log(Xk) → ξ < 0 a.s. For (5): Use that if Yn →L1 Z
then Yn →P Z and conclude that Z = 0 a.s. Realise that Yn does not converge to 0 in L1
(you might need that if Yn →P Y and Yn →P Z, then Y = Z a.s.). ◦

Hints for exercise 5.18.

(1) Use the triangle inequality to obtain |E|Xn | − E|X|| ≤ E|Xn − X|. For the second
claim use Theorem 1.2.8.
(3) Define Un = Xn − X and Vn = |Xn| + |X|. Showing Xn →L1 X will be equivalent (?)
to showing lim sup_{n→∞} E|Xn − X| = 0. Argue and use that lim sup_{n→∞} |Xn − X| = 0 a.s.

(4) Let (nℓ) be a subsequence such that

    E|Xnℓ − X| → lim sup_{n→∞} E|Xn − X|   as ℓ → ∞

(such a subsequence will always exist). Find a subsequence (nℓk) of (nℓ) such that
Xnℓk → X a.s.

Conclude from (3) that E|Xnℓk − X| → 0 and use the uniqueness of this limit to derive
lim sup_{n→∞} E|Xn − X| = 0.

(5) Use (4) and Theorem 5.4.5 to conclude that (Yn) is uniformly integrable. Use (2) in
Thm 5.4.7 (you might need that if Yn →P Y and Yn →P Z, then Y = Z a.s.).

(6) For ⇐, do as in (5) and use furthermore Thm 5.4.7 to conclude that Y closes.
For ⇒, use 5.4.7, 5.4.5, and (1).

Hints for exercise 5.19.

(3) Use that if τ1 = n then we have lost the first n − 1 games and won the n'th game,
such that (like the example in the exercise)

    X1 = −1, X2 = −1 − 2, . . . , Xn−1 = −Σ_{k=1}^{n−1} 2^{k−1}, Xn = −Σ_{k=1}^{n−1} 2^{k−1} + 2^{n−1} = 1

(5) It may be useful to recall that (−Xn, Fn) is a submartingale, and (−Xτk, Fτk) is a
supermartingale.

(6): Show and use that ((−Xn)⁺) is uniformly integrable.

Hints for exercise 5.20.

(1) Use x⁺ ≤ 1 + |x|^p to show sup EXn⁺ < ∞.


(2) For E|X|^p < ∞, apply Fatou's lemma to E|X|^p = E lim inf_n |Xn|^p. For Xn →Lp X, show
that |Xn − X|^p ≤ 2^p sup_n |Xn|^p and use dominated convergence on E|Xn − X|^p.
(3) Hint: Write P(Z ≥ t) = ∫_0^∞ 1[t,∞)(x) dZ(P)(x) and apply Tonelli's Theorem to the
right hand side. Use that 1[t,∞)(x) = 1[0,x](t).

(4) Combine (3) and Doob’s inequality from Exercise 5.13.

(5) Apply Toenlli’s theorem to the double integral in (4).


(6) Apply Hölder to E(Mn^(p−1)|Xn|) (use that (p−1)/p + 1/p = 1). Combine with the inequality
from (5).
  
(7) Write E(sup_n |Xn|^p) = E(lim_n Mn^p) and use monotone convergence together with the
inequality from (6).

Hints for exercise 5.21.

(2) Realize that if τa,b > nm, then especially

    Sm, S2m, S3m, . . . , Snm ∈ (a, b),

which leads to the conclusion that

    |Sm − S0| < b − a, |S2m − Sm| < b − a, . . . , |Snm − S(n−1)m| < b − a.

Use that all these differences are independent and identically distributed.

(3) Use (2). Choose a fixed m such that P (|Sm | < b − a) < 1 and let n → ∞. The second
statement is trivial.

(4) For the first result, use optional sampling for the martingale (Sn , Fn ) and e.g. the
two bounded stopping times 1 and τa,b ∧ n. For the second result, let n → ∞ using
dominated (since Sτa,b ∧n is bounded) and monotone convergence.

(5) Write ESτa,b = aP (Sτa,b = a) + bP (Sτa,b = b). Use (3).

(6) Apply the arguments from (4) to the martingale (Sn² − n, Fn). For the second statement,
use that the distribution of Sτa,b is well–known from (5).

(7) Use (3).

(8) Use that on F we have τ−n,b = τb if and only if Sτ−n,b = b.

(9) Use and show that if ω ∈ G, then τb (ω) < ∞.

(10) Use that τ−n,b ↑ τb as n → ∞.



(11) See that ESτb ≠ ES1 and compare with Theorem 5.4.9.

(12) Write

    (sup_{n∈N} Sn = ∞) = ∩_{n=1}^∞ (τn < ∞)

Hints for exercise 5.22.

(2): Define the triangular array (Xnm) by

    Xnm = (1/√(nα²σ²)) Zm−1 Ym

and use Brown's Theorem to show that

    (1/√(nα²σ²)) Mn →wk N(0, 1).

(4) Recall that EYn = 0.

(5) Show that sup_{n∈N} ENn² < ∞ and use the martingale convergence theorem.

(6) Use appropriate theorems from Chapter 5.

(9) Use the almost sure convergence from (6) and Kronecker’s lemma.

(11) Use the result from (2) with Zm = Ym for all m ≥ 0. Use (10) to see that the
assumptions are fulfilled.

B.6 Hints for Chapter 6

Hints for exercise 6.1. Find the distribution of (X̃t1 , . . . , X̃tn ). ◦

Hints for exercise 6.2. Show the result in two steps:

(1) Show that P (D1 ∩ D2 ) = P (D1 )P (D2 ) for all D1 ∈ D1 , D2 ∈ σ(D2 ).



(2) Show (6.20).

For (1): Let D1 ∈ D1 and define

ED1 = {F ∈ σ(D2 ) : P (D1 ∩ F ) = P (D1 )P (F )}

and then show that ED1 = σ(D2 ) using Lemma A.2.2. Note that you already have D2 ⊆
ED1 ⊆ σ(D2 ).

For (2): Let A ∈ σ(D2) and define

    EA = {F ∈ σ(D1) : P(F ∩ A) = P(F)P(A)}.

Show that EA = σ(D1) (as for (1)). ◦

Hints for exercise 6.3. Show that D ⊥⊥ σ(Xu − Xt) and use Exercise 6.2. ◦

Hints for exercise 6.4. Write Xt = Xt − Xs + Xs and use Exercise 6.3. ◦

Hints for exercise 6.6. Show that the finite–dimensional distributions are the same. ◦

Hints for exercise 6.7.

(1) Find H > 0 such that for all γ > 0 and all 0 ≤ t1 < · · · < tn

    (Xγt1, . . . , Xγtn) =D (γ^H Xt1, . . . , γ^H Xtn).

(2) Use both of the assumptions: stationary increments and self–similarity.

(3) Use (2) and the assumption P (X1 = 0) = 0.

(4) You need to show that for some t ≥ 0 and tn → t, then for all ε > 0

    lim_{n→∞} P(|Xtn − Xt| > ε) = 0.

Use (2).

Hints for exercise 6.8.



(1) Write Cn,M on the form

    Cn,M = { x ∈ C[0,∞) | sup_{q∈[n,n+1]∩Q} xq/q > 1/M } = ∪_{q∈[n,n+1]∩Q} (|X̃q| . . .)

(2) Use that

    (Y ∈ Cn,M) = ( sup_{q∈[n,n+1]∩Q} |Yq|/q > 1/M )

and Lemma 6.2.9.

(3) For the first result use Markov’s inequality and that EU 4 = 3 if U ∼ N (0, 1).

(4) Use Borel–Cantelli.

(5) Start by showing that

    P( sup_{t∈[n,n+1]} |Yt|/t ≤ 1/M evt. ) = 1

for all M ∈ N.

(6) Use (5).

B.7 Hints for Appendix A

Hints for exercise A.1. To obtain the supremum, use that weak inequalities are preserved by
limits. ◦

Hints for exercise A.2. To obtain the supremum, use Lemma A.1.3 and the fact that Q is
dense in R. ◦

Hints for exercise A.3. Apply Fatou’s lemma to the sequence (g − fn )n≥1 . ◦
Index

Cb(R), 61
Cb∞(R), 84
Cb∞(Rd), 95
Cbu(R), 61
C[0,∞), 206
[ · ], 238
BC, 72
B[0,∞), 192
C[0,∞), 207
I(X), 48
IT, 36
R∗, 229
R[0,∞), 192
σ-algebra, 2, 233
  Borel, 2
  generated by a family of sets, 2
  generated by a family of variables, 3
  infinite-dimensional Borel, 45

Adapted sequence, 137
Asymptotic normality, 92
  and convergence in probability, 92
  and transformations, 93

Bimeasurable map, 125
Birkhoff-Khinchin ergodic theorem, 38
Borel-Cantelli lemma, 12, 20
Bounded, signed measure, 104
  absolute continuity, 107
  concentrated on set, 107
  continuity, 104
  density, 106, 107
  given as integral, 106
  properties, 104
  Radon-Nikodym derivative, 107
  relation to positive measure, 105
  singularity, 107
Brownian motion, 194
  continuous version, 201
  drift, 194
  existence, 194
  finite-dimensional distribution, 195
  law of the iterated logarithm, 217
  normalised, 194, 223
  points with value 0, 221
  quadratic variation of, 212
  variance, 194
  variation of, 212
Bump function, 61

Cauchy sequence, 14
Central limit theorem
  classical, 87, 88, 91
  Lindeberg's, 89
  Lyapounov's, 91
  martingale, 172
Change-of-variable formula, 237
Characteristic function, 75
  and convolutions, 80
  and linear transformations, 77
  and the exponential distribution, 78
  and the Laplace distribution, 79
  and the normal distribution, 78
  properties, 75
  uniqueness of, 83
Chebychev–Kolmogorov inequality, the, 179
Closing of martingales, 156
Complex conjugate, 72
Conditional expectation
  and independence, 120
  and monotone convergence, 120
  existence, 117
  given σ-algebra, 116
  given Y = y, 127
  given finite σ-algebra, 118
  given random variable, 124
  Jensen's inequality for, 120
  properties, 119
  uniform integrability, 180
  uniqueness, 117
Confidence interval, 94, 98
Continuity in probability, 200
Convergence
  Almost surely, 12
  almost surely, 4
  and limes superior and limes inferior, 232
  completeness of modes of, 14
  in Lp, 4
  in distribution, 4
  in law, 4
  In probability, 12
  in probability, 4
  Khinchin-Kolmogorov theorem, 22
  relationship between modes of, 8
  stability properties, 7, 10
  uniqueness of limits, 6
  weak, of probability measures, 60
Convolution, 79

Delta method, the, 93
Dirac measure, 66
Dominated convergence theorem, the, 236
Doob's Inequality, 179
Down-crossing, 146
  and convergence, 146
  lemma, 149
  number of, 148
Drift of Brownian motion, 194
Dynkin's lemma, 2, 234

Ergodic theorem for stationary processes, 49
Ergodicity, 36
  of a stochastic process, 47, 48
  sufficient criteria for, 40–42
Eventually, 11

Fatou's lemma, 235
Favourable game, 134
Filtration, 137
Finite-dimensional distribution of stochastic process, 193
Fubini's theorem, 236

Gambling theory, 134

Hölder's inequality, 237

Image measure, 237
Independence
  of σ-algebras, 15
  of events, 17
  of random variables, 17
  of transformed variables, 18
  sufficient criteria for, 16
Infimum, 229
Infinitely often, 11
Integer part function, 238
Integral
  of complex functions, 72
Invariant σ-algebra, 36
  of stationary process, 48
Invariant random variable, 36
  measurability, 36
Iterated logarithm, law of, 217

Jacobian, 96
Jensen's inequality, 236
  for conditional expectation, 120
Jordan-Hahn decomposition, the, 108

Khinchin-Kolmogorov theorem, the, 22
Kolmogorov's consistency theorem, 194, 239
Kolmogorov's three-series theorem, 24
Kolmogorov's zero-one law, 19

Law of the iterated logarithm, 217
Lebesgue decomposition, the, 111
Limes inferior, 231
Limes superior, 231
Lower bound, 229

Martingale, 137
  central limit theorem, 172
  closing of, 156
  continuous time, 224
  convergence theorem, the, 146
  integral definition of, 138
  optional sampling, 141
  sub-, 137
  super-, 137
Martingale difference, 165
  and martingales, 165
  array, 166
  compensator, 166
  square-integrable, 165
Maximal ergodic lemma, 37
Maximal inequality
  Kolmogorov's, 21
Measurability, 3, 235
  σ-algebra generated by variable, 125
Measurable space, 234
Measure, 234
  bounded, positive, 104
  bounded, signed, 104
Measure preservation, 36
  sufficient criteria for, 40
Measure space
  σ-finite, 234
Mixing, 42
Monotone convergence theorem, the, 235

Nowhere monotone function, 207
  Brownian motion, 212

Optional sampling, 141, 160
  at sequence of sampling times, 145
  bounded stopping times, 144
  infinite stopping times, 160

Probability measure, 2
  σ-additivity, 2
  downwards continuity, 3
  of measurable functions, 235
  uniqueness, 234
  upwards continuity, 3
Probability space, 2
  filtered, 137

Quadratic variation, 211
  and variation, 211
  of Brownian motion, 212

Radon-Nikodym
  derivative, 107
  theorem, the, 114
Random variable, 3
  p'th moment of, 3
  mean of, 3
Random walk, 165
Riesz-Fischer theorem, the, 238

Sampling times, sequence of, 144
Scheffé's lemma, 65
Shift operator, 47
Slutsky's lemma, 70
Stochastic process
  adapted, 137
  at infinite stopping time, 160
  at stopping time, 140, 160
  continuous-time, 192
  discrete-time, 3
  distribution of, 46, 193
  down-crossings, number of, 148
  finite-dimensional distribution, 193
  sample path, 197
  stationary, 47
  version of, 198
Stopping time, 138
  σ-algebra, 140
  finite, 138
  optional sampling, 160
Strategy, 134
Strong law of large numbers, 27
Submartingale, 137
  closing of, 156
  convergence of, 146
  integral definition of, 138
Sum of independent variables, 21
  divergence of, 21
Supermartingale, 137
Supremum, 229

Tail σ-algebra, 19
Taylor expansion, 239
Taylor's theorem, 239
Tightness, 63
Tonelli's theorem, 236
Triangular array, 88, 166

Uniform integrability, 151
  and L1-convergence, 154
  and closing, 156
  finite family of variables, 151
Uniform norm, 63
Upper bound, 229
Urysohn function, 61

Variance of Brownian motion, 194
Variation, 208
  and quadratic variation, 211
  bounded, 209
  of Brownian motion, 212
  of monotone function, 209
  properties, 208
  unbounded, 210
Version
  continuous, 200
  of stochastic process, 198

Weak convergence of probability measures
  and characteristic functions, 84
  and continuous transformations, 71
  and convergence in probability, 70
  and distribution functions, 67, 68
  examples, 66
  in higher dimensions, 95
  relation to convergence of variables, 60
  stability properties, 64
  sufficient criteria for, 84
  uniqueness of limits, 62
Weak mixing, 42
Weak stationarity, 56


Bibliography

R. B. Ash: Real analysis and Probability, Academic Press, 1972.

T. M. Apostol: Calculus, Volume 1, Blaisdell Publishing Company, 1964.

L. Breiman: Probability, Addison-Wesley, 1968.

M. Loève: Probability Theory I, Springer-Verlag, 1977.

M. Loève: Probability Theory II, Springer-Verlag, 1977.

P. Brémaud: Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues, Springer-
Verlag, 1999.

S. N. Ethier & T. G. Kurtz: Markov Processes: Characterization and Convergence, Wiley,
1986.

J. R. Norris: Markov Chains, Cambridge University Press, 1999.

S. Meyn & R. L. Tweedie: Markov chains and stochastic stability, Cambridge University
Press, 2009.

P. Billingsley: Convergence of Probability Measures, 2nd edition, Wiley, 1999.

K. R. Parthasarathy: Probability measures on Metric Spaces, Academic Press, 1967.

I. Karatzas, S. E. Shreve: Brownian Motion and Stochastic Calculus, Springer-Verlag, 1988.

N. L. Carothers: Real Analysis, Cambridge University Press, 2000.

E. Hansen: Measure theory, Department of Mathematical Sciences, University of Copenhagen, 2004.

O. Kallenberg: Foundations of Modern Probability, Springer-Verlag, 2002.



L. C. G. Rogers & D. Williams: Diffusions, Markov Processes and Martingales, Volume 1:
Foundations, 2nd edition, Cambridge University Press, 2000.

L. C. G. Rogers & D. Williams: Diffusions, Markov Processes and Martingales, Volume 2:
Itô calculus, 2nd edition, Cambridge University Press, 2000.
MCA

SEMESTER - II

PROBABILITY &
STATISTICS
mca-5 230

1
PROBABILITY
INTRODUCTION TO PROBABILITY

Managers need to cope with uncertainty in many decision


making situations. For example, you as a manager may assume
that the volume of sales in the coming year is known exactly to
you. This is not true: you only know roughly what the next year's
sales will be; you cannot give the exact number. There is some
uncertainty. Concepts of probability will help you to measure
uncertainty and perform associated analyses. This unit provides the
conceptual framework of probability and the various probability
rules that are essential in business decisions.
Learning objectives:
After reading this unit, you will be able to:
• Appreciate the use of probability in decision making
• Explain the types of probability
• Define and use the various rules of probability depending on
the problem situation.
• Make use of the expected values for decision-making.

Probability
Sets and Subsets
The lesson introduces the important topic of sets, a simple
idea that recurs throughout the study of probability and statistics.
Set Definitions
• A set is a well-defined collection of objects.
• Each object in a set is called an element of the set.
• Two sets are equal if they have exactly the same elements
in them.
• A set that contains no elements is called a null set or an
empty set.
• If every element in Set A is also in Set B, then Set A is a
subset of Set B.

Set Notation
• A set is usually denoted by a capital letter, such as A, B, or
C.
• An element of a set is usually denoted by a small letter, such
as x, y, or z.
• A set may be described by listing all of its elements enclosed
in braces. For example, if Set A consists of the numbers 2,
4, 6, and 8, we may say: A = {2, 4, 6, 8}.
• The null set is denoted by ∅ (or { }).
• Sets may also be described by stating a rule. We could


describe Set A from the previous example by stating: Set A
consists of all the even single-digit positive integers.
Set Operations
Suppose we have four sets - W, X, Y, and Z. Let these sets be
defined as follows: W = {2}; X = {1, 2}; Y= {2, 3, 4}; and Z = {1, 2, 3,
4}.
• The union of two sets is the set of elements that belong to
one or both of the two sets. Thus, set Z is the union of sets X
and Y.
• Symbolically, the union of X and Y is denoted by X ∪ Y.
• The intersection of two sets is the set of elements that are
common to both sets. Thus, set W is the intersection of sets
X and Y.
• Symbolically, the intersection of X and Y is denoted by X ∩
Y.
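
These operations map directly onto Python's built-in set type; a minimal sketch using the sets W, X, Y, and Z defined above:

```python
# The four sets from the example above
W = {2}
X = {1, 2}
Y = {2, 3, 4}
Z = {1, 2, 3, 4}

# Union: elements that belong to one or both sets
print((X | Y) == Z)   # True -- X union Y gives Z

# Intersection: elements common to both sets
print((X & Y) == W)   # True -- X intersect Y gives W

# Subset: every element of W is also in X
print(W <= X)         # True
```

The operators `|`, `&`, and `<=` correspond to ∪, ∩, and the subset relation, respectively.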

Sample Problems
1. Describe the set of vowels.

If A is the set of vowels, then A could be described as A = {a,


e, i, o, u}.
2. Describe the set of positive integers.

Since it would be impossible to list all of the positive


integers, we need to use a rule to describe this set. We
might say A consists of all integers greater than zero.
3. Set A = {1, 2, 3} and Set B = {3, 2, 1}. Is Set A equal to Set
B?

Yes. Two sets are equal if they have the same elements.
The order in which the elements are listed does not matter.
4. What is the set of men with four arms?

Since all men have two arms at most, the set of men with
four arms contains no elements. It is the null set (or empty
set).
5. Set A = {1, 2, 3} and Set B = {1, 2, 4, 5, 6}. Is Set A a subset
of Set B?

Set A would be a subset of Set B if every element from Set A


were also in Set B. However, this is not the case. The
number 3 is in Set A, but not in Set B. Therefore, Set A is not
a subset of Set B.

Statistical Experiments
All statistical experiments have three things in common:
• The experiment can have more than one possible outcome.
• Each possible outcome can be specified in advance.
• The outcome of the experiment depends on chance.

A coin toss has all the attributes of a statistical experiment. There is


more than one possible outcome. We can specify each possible
outcome (i.e., heads or tails) in advance. And there is an element of
chance, since the outcome is uncertain.

The Sample Space


• A sample space is a set of elements that represents all
possible outcomes of a statistical experiment.
• A sample point is an element of a sample space.
• An event is a subset of a sample space - one or more
sample points.

Types of events
• Two events are mutually exclusive if they have no sample
points in common.
• Two events are independent when the occurrence of one
does not affect the probability of the occurrence of the other.

Sample Problems
1. Suppose I roll a die. Is that a statistical experiment?

Yes. Like a coin toss, rolling dice is a statistical experiment.


There is more than one possible outcome. We can specify each
possible outcome in advance. And there is an element of
chance.
2. When you roll a single die, what is the sample space?

The sample space is all of the possible outcomes - an integer


between 1 and 6.
3. Which of the following are sample points when you roll a die - 3,
6, and 9?

The numbers 3 and 6 are sample points, because they are in


the sample space. The number 9 is not a sample point, since it
is outside the sample space; with one die, the largest number
that you can roll is 6.
4. Which of the following sets represent an event when you roll a
die?

A. {1}
B. {2, 4}
C. {2, 4, 6}
D. All of the above

The correct answer is D. Remember that an event is a subset of


a sample space. The sample space is any integer from 1 to 6.

Each of the sets shown above is a subset of the sample space,


so each represents an event.
5. Consider the events listed below. Which are mutually exclusive?

A. {1}
B. {2, 4}
C. {2, 4, 6}
Two events are mutually exclusive, if they have no sample points in
common. Events A and B are mutually exclusive, and Events A and
C are mutually exclusive; since they have no points in common.
Events B and C have common sample points, so they are not
mutually exclusive.

6. Suppose you roll a die two times. Is each roll of the die an
independent event?

Yes. Two events are independent when the occurrence of one has no effect on the probability of the occurrence of the other. Neither roll of the die affects the outcome of the other roll; so each roll of the die is independent.

Basic Probability
The probability of a sample point is a measure of the likelihood
that the sample point will occur.
Probability of a Sample Point
By convention, statisticians have agreed on the following rules.
 The probability of any sample point can range from 0 to 1.
 The sum of probabilities of all sample points in a sample
space is equal to 1.

Example 1
Suppose we conduct a simple statistical experiment. We flip a coin
one time. The coin flip can have one of two outcomes - heads or
tails. Together, these outcomes represent the sample space of our
experiment. Individually, each outcome represents a sample point
in the sample space. What is the probability of each sample point?
Solution: The sum of probabilities of all the sample points must
equal 1. And the probability of getting a head is equal to the
probability of getting a tail. Therefore, the probability of each
sample point (heads or tails) must be equal to 1/2.

Example 2
Let's repeat the experiment of Example 1, with a die instead of a
coin. If we toss a fair die, what is the probability of each sample
point?

Solution: For this experiment, the sample space consists of six sample points: {1, 2, 3, 4, 5, 6}. Each sample point has equal probability. And the sum of probabilities of all the sample points must equal 1. Therefore, the probability of each sample point must be equal to 1/6.

Probability of an Event
The probability of an event is a measure of the likelihood that the
event will occur. By convention, statisticians have agreed on the
following rules.
 The probability of any event can range from 0 to 1.
 The probability of event A is the sum of the probabilities of all
the sample points in event A.
 The probability of event A is denoted by P(A).
Thus, if event A were very unlikely to occur, then P(A) would be
close to 0. And if event A were very likely to occur, then P(A) would
be close to 1.

Example 1
Suppose we draw a card from a deck of playing cards. What is the
probability that we draw a spade?

Solution: The sample space of this experiment consists of 52 cards, and the probability of each sample point is 1/52. Since there are 13 spades in the deck, the probability of drawing a spade is
P(Spade) = (13)(1/52) = 1/4

Example 2
Suppose a coin is flipped 3 times. What is the probability of getting
two tails and one head?
Solution: For this experiment, the sample space consists of 8
sample points.
S = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}
Each sample point is equally likely to occur, so the probability of
getting any particular sample point is 1/8. The event "getting two
tails and one head" consists of the following subset of the sample
space.
A = {TTH, THT, HTT}
The probability of Event A is the sum of the probabilities of the
sample points in A. Therefore,
P(A) = 1/8 + 1/8 + 1/8 = 3/8
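A quick way to check a count like this is to enumerate the sample space directly. The Python sketch below (variable names are our own, not from the lesson) reproduces the 3/8 result:

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin flips: 2^3 = 8 equally likely outcomes.
sample_space = list(product("HT", repeat=3))

# Event A: exactly two tails and one head.
event_a = [outcome for outcome in sample_space if outcome.count("T") == 2]

p_a = Fraction(len(event_a), len(sample_space))
print(len(sample_space), len(event_a), p_a)  # 8 3 3/8
```

The same enumeration pattern works for any small, equally-likely sample space: list every outcome, filter the event, and divide.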

Working With Probability


The probability of an event refers to the likelihood that the event
will occur.

How to Interpret Probability


Mathematically, the probability that an event will occur is expressed
as a number between 0 and 1. Notationally, the probability of event
A is represented by P(A).
 If P(A) equals zero, there is no chance that the event A will
occur.
 If P(A) is close to zero, there is little likelihood that event A will occur.
 If P(A) is close to one, there is a strong chance that event A
will occur
 If P(A) equals one, event A will definitely occur.
The sum of all possible outcomes in a statistical experiment is
equal to one. This means, for example, that if an experiment can
have three possible outcomes (A, B, and C), then P(A) + P(B) +
P(C) = 1.

How to Compute Probability: Equally Likely Outcomes


Sometimes, a statistical experiment can have n possible outcomes,
each of which is equally likely. Suppose a subset of r outcomes are
classified as "successful" outcomes.
The probability that the experiment results in a successful outcome
(S) is:

P(S) = ( Number of successful outcomes ) / ( Total number of equally likely outcomes ) = r / n
Consider the following experiment. An urn has 10 marbles. Two
marbles are red, three are green, and five are blue. If an
experimenter randomly selects 1 marble from the urn, what is the
probability that it will be green?
In this experiment, there are 10 equally likely outcomes, three of
which are green marbles. Therefore, the probability of choosing a
green marble is 3/10 or 0.30.

How to Compute Probability: Law of Large Numbers


One can also think about the probability of an event in terms of its
long-run relative frequency. The relative frequency of an event is
the number of times an event occurs, divided by the total number of
trials.
P(A) = ( Frequency of Event A ) / ( Number of Trials )

For example, a merchant notices one day that 5 out of 50 visitors to her store make a purchase. The next day, 20 out of 50 visitors make a purchase. The two relative frequencies (5/50 or 0.10 and 20/50 or 0.40) differ. However, summing results over many visitors, she might find that the probability that a visitor makes a purchase gets closer and closer to 0.20.

The scatterplot (above right) shows the relative frequency as the number of trials (in this case, the number of visitors) increases. Over many trials, the relative frequency converges toward a stable value (0.20), which can be interpreted as the probability that a visitor to the store will make a purchase.

The idea that the relative frequency of an event will converge on the probability of the event, as the number of trials increases, is called the law of large numbers.
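The law of large numbers is easy to see in simulation. The sketch below assumes a true purchase probability of 0.20, as in the merchant example, and computes the relative frequency over many simulated visitors:

```python
import random

random.seed(1)   # fixed seed so the run is reproducible
p_true = 0.20    # assumed probability that a visitor makes a purchase

# Simulate visitors; count how many make a purchase.
n_trials = 100_000
purchases = sum(1 for _ in range(n_trials) if random.random() < p_true)

relative_frequency = purchases / n_trials
print(round(relative_frequency, 2))
```

With this many trials the relative frequency lands very close to 0.20; rerunning with a small n_trials (say 50) shows how much the day-to-day proportions can wander.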

Test Your Understanding of This Lesson


Problem
A coin is tossed three times. What is the probability that it lands on
heads exactly one time?
(A) 0.125
(B) 0.250
(C) 0.333
(D) 0.375
(E) 0.500

Solution
The correct answer is (D). If you toss a coin three times, there are a
total of eight possible outcomes. They are: HHH, HHT, HTH, THH,
HTT, THT, TTH, and TTT. Of the eight possible outcomes, three
have exactly one head. They are: HTT, THT, and TTH. Therefore,
the probability that three flips of a coin will produce exactly one
head is 3/8 or 0.375.

Rules of Probability
Often, we want to compute the probability of an event from the
known probabilities of other events. This lesson covers some
important rules that simplify those computations.

Definitions and Notation


Before discussing the rules of probability, we state the following
definitions:
 Two events are mutually exclusive or disjoint if they
cannot occur at the same time.
 The probability that Event A occurs, given that Event B has
occurred, is called a conditional probability. The
conditional probability of Event A, given Event B, is denoted
by the symbol P(A|B).
 The complement of an event is the event not occurring. The
probability that Event A will not occur is denoted by P(A').
 The probability that Events A and B both occur is the
probability of the intersection of A and B. The probability of
the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B are mutually exclusive, P(A ∩ B) = 0.
 The probability that Events A or B occur is the probability of
the union of A and B. The probability of the union of Events
A and B is denoted by P(A ∪ B) .
 If the occurrence of Event A changes the probability of Event B, then Events A and B are dependent. On the other hand, if the occurrence of Event A does not change the probability of Event B, then Events A and B are independent.


Rule of Subtraction
In a previous lesson, we learned two important properties of
probability:
 The probability of an event ranges from 0 to 1.
 The sum of probabilities of all possible events equals 1.
The rule of subtraction follows directly from these properties.

Rule of Subtraction. The probability that event A will occur is equal to 1 minus the probability that event A will not occur.
P(A) = 1 - P(A')
Suppose, for example, the probability that Bill will graduate from
college is 0.80. What is the probability that Bill will not graduate
from college? Based on the rule of subtraction, the probability that
Bill will not graduate is 1.00 - 0.80 or 0.20.

Rule of Multiplication
The rule of multiplication applies to the situation when we want to
know the probability of the intersection of two events; that is, we
want to know the probability that two events (Event A and Event B)
both occur.

Rule of Multiplication. The probability that Events A and B both occur is equal to the probability that Event A occurs times the probability that Event B occurs, given that A has occurred.
P(A ∩ B) = P(A) P(B|A)

Example
An urn contains 6 red marbles and 4 black marbles. Two marbles are drawn without replacement from the urn. What is the probability that both of the marbles are black?

Solution: Let A = the event that the first marble is black; and let B =
the event that the second marble is black. We know the following:
 In the beginning, there are 10 marbles in the urn, 4 of which
are black. Therefore, P(A) = 4/10.
 After the first selection, there are 9 marbles in the urn, 3 of
which are black. Therefore, P(B|A) = 3/9.

Therefore, based on the rule of multiplication:
P(A ∩ B) = P(A) P(B|A)
P(A ∩ B) = (4/10)*(3/9) = 12/90 = 2/15
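The same answer can be checked both with the formula and by brute-force enumeration of all ordered draws. This is a sketch; the urn encoding is our own:

```python
from fractions import Fraction
from itertools import permutations

# Formula: P(A ∩ B) = P(A) * P(B|A)
p_formula = Fraction(4, 10) * Fraction(3, 9)

# Brute force: every ordered pair of distinct marbles from the urn.
urn = ["red"] * 6 + ["black"] * 4
pairs = list(permutations(range(10), 2))  # 10 * 9 = 90 ordered draws
both_black = sum(1 for i, j in pairs if urn[i] == "black" and urn[j] == "black")
p_enumerated = Fraction(both_black, len(pairs))

print(p_formula, p_enumerated)  # 2/15 2/15
```

Changing `permutations` so that the first marble is put back (i.e., allowing i == j over all 100 ordered pairs) reproduces the with-replacement answer of 0.16 used in Problem 1 below.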

Rule of Addition
The rule of addition applies to the following situation. We have two
events, and we want to know the probability that either event
occurs.

Rule of Addition. The probability that Event A and/or Event B occurs is equal to the probability that Event A occurs plus the probability that Event B occurs minus the probability that both Events A and B occur.
P(A U B) = P(A) + P(B) - P(A ∩ B)
Note: Invoking the fact that P(A ∩ B) = P( A )P( B | A ), the Addition
Rule can also be expressed as
P(A U B) = P(A) + P(B) - P(A)P( B | A )

Example
A student goes to the library. The probability that she checks out (a)
a work of fiction is 0.40, (b) a work of non-fiction is 0.30, and (c)
both fiction and non-fiction is 0.20. What is the probability that the
student checks out a work of fiction, non-fiction, or both?
Solution: Let F = the event that the student checks out fiction; and
let N = the event that the student checks out non-fiction. Then,
based on the rule of addition:
P(F U N) = P(F) + P(N) - P(F ∩ N)
P(F U N) = 0.40 + 0.30 - 0.20 = 0.50

Test Your Understanding of This Lesson


Problem 1
An urn contains 6 red marbles and 4 black marbles. Two marbles
are drawn with replacement from the urn. What is the probability
that both of the marbles are black?
(A) 0.16
(B) 0.32
(C) 0.36
(D) 0.40
(E) 0.60

Solution
The correct answer is A. Let A = the event that the first marble is
black; and let B = the event that the second marble is black. We
know the following:
 In the beginning, there are 10 marbles in the urn, 4 of which
are black. Therefore, P(A) = 4/10.
 After the first selection, we replace the selected marble; so
there are still 10 marbles in the urn, 4 of which are black.
Therefore, P(B|A) = 4/10.

Therefore, based on the rule of multiplication:
P(A ∩ B) = P(A) P(B|A)
P(A ∩ B) = (4/10)*(4/10) = 16/100 = 0.16

Problem 2
A card is drawn randomly from a deck of ordinary playing cards.
You win $10 if the card is a spade or an ace. What is the probability
that you will win the game?
(A) 1/13
(B) 13/52
(C) 4/13
(D) 17/52
(E) None of the above.

Solution
The correct answer is C. Let S = the event that the card is a spade;
and let A = the event that the card is an ace. We know the
following:
 There are 52 cards in the deck.
 There are 13 spades, so P(S) = 13/52.
 There are 4 aces, so P(A) = 4/52.
 There is 1 ace that is also a spade, so P(S ∩ A) = 1/52.
Therefore, based on the rule of addition:
P(S U A) = P(S) + P(A) - P(S ∩ A)
P(S U A) = 13/52 + 4/52 - 1/52 = 16/52 = 4/13
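Enumerating the deck confirms the count, and makes it clear why the ace of spades is subtracted only once. The card representation below is our own:

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 cards

# You win if the card is a spade or an ace; the ace of spades counts once.
wins = [(rank, suit) for rank, suit in deck if suit == "spades" or rank == "A"]

p_win = Fraction(len(wins), len(deck))
print(len(wins), p_win)  # 16 4/13
```

Because each card appears exactly once in `deck`, the overlap between the two events is handled automatically by the single `or` test.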

Bayes' Theorem (aka, Bayes' Rule)


Bayes' theorem (also known as Bayes' rule) is a useful tool for
calculating conditional probabilities. Bayes' theorem can be stated
as follows:

Bayes' theorem. Let A1, A2, ... , An be a set of mutually exclusive events that together form the sample space S. Let B be any event from the same sample space, such that P(B) > 0. Then,

P( Ak | B ) = P( Ak ∩ B ) / [ P( A1 ∩ B ) + P( A2 ∩ B ) + . . . + P( An ∩ B ) ]

Note: Invoking the fact that P( Ak ∩ B ) = P( Ak )P( B | Ak ), Bayes' theorem can also be expressed as

P( Ak | B ) = P( Ak ) P( B | Ak ) / [ P( A1 ) P( B | A1 ) + P( A2 ) P( B | A2 ) + . . . + P( An ) P( B | An ) ]

Unless you are a world-class statistician, Bayes' theorem (as expressed above) can be intimidating. However, it really is easy to use. The remainder of this lesson covers material that can help you understand when and how to apply Bayes' theorem effectively.

When to Apply Bayes' Theorem


Part of the challenge in applying Bayes' theorem involves
recognizing the types of problems that warrant its use. You should
consider Bayes' theorem when the following conditions exist.

 The sample space is partitioned into a set of mutually
exclusive events { A1, A2, . . . , An }.
 Within the sample space, there exists an event B, for which
P(B) > 0.
 The analytical goal is to compute a conditional probability of
the form: P( Ak | B ).
 You know at least one of the two sets of probabilities
described below.
 P( Ak ∩ B ) for each Ak
 P( Ak ) and P( B | Ak ) for each Ak


Sample Problem
Bayes' theorem can be best understood through an example. This
section presents an example that demonstrates how Bayes'
theorem can be applied effectively to solve statistical problems.

Example 1
Marie is getting married tomorrow, at an outdoor ceremony in the
desert. In recent years, it has rained only 5 days each year.
Unfortunately, the weatherman has predicted rain for tomorrow.
When it actually rains, the weatherman correctly forecasts rain 90%
of the time. When it doesn't rain, he incorrectly forecasts rain 10%
of the time. What is the probability that it will rain on the day of
Marie's wedding?

Solution: The sample space is defined by two mutually-exclusive events - it rains or it does not rain. Additionally, a third event occurs when the weatherman predicts rain. Notation for these events appears below.

 Event A1. It rains on Marie's wedding.


 Event A2. It does not rain on Marie's wedding
 Event B. The weatherman predicts rain.
In terms of probabilities, we know the following:
 P( A1 ) = 5/365 =0.0136985 [It rains 5 days out of the year.]
 P( A2 ) = 360/365 = 0.9863014 [It does not rain 360 days out
of the year.]
 P( B | A1 ) = 0.9 [When it rains, the weatherman predicts rain
90% of the time.]
 P( B | A2 ) = 0.1 [When it does not rain, the weatherman
predicts rain 10% of the time.]
We want to know P( A1 | B ), the probability it will rain on the day of
Marie's wedding, given a forecast for rain by the weatherman. The
answer can be determined from Bayes' theorem, as shown below.

P( A1 | B ) = P( A1 ) P( B | A1 ) / [ P( A1 ) P( B | A1 ) + P( A2 ) P( B | A2 ) ]
P( A1 | B ) = (0.014)(0.9) / [ (0.014)(0.9) + (0.986)(0.1) ]
P( A1 | B ) = 0.111
Note the somewhat unintuitive result. When the weatherman
predicts rain, it actually rains only about 11% of the time. Despite
the weatherman's gloomy prediction, there is a good chance that
Marie will not get rained on at her wedding.
This is an example of something called the false positive paradox. It
illustrates the value of using Bayes theorem to calculate conditional
probabilities.
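The arithmetic in this example takes only a few lines to reproduce. This sketch uses the probabilities stated above; the variable names are our own:

```python
p_rain = 5 / 365          # P(A1): it rains on the wedding day
p_dry = 360 / 365         # P(A2): it does not rain
p_pred_given_rain = 0.9   # P(B | A1): forecast says rain when it rains
p_pred_given_dry = 0.1    # P(B | A2): forecast says rain when it is dry

# Bayes' theorem: P(A1 | B) = P(A1)P(B|A1) / [P(A1)P(B|A1) + P(A2)P(B|A2)]
numerator = p_rain * p_pred_given_rain
posterior = numerator / (numerator + p_dry * p_pred_given_dry)
print(round(posterior, 3))  # 0.111
```

Note that the base rate dominates: because rainy days are so rare, even a fairly accurate forecast of rain leaves the posterior probability near 11%.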

Probability
For an experiment we define an event to be any collection of
possible outcomes.
A simple event is an event that consists of exactly one outcome.
or: means the union i.e. either can occur
and: means intersection i.e. both must occur
Two events are mutually exclusive if they cannot occur simultaneously.
For a Venn diagram, we can tell that two events are mutually exclusive if their regions do not intersect.
We define Probability of an event E to be to be

P(E) = ( number of simple events within E ) / ( total number of possible outcomes )

We have the following:


1. P(E) is always between 0 and 1.
2. The sum of the probabilities of all simple events must be 1.
3. P(E) + P(not E) = 1
4. If E and F are mutually exclusive then

P(E or F) = P(E) + P(F)

The Difference Between And and Or


If E and F are events then we use the terminology
E and F
to mean all outcomes that belong to both E and F

We use the terminology
E Or F
to mean all outcomes that belong to either E or F.

Example
Below is an example of two sets, A and B, graphed in a Venn
diagram.

The green area represents A and B while all areas with color
represent A or B

Example
Our Women's Volleyball team is recruiting for new members.
Suppose that a person inquires about the team.
Let E be the event that the person is female
Let F be the event that the person is a student
then E And F represents the qualifications for being a member of
the team. Note that E Or F is not enough.
We define

Definition of Conditional Probability

P(E|F) = P(E and F) / P(F)

We read the left hand side as


"The probability of event E given event F"
We call two events independent if
For Independent Events
P(E|F) = P(E)

Equivalently, we can say that E and F are independent if

For Independent Events


P(E and F) = P(E)P(F)

Example
Consider rolling two dice. Let
E be the event that the first die is a 3.
F be the event that the sum of the dice is an 8.
Then E and F means that we rolled a three and then we rolled a
5
This probability is 1/36 since there are 36 possible pairs and only
one of them is (3,5)
We have
P(E) = 1/6
And note that (2,6),(3,5),(4,4),(5,3), and (6,2) give F
Hence
P(F) = 5/36
We have
P(E) P(F) = (1/6) (5/36)
which is not 1/36.
We can conclude that E and F are not independent.
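The comparison above can be automated over the 36 equally likely pairs. In this sketch the helper `prob` is our own; the same function can be reused for the exercise that follows by swapping in the events "first die is 1" and "sum is 7":

```python
from fractions import Fraction
from itertools import product

pairs = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def prob(event):
    """Probability of an event, given as a predicate on a (die1, die2) pair."""
    return Fraction(sum(1 for p in pairs if event(p)), len(pairs))

first_is_3 = lambda p: p[0] == 3        # event E
sum_is_8 = lambda p: p[0] + p[1] == 8   # event F

p_e = prob(first_is_3)                                   # 1/6
p_f = prob(sum_is_8)                                     # 5/36
p_both = prob(lambda p: first_is_3(p) and sum_is_8(p))   # 1/36

print(p_both == p_e * p_f)  # False: E and F are not independent
```

Using exact `Fraction` arithmetic matters here: with floats, tiny rounding errors could make an equality test like this unreliable.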
Exercise
Test the following two events for independence:
E the event that the first die is a 1.
F the event that the sum is a 7.
A Counting Rule
For two events, E and F, we always have

P(E or F) = P(E) + P(F) - P(E and F)

Example
Find the probability of selecting either a heart or a face card from a
52 card deck.

Solution
We let
E = the event that a heart is selected
F = the event that a face card is selected
then
P(E) = 1/4 and P(F) = 3/13 (Jack, Queen, or King
out of 13 choices)
P(E and F) = 3/52
The formula gives
P(E or F) = 1/4 + 3/13 - 3/52 = 22/52 ≈ 42%

Trees and Counting


Using Trees
We have seen that probability is defined by
P(E) = ( Number in E ) / ( Number in the Sample Space )
Although this formula appears simple, counting the number in each
can prove to be a challenge. Visual aids will help us immensely.

Example
A native flowering plant has several varieties. The color of the
flower can be red, yellow, or white. The stems can be long or short
and the leaves can be thorny, smooth, or velvety. Show all
varieties.

Solution
We use a tree diagram. A tree diagram is a diagram that branches
out and ends in leaves that correspond to the final variety. The
picture below shows this.

Outcome
An outcome is the result of an experiment or other situation
involving uncertainty.
The set of all possible outcomes of a probability experiment is
called a sample space.

Sample Space
The sample space is an exhaustive list of all the possible outcomes
of an experiment. Each possible result of such a study is
represented by one and only one point in the sample space, which
is usually denoted by S.

Examples
Experiment Rolling a die once:
Sample space S = {1,2,3,4,5,6}
Experiment Tossing a coin:
Sample space S = {Heads,Tails}
Experiment Measuring the height (cms) of a girl on her first day at
school:
Sample space S = the set of all possible real numbers

Event
An event is any collection of outcomes of an experiment.
Formally, any subset of the sample space is an event.
Any event which consists of a single outcome in the sample space
is called an elementary or simple event. Events which consist of
more than one outcome are called compound events.
Set theory is used to represent relationships among events. In
general, if A and B are two events in the sample space S, then
A ∪ B (A union B) = 'either A or B occurs or both occur'
A ∩ B (A intersection B) = 'both A and B occur'
A ⊂ B (A is a subset of B) = 'if A occurs, so does B'
A' = 'event A does not occur'
∅ (the empty set) = an impossible event
S (the sample space) = an event that is certain to occur

Example
Experiment: rolling a die once -
Sample space S = {1,2,3,4,5,6}
Events A = 'score < 4' = {1,2,3}
B = 'score is even' = {2,4,6}
C = 'score is 7' = ∅
A ∪ B = 'the score is < 4 or even or both' = {1,2,3,4,6}
A ∩ B = 'the score is < 4 and even' = {2}
A' = 'event A does not occur' = {4,5,6}

Relative Frequency
Relative frequency is another term for proportion; it is the value
calculated by dividing the number of times an event occurs by the
total number of times an experiment is carried out. The probability
of an event can be thought of as its long-run relative frequency
when the experiment is carried out many times.
If an experiment is repeated n times, and event E occurs r times,
then the relative frequency of the event E is defined to be
rfn(E) = r/n

Example
Experiment: Tossing a fair coin 50 times (n = 50)
Event E = 'heads'
Result: 30 heads, 20 tails, so r = 30

Relative frequency: rfn(E) = r/n = 30/50 = 3/5 = 0.6


If an experiment is repeated many, many times without changing
the experimental conditions, the relative frequency of any particular
event will settle down to some value. The probability of the event
can be defined as the limiting value of the relative frequency:

P(E) = lim (n → ∞) rfn(E)
For example, in the above experiment, the relative frequency of the
event 'heads' will settle down to a value of approximately 0.5 if the
experiment is repeated many more times.
Probability

A probability provides a quantitative description of the likely occurrence of a particular event. Probability is conventionally expressed on a scale from 0 to 1; a rare event has a probability close to 0, a very common event has a probability close to 1.

The probability of an event has been defined as its long-run relative frequency. It has also been thought of as a personal degree of belief that a particular event will occur (subjective probability).

In some experiments, all outcomes are equally likely. For example if you were to choose one winner in a raffle from a hat, all raffle ticket holders are equally likely to win, that is, they have the same probability of their ticket being chosen. This is the equally-likely outcomes model and is defined to be:

P(E) = ( number of outcomes corresponding to event E ) / ( total number of outcomes )
Examples
1. The probability of drawing a spade from a pack of 52 well-
shuffled playing cards is 13/52 = 1/4 = 0.25 since
event E = 'a spade is drawn';
the number of outcomes corresponding to E = 13 (spades);
the total number of outcomes = 52 (cards).
2. When tossing a coin, we assume that the results 'heads' or
'tails' each have equal probabilities of 0.5.

Subjective Probability
A subjective probability describes an individual's personal
judgement about how likely a particular event is to occur. It is not
based on any precise computation but is often a reasonable
assessment by a knowledgeable person.

Like all probabilities, a subjective probability is conventionally expressed on a scale from 0 to 1; a rare event has a subjective probability close to 0, a very common event has a subjective probability close to 1.

A person's subjective probability of an event describes his/her degree of belief in the event.

Example
A Rangers supporter might say, "I believe that Rangers have
probability of 0.9 of winning the Scottish Premier Division this year
since they have been playing really well."

Independent Events
Two events are independent if the occurrence of one of the events
gives us no information about whether or not the other event will
occur; that is, the events have no influence on each other.

In probability theory we say that two events, A and B, are independent if the probability that they both occur is equal to the product of the probabilities of the two individual events, i.e.
P(A ∩ B) = P(A) . P(B)
The idea of independence can be extended to more than two events. For example, A, B and C are independent if:
a. A and B are independent; A and C are independent and B and C are independent (pairwise independence);
b. P(A ∩ B ∩ C) = P(A) . P(B) . P(C)
If two events are independent then they cannot be mutually exclusive (disjoint) and vice versa.

Example
Suppose that a man and a woman each have a pack of 52 playing
cards. Each draws a card from his/her pack. Find the probability
that they each draw the ace of clubs.
We define the events:
A = probability that man draws ace of clubs = 1/52
B = probability that woman draws ace of clubs = 1/52
Clearly events A and B are independent so:
P(A ∩ B) = P(A) . P(B) = 1/52 . 1/52 = 0.00037
That is, there is a very small chance that the man and the woman
will both draw the ace of clubs.
Conditional Probability

In many situations, once more information becomes available, we are able to revise our estimates for the probability of further outcomes or events happening. For example, suppose you go out for lunch at the same place and time every Friday and you are served lunch within 15 minutes with probability 0.9. However, given that you notice that the restaurant is exceptionally busy, the probability of being served lunch within 15 minutes may reduce to 0.7. This is the conditional probability of being served lunch within 15 minutes given that the restaurant is exceptionally busy.

The usual notation for "event A occurs given that event B has
occurred" is "A | B" (A given B). The symbol | is a vertical line and
does not imply division. P(A | B) denotes the probability that event
A will occur given that event B has occurred already.

A rule that can be used to determine a conditional probability from unconditional probabilities is:
P(A | B) = P(A ∩ B) / P(B)
where:
P(A | B) = the (conditional) probability that event A will occur given that event B has occurred already
P(A ∩ B) = the (unconditional) probability that event A and event B both occur
P(B) = the (unconditional) probability that event B occurs

Example:
When a fair die is tossed, the conditional probability of getting ‘1’, given that an odd number has been obtained, is equal to 1/3, as explained below:
S = {1,2,3,4,5,6}; A = {1,3,5}; B = {1}; A ∩ B = {1}
P(B | A) = P(A ∩ B) / P(A) = (1/6) / (1/2) = 1/3

Multiplication rule for dependent events:
The probability of simultaneous occurrence of two events A and B is equal to the product of the probability of one of the events by the conditional probability of the other, given that the first one has already occurred:
P(A ∩ B) = P(A) . P(B | A)

Example:
From a pack of cards, 2 cards are drawn in succession one after the other. After every draw, the selected card is not replaced. What is the probability that in both the draws you will get spades?

Solution:
Let A = getting a spade in the first draw
Let B = getting a spade in the second draw.
The cards are not replaced. This situation requires the use of conditional probability.
P(A) = 13/52
P(B | A) = 12/51
P(A ∩ B) = P(A) . P(B | A) = (13/52)(12/51) = 1/17

Mutually Exclusive Events


Two events are mutually exclusive (or disjoint) if it is impossible for them to occur together.
Formally, two events A and B are mutually exclusive if and only if
P(A ∩ B) = 0
If two events are mutually exclusive, they cannot be independent and vice versa.
Examples
1. Experiment: Rolling a die once
Sample space S = {1,2,3,4,5,6}
Events A = 'observe an odd number' = {1,3,5}
B = 'observe an even number' = {2,4,6}
A ∩ B = ∅ (the empty set), so A and B are mutually exclusive.

2. A subject in a study cannot be both male and female, nor can they be aged 20 and 30. A subject could however be both male and 20, or both female and 30.
Addition Rule
The addition rule is a result used to determine the probability that
event A or event B occurs or both occur.
The result is often written as follows, using set notation:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∪ B) = probability that event A or event B occurs
P(A ∩ B) = probability that event A and event B both occur
For mutually exclusive events, that is events which cannot occur together:
P(A ∩ B) = 0
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B)
For independent events, that is events which have no influence on each other:
P(A ∩ B) = P(A) . P(B)
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B) - P(A) . P(B)
Example
Suppose we wish to find the probability of drawing either a king or a
spade in a single draw from a pack of 52 playing cards.
We define the events A = 'draw a king' and B = 'draw a spade'
Since there are 4 kings in the pack and 13 spades, but 1 card is
both a king and a spade, we have:
P(A ∪ B) = 4/52 + 13/52 - 1/52 = 16/52
So, the probability of drawing either a king or a spade is 16/52 (=
4/13).
See also multiplication rule.
Multiplication Rule
The multiplication rule is a result used to determine the probability
that two events, A and B, both occur.
The multiplication rule follows from the definition of conditional probability.
The result is often written as follows, using set notation:
P(A ∩ B) = P(A | B) . P(B) = P(B | A) . P(A)
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∩ B) = probability that event A and event B both occur
P(A | B) = the conditional probability that event A occurs
given that event B has occurred already
P(B | A) = the conditional probability that event B occurs
given that event A has occurred already

For independent events, that is events which have no influence on one another, the rule simplifies to:
P(A ∩ B) = P(A) . P(B)
That is, the probability of the joint events A and B is equal to the product of the individual probabilities for the two events.
Multiplication rule for independent events:
Example:
The probability that you will get an A grade in Quantitative Methods is 0.7. The probability that you will get an A grade in Marketing is 0.5. Assuming these two courses are independent, compute the probability that you will get an A grade in both these subjects.
Solution:
Let A = getting an A grade in Quantitative Methods
Let B = getting an A grade in Marketing
It is given that A and B are independent.
Applying the formula, we get P(A and B) = P(A).P(B) = 0.7 * 0.5 = 0.35

Conditional Probability
In many situations, once more information becomes available, we
are able to revise our estimates for the probability of further
outcomes or events happening. For example, suppose you go out
for lunch at the same place and time every Friday and you are
served lunch within 15 minutes with probability 0.9. However, given
that you notice that the restaurant is exceptionally busy, the
probability of being served lunch within 15 minutes may reduce to
0.7. This is the conditional probability of being served lunch within
15 minutes given that the restaurant is exceptionally busy.

The usual notation for "event A occurs given that event B has
occurred" is "A | B" (A given B). The symbol | is a vertical line and
does not imply division. P(A | B) denotes the probability that event
A will occur given that event B has occurred already.
mca-5 251

A rule that can be used to determine a conditional probability from
unconditional probabilities is:

P(A | B) = P(A and B) / P(B)

where:
P(A | B) = the (conditional) probability that event A will occur
given that event B has occurred already
P(A and B) = the (unconditional) probability that event A and
event B both occur
P(B) = the (unconditional) probability that event B occurs
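A small sketch of the rule, with made-up probabilities (the numbers below are illustrative, not from the text):

```python
# Conditional probability: P(A | B) = P(A and B) / P(B).
p_a_and_b = 0.12  # assumed joint probability of A and B
p_b = 0.4         # assumed probability of B

p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 0.3
```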

Law of Total Probability


The result is often written as follows, using set notation:

P(A) = P(A and B) + P(A and B')

where:
P(A) = probability that event A occurs
P(A and B) = probability that event A and event B both occur
P(A and B') = probability that event A and event B' both occur,
i.e. A occurs and B does not.
Using the multiplication rule, this can be expressed as
P(A) = P(A | B).P(B) + P(A | B').P(B')
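The law can be sketched with the earlier lunch example. Note that P(B) (the restaurant being busy) and P(A | B') below are assumed numbers for illustration; only P(A | B) = 0.7 comes from the text:

```python
# Law of Total Probability: P(A) = P(A | B)*P(B) + P(A | B')*P(B').
p_busy = 0.25            # assumed P(B): restaurant is exceptionally busy
p_served_busy = 0.7      # P(A | B), from the text
p_served_not_busy = 0.9  # assumed P(A | B'): served quickly when not busy

# Weight each conditional probability by the probability of its condition.
p_served = p_served_busy * p_busy + p_served_not_busy * (1 - p_busy)
print(p_served)  # 0.85
```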

Bayes' Theorem
Bayes' Theorem is a result that allows new information to be used
to update the conditional probability of an event.
Using the multiplication rule gives Bayes' Theorem in its simplest
form:

P(A | B) = P(B | A).P(A) / P(B)

Using the Law of Total Probability:

           P(B | A).P(A)
P(A | B) = -------------------------------
           P(B | A).P(A) + P(B | A').P(A')
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A') = probability that event A does not occur
P(A | B) = probability that event A occurs given that event B
has occurred already
P(B | A) = probability that event B occurs given that event A
has occurred already
P(B | A') = probability that event B occurs given that event A
has not occurred already
Example:
A manufacturing firm is engaged in the production of steel
pipes in its three plants with a daily production of 1000,
1500 and 2500 units respectively. According to past
experience, it is known that the fractions of defective pipes
produced by the three plants are respectively 0.04, 0.09 and
0.07. If a pipe is selected from a day's total production and
found to be defective, find out:
a) What is the probability of a defective pipe?
b) What is the probability that it has come from the second plant?
Solution:
Let the probabilities of the possible events be:
Probability that a pipe is manufactured in plant A =
P(E1) = 1000/(1000+1500+2500) = 0.2
Probability that a pipe is manufactured in plant B =
P(E2) = 1500/(1000+1500+2500) = 0.3
Probability that a pipe is manufactured in plant C =
P(E3) = 2500/(1000+1500+2500) = 0.5
Let P(D) be the probability that a defective pipe is drawn.
Given that the proportions of the defective pipes coming
from the three plants are 0.04, 0.09 and 0.07 respectively,
these are, in fact, the conditional probabilities:
P(D/E1) = 0.04; P(D/E2) = 0.09 and P(D/E3) = 0.07
Now we can multiply prior probabilities and conditional
probabilities in order to obtain the joint probabilities.
Joint probabilities are:
Plant A = 0.04 x 0.2 = 0.008
Plant B = 0.09 x 0.3 = 0.027
Plant C = 0.07 x 0.5 = 0.035
Now we can obtain posterior probabilities by the following
calculations:
Plant A = P(E1/D) = 0.008/(0.008+0.027+0.035) = 0.114
Plant B = P(E2/D) = 0.027/(0.008+0.027+0.035) = 0.386
Plant C = P(E3/D) = 0.035/(0.008+0.027+0.035) = 0.500
Computation of posterior probabilities

Event  Prior P(Ei)  Conditional P(D/Ei)  Joint probability   Posterior P(Ei/D)
E1     0.2          0.04                 0.04 x 0.2 = 0.008  0.008/0.07 = 0.11
E2     0.3          0.09                 0.09 x 0.3 = 0.027  0.027/0.07 = 0.39
E3     0.5          0.07                 0.07 x 0.5 = 0.035  0.035/0.07 = 0.50
Total  1.0                               P(D) = 0.07         1.00

On the basis of these calculations we can say that:
a) most probably the defective pipe has come from plant C;
b) the probability that the defective pipe has come from the
second plant is 0.39.
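The whole table can be reproduced in a few lines of Python (the dictionary structure is ours):

```python
# Posterior probabilities for the steel-pipe example via Bayes' Theorem.
priors = {"A": 0.2, "B": 0.3, "C": 0.5}           # P(Ei)
defect_rates = {"A": 0.04, "B": 0.09, "C": 0.07}  # P(D | Ei)

# Joint probabilities: P(D and Ei) = P(D | Ei) * P(Ei)
joint = {plant: defect_rates[plant] * priors[plant] for plant in priors}
p_defective = sum(joint.values())  # P(D), by the Law of Total Probability

# Posterior probabilities: P(Ei | D) = P(D and Ei) / P(D)
posterior = {plant: joint[plant] / p_defective for plant in joint}

print(round(p_defective, 3))  # 0.07
print({k: round(v, 3) for k, v in posterior.items()})  # A: 0.114, B: 0.386, C: 0.5
```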

Prior probability vs. posterior probability


We have seen in the foregone table that as any additional
information becomes available, it can be used to revise the
prior probability. The revised probability is called the
posterior probability. Management should know how to use
the additional information; it should also assess the utility or
worth of the additional information. It may, at times find that

the cost of obtaining the additional information is more than


its actual worth. In such cases, obviously it is not advisable
to go in for any additional information and management
should be satisfied with the prior probabilities.

Example 1:
The Monty Hall problem
We are presented with three doors - red, green, and blue - one of
which has a prize. We choose the red door, which is not opened
until the presenter performs an action. The presenter, who knows
what door the prize is behind, and who must open a door, but is not
permitted to open the door we have picked or the door with the
prize, opens the green door and reveals that there is no prize
behind it, and subsequently asks if we wish to change our mind
about our initial selection of red. What is the probability that the
prize is behind the blue door? Behind the red door?
Let us call the situation that the prize is behind a given door Ar, Ag,
and Ab.
To start with, P(Ar) = P(Ag) = P(Ab) = 1/3, and to make things
simpler we shall assume that we have already picked the red door.
Let us call B "the presenter opens the green door". Without any
prior knowledge, we would assign this a probability of 50%, that is,
P(B) = 1/2.
 In the situation where the prize is behind the red door, the
host is free to pick between the green or the blue door at
random. Thus, P(B | Ar) = 1 / 2
 In the situation where the prize is behind the green door, the
host must pick the blue door. Thus, P(B | Ag) = 0
 In the situation where the prize is behind the blue door, the
host must pick the green door. Thus, P(B | Ab) = 1
Thus, by Bayes' Theorem,
P(Ar | B) = P(B | Ar).P(Ar) / P(B) = (1/2 x 1/3) / (1/2) = 1/3
P(Ab | B) = P(B | Ab).P(Ab) / P(B) = (1 x 1/3) / (1/2) = 2/3
Note how this depends on the value of P(B).
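A Monte Carlo check of this analysis. This is a sketch; the trial count and the seed are arbitrary choices. Conditional on the host opening the green door, the prize should be behind red about 1/3 of the time and behind blue about 2/3 of the time:

```python
import random

random.seed(0)
doors = ["red", "green", "blue"]

opened_green = 0
prize_red = 0
prize_blue = 0
for _ in range(100_000):
    prize = random.choice(doors)
    # The host may open neither our door (red) nor the prize door.
    options = [d for d in doors if d != "red" and d != prize]
    host_opens = random.choice(options)
    if host_opens == "green":
        opened_green += 1
        if prize == "red":
            prize_red += 1
        elif prize == "blue":
            prize_blue += 1

print(prize_red / opened_green)   # close to 1/3
print(prize_blue / opened_green)  # close to 2/3
```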

SOLVE:
1. In a software test environment holding software developed to the
J2EE specification, a downtime analysis was done. Based on the
100 earlier records it was found that there is about 5% downtime
per day. A study of the components involved in the environment
shows that problems in WebSphere cause errors, of which 25%
lead to downtime. If there are issues in the operating system, 40% of
the issues lead to a downtime, and again 20% of the problems in the
network lead to a downtime. Given that there is a downtime, find the
probability that each of the above reasons could have contributed
to the downtime between themselves (considering just these 3
reasons).

Solutions:
Let the occurrence of a downtime be D.
P(D) = 5% = .05
Let the occurrence of a WebSphere error be W.
Probability of a WebSphere error causing downtime P(D/W) = .25
Let the occurrence of an operating system error be O.
Probability of an OS error causing downtime P(D/O) = .4
Let the occurrence of a network error be N.
Probability of N causing downtime P(D/N) = .2
P(W/D) = ?
P(O/D) = ?
P(N/D) = ?

2. In a bolt factory, machines A1, A2 and A3 manufacture
respectively 25%, 35% and 40% of the output. Of these, 5, 4 and 2
percent respectively are defective bolts. A bolt is drawn at random
from the product and is found to be defective. What is the
probability that it was manufactured by machine A2?

3. Suppose there is a chance for a newly constructed building
to collapse whether the design is faulty or not. The chance
that the design is faulty is 10%. The chance that the building
collapses is 95% if the design is faulty and otherwise it is
45%. It is seen that the building collapsed. What is the
probability that it is due to faulty design?

Summary:
This unit provides a conceptual framework on probability concepts
with examples. Specifically this unit is focused on:
 The meaning and definition of the term probability interwoven
with other associated terms - event, experiment and sample
space.
 The three types of probability - classical probability, statistical
probability and subjective probability.
 The concept of mutually exclusive events and independent
events.
 The rules for calculating probability, which include the addition
rule for mutually exclusive events and non-mutually exclusive
events, the multiplication rule for independent events and
dependent events, and conditional probability.
 The application of Bayes' theorem in management.

Have you understood?


1. An urn contains 75 marbles: 35 are blue and 25 of these
blue marbles are swirled. The rest are red and 30 of the red
ones are swirled (those not swirled are clear ones). What is the
probability of drawing:
a) a blue marble?
b) a clear marble?
c) a blue swirled marble?
d) a red clear marble?
e) a swirled marble?
2. Two persons X and Y appear in an interview for two
vacancies in the same post. The probability of X's selection is
1/5 and that of Y's selection is 1/3. What is the probability
that: 1) both X and Y will be selected? 2) Only one of them will
be selected? 3) None of them will be selected?

3. A company is to appoint a person as its managing director
who must be an M.Tech, an M.B.A and a C.A. The probabilities
of which are one in twenty five, one in forty and one in fifty
respectively. Find the probability of getting such a person to
be appointed by the company.

4. A sample of 500 respondents was selected in a large
metropolitan area to determine various information
concerning consumer behaviour. Among the questions asked
was, "Do you enjoy shopping for clothing?" Of 240 males, 136
answered yes. Of 260 females, 224 answered yes.
a) Set up a 2x2 table to evaluate the probabilities
b) Give an example of a simple event
c) Give an example of a joint event
d) What is the complement of "enjoys shopping for clothing"?
What is the probability that a respondent chosen at random
e) is a male?
f) enjoys shopping for clothing?
g) is a female and enjoys shopping for clothing?
h) is a male and does not enjoy shopping for clothing?
i) is a female or enjoys shopping for clothing?

5. A municipal bond service has three rating categories (A, B and
C). Suppose that in the past year, of the municipal bonds issued
throughout the United States, 70% were rated A, 20% were rated
B, and 10% were rated C. Of the municipal bonds rated A, 50% were
issued by cities, 40% by suburbs, and 10% by rural areas. Of the
municipal bonds rated C, 90% were issued by cities, 5% by
suburbs, and 5% by rural areas.
a) If a new municipal bond is to be issued by a city, what is the
probability it will receive an A rating?
b) What proportion of municipal bonds is issued by cities?
c) What proportion of municipal bonds is issued by suburbs?

6. An advertising executive is studying television viewing habits of
married men and women during prime time hours. On the basis of
past viewing records, the executive has determined that during
prime time, husbands are watching television 60% of the time. It
has also been determined that when the husband is watching
television, 40% of the time the wife is also watching. When the
husband is not watching television, 30% of the wives are watching.
Find the probability that
a) if the wife is watching television, the husband is also watching
television.
b) the wives are watching television in prime time.
7)In the past several years, credit card companies have made an
aggressive effort to solicit new accounts from college students.
Suppose that a sample of 200 students at your college indicated
the following information as to whether the student possessed a
bank credit card and/or a travel and entertainment credit card.

8) A software company develops banking software where a
performance variable depends on the number of customer
accounts. The list given below provides the statistics of 150 of the
clients using the software along with the performance variable
group.

Total customer accounts   Performance variable group   No. of clients using
0-50                      A                            7
25-100                    B                            14
75-200                    C                            28
200-300                   D                            60
300-400                   E                            25
400-500                   F                            16

a) Find the probability that a client has < 200 user accounts.
b) Find the probability that the performance variable B is used.
c) Find the probability that a client chosen will fit in both variables
A and B, if the number of clients who can use A is 9 and the number
of clients who can use B is 19.

9) There are two men aged 30 and 36 years. The probability of
living 35 more years is 0.67 for the 30-year-old person and 0.60 for
the 36-year-old person. Find the probability that at least one of
these persons will be alive 35 years hence.

10) The probability that a contractor will get a plumbing contract is
2/3, and the probability that he will not get an electric contract is
5/9. If the probability of getting at least one contract is 4/5, what is
the probability that he will get both contracts?












2
PROBABILITY DISTRIBUTION
INTRODUCTION

In unit1, we encountered some experiments where the outcomes


were categorical. We found that an experiment results in a number
of possible outcomes and discussed how the probability of the
occurrence of an outcome can be determined. In this unit, we shall
extend our discussion of probability theory. Our focus is on the
probability distribution, which describes how probability is spread
over the possible numerical values associated with the outcomes.

LEARNING OBJECTIVES

After reading this unit, you will be able to:

Define random variables
Appreciate what a probability distribution is
Explain and use the binomial distribution
Explain and use the Poisson distribution
Explain and use the uniform distribution
Explain and use the normal distribution

RANDOM VARIABLE

A random variable is an abstraction of the intuitive concept of


chance into the theoretical domains of mathematics, forming the
foundations of probability theory and mathematical statistics.

The theory and language of random variables were formalized over


the last few centuries alongside ideas of probability. Full familiarity
with all the properties of random variables requires a strong
background in the more recently developed concepts of measure
theory, but random variables can be understood intuitively at
various levels of mathematical fluency; set theory and calculus are
fundamentals.

Broadly, a random variable is defined as a quantity whose values


are random and to which a probability distribution is assigned. More

formally, a random variable is a measurable function from a sample


space to the measurable space of possible values of the variable.
The formal definition of random variables places experiments
involving real-valued outcomes firmly within the measure-theoretic
framework and allows us to construct distribution functions of real-
valued random variables.

WHAT IS A RANDOM VARIABLE?

A Random Variable is a function which assigns unique numerical
values to all possible outcomes of a random experiment under fixed
conditions (Ali 2000). A random variable is not a variable but rather
a function that maps events to numbers (Wikipedia 2006).
Example 1
This example is extracted from (Ali 2000). Suppose that a coin is
tossed three times and the sequence of heads and tails is noted.
The sample space for this experiment evaluates to: S={HHH, HHT,
HTH, HTT, THH, THT, TTH, TTT}. Now let the random variable X
be the number of heads in three coin tosses. X assigns each
outcome in S a number from the set Sx={0, 1, 2, 3}. The table
below lists the eight outcomes of S and the corresponding values of
X.
Outcome  HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
X        3    2    2    2    1    1    1    0
X is then a random variable taking on values in the set SX = {0, 1,
2, 3}.
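The table can be reproduced by enumerating the sample space in Python:

```python
from itertools import product

# Enumerate the sample space S of three coin tosses and the random
# variable X = number of heads, reproducing the table above.
S = ["".join(t) for t in product("HT", repeat=3)]
X = {outcome: outcome.count("H") for outcome in S}

print(len(S))                   # 8 outcomes
print(X["HHH"], X["TTT"])       # 3 0
print(sorted(set(X.values())))  # [0, 1, 2, 3], the set SX of values of X
```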
Mathematically, a random variable is defined as a measurable
function from a probability space to some measurable space
(Wikipedia 2006). This measurable space is the space of possible
values of the variable, and it is usually taken to be the real numbers
(Wikipedia 2006).
The condition for a function to be a random variable is that the
random variable cannot be multivalued (Ali 2000).
There are three types of random variables:

 A Continuous Random Variable is one that takes an infinite


number of possible values (Ali 2000). Example: Duration of a
call in a telephone exchange.
 A Discrete Random Variable is one that takes a finite number of
distinct values (Ali 2000). Example: the number of students who
fail a test.
 A Mixed Random Variable is one for which some of its
values are continuous and some are discrete (Ali 2000).

Problem 1

Can measurements of power (in dB) received from an antenna be


considered a random variable?

Solution 1
Yes. Specifically it should be considered as a continuous random
variable as the power of any signal attenuates through a
transmission line. The attenuation factors associated with each
transmission line are only approximate. Thus the power received
from the antenna can take any value.

In contrast, if a random variable represents a measurement on a
continuous scale so that all values in an interval are possible, it is
called a continuous random variable. In other words, a continuous
random variable is a random variable which can take any value
within some interval of real numbers. Examples of a continuous
random variable are the price of a car and daily consumption of
milk. Measurement of the height and weight of respondents is an
example of a continuous random variable. Similarly, voltage,
pressure, and temperature are examples of continuous random
variables.

Measures of location

Location
A fundamental task in many statistical analyses is to estimate a
location parameter for the distribution; i.e., to find a typical or
central value that best describes the data.

Definition of location

The first step is to define what we mean by a typical value. For


univariate data, there are three common definitions:

1. mean - the mean is the sum of the data points divided by the
number of data points. That is,

   mean = (X1 + X2 + ... + XN) / N

The mean is that value that is most commonly referred to as


the average. We will use the term average as a synonym for
the mean and the term typical value to refer generically to
measures of location.

2. median - the median is the value of the point which has half
the data smaller than that point and half the data larger than
that point. That is, if X1, X2, ..., XN is a random sample sorted
from smallest value to largest value, then the median is
defined as:

   median = X(N+1)/2 if N is odd
   median = (X(N/2) + X(N/2 + 1)) / 2 if N is even

3. mode - the mode is the value of the random sample that
occurs with the greatest frequency. It is not necessarily
unique. The mode is typically used in a qualitative fashion.
For example, there may be a single dominant hump in the
data or perhaps two or more smaller humps in the data. This is
usually evident from a histogram of the data.

When taking samples from continuous populations, we need to be


somewhat careful in how we define the mode. That is, any specific
value may not occur more than once if the data are continuous.
What may be a more meaningful, if less exact measure, is the
midpoint of the class interval of the histogram with the highest
peak.
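The three definitions can be sketched on a small made-up sample using Python's standard library:

```python
from collections import Counter
from statistics import mean, median

# A small illustrative sample (the numbers are ours).
data = [1, 2, 2, 3, 4, 7, 9]

m = mean(data)      # sum of the data points divided by their number
med = median(data)  # middle value of the sorted sample (N = 7 is odd)
mode_value = Counter(data).most_common(1)[0][0]  # most frequent value

print(m, med, mode_value)  # 4 3 2
```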

Why different measures?

A natural question is why we have more than one measure of the


typical value. The following example helps to explain why these
alternative definitions are useful and necessary.

This plot shows histograms for 10,000 random numbers generated


from a normal, an exponential, a Cauchy, and a lognormal
distribution.

Normal distribution

The first histogram is a sample from a normal distribution. The


mean is 0.005, the median is -0.010, and the mode is -0.144 (the
mode is computed as the midpoint of the histogram interval with the
highest peak).

The normal distribution is a symmetric distribution with
well-behaved tails and a single peak at the center of the
distribution. By
symmetric, we mean that the distribution can be folded about an
axis so that the 2 sides coincide. That is, it behaves the same to the
left and right of some center point. For a normal distribution, the
mean, median, and mode are actually equivalent. The histogram
above generates similar estimates for the mean, median, and
mode. Therefore, if a histogram or normal probability plot indicates
that your data are approximated well by a normal distribution, then
it is reasonable to use the mean as the location estimator.

Exponential distribution
The second histogram is a sample from an exponential distribution.
The mean is 1.001, the median is 0.684, and the mode is 0.254
(the mode is computed as the midpoint of the histogram interval
with the highest peak).

The exponential distribution is a skewed, i. e., not symmetric,


distribution. For skewed distributions, the mean and median are not
the same. The mean will be pulled in the direction of the skewness.
That is, if the right tail is heavier than the left tail, the mean will be
greater than the median. Likewise, if the left tail is heavier than the
right tail, the mean will be less than the median.

For skewed distributions, it is not at all obvious whether the mean,


the median, or the mode is the more meaningful measure of the
typical value. In this case, all three measures are useful.

Cauchy distribution
The third histogram is a sample from a Cauchy distribution. The
mean is 3.70, the median is -0.016, and the mode is -0.362 (the
mode is computed as the midpoint of the histogram interval with the
highest peak).

For better visual comparison with the other data sets, we restricted
the histogram of the Cauchy distribution to values between -10 and
10. The full Cauchy data set in fact has a minimum of
approximately -29,000 and a maximum of approximately 89,000.

The Cauchy distribution is a symmetric distribution with heavy tails


and a single peak at the center of the distribution. The Cauchy
distribution has the interesting property that collecting more data
does not provide a more accurate estimate of the mean. That is,
the sampling distribution of the mean is equivalent to the sampling

distribution of the original data. This means that for the Cauchy
distribution the mean is useless as a measure of the typical value.
For this histogram, the mean of 3.7 is well above the vast majority
of the data. This is caused by a few very extreme values in the tail.
However, the median does provide a useful measure for the typical
value.

Although the Cauchy distribution is an extreme case, it does


illustrate the importance of heavy tails in measuring the mean.
Extreme values in the tails distort the mean. However, these
extreme values do not distort the median since the median is based
on ranks. In general, for data with extreme values in the tails, the
median provides a better estimate of location than does the mean.
Lognormal distribution

The fourth histogram is a sample from a lognormal distribution. The


mean is 1.677, the median is 0.989, and the mode is 0.680 (the
mode is computed as the midpoint of the histogram interval with the
highest peak).
The lognormal is also a skewed distribution. Therefore the mean
and median do not provide similar estimates for the location. As
with the exponential distribution, there is no obvious answer to the
question of which is the more meaningful measure of location.

Robustness
There are various alternatives to the mean and median for
measuring location. These alternatives were developed to address
non-normal data since the mean is an optimal estimator if in fact
your data are normal.

Tukey and Mosteller defined two types of robustness where


robustness is a lack of susceptibility to the effects of nonnormality.

1. Robustness of validity means that the confidence intervals


for the population location have a 95% chance of covering
the population location regardless of what the underlying
distribution is.
2. Robustness of efficiency refers to high effectiveness in the
face of non-normal tails. That is, confidence intervals for the
population location tend to be almost as narrow as the best
that could be done if we knew the true shape of the
distribution.

The mean is an example of an estimator that is the best we can do


if the underlying distribution is normal. However, it lacks robustness
of validity. That is, confidence intervals based on the mean tend not
to be precise if the underlying distribution is in fact not normal.

The median is an example of an estimator that tends to have
robustness of validity but not robustness of efficiency.

The alternative measures of location try to balance these two


concepts of robustness. That is, the confidence intervals for the
case when the data are normal should be almost as narrow as the
confidence intervals based on the mean. However, they should
maintain their validity even if the underlying data are not normal. In
particular, these alternatives address the problem of heavy-tailed
distributions.

Alternative measures of location


A few of the more common alternative location measures are:

1. Mid-Mean - computes a mean using the data between the


25th and 75th percentiles.
2. Trimmed Mean - similar to the mid-mean except different
percentile values are used. A common choice is to trim 5%
of the points in both the lower and upper tails, i.e., calculate
the mean for data between the 5th and 95th percentiles.
3. Winsorized Mean - similar to the trimmed mean. However,
instead of trimming the points, they are set to the lowest (or
highest) value. For example, all data below the 5th percentile
are set equal to the value of the 5th percentile and all data
greater than the 95th percentile are set equal to the 95th
percentile.
4. Mid-range = (smallest + largest)/2.

The first three alternative location estimators defined above have


the advantage of the median in the sense that they are not unduly
affected by extremes in the tails. However, they generate estimates
that are closer to the mean for data that are normal (or nearly so).
The mid-range, since it is based on the two most extreme points, is
not robust. Its use is typically restricted to situations in which the
behavior at the extreme points is relevant.

Measures of Skewness and Kurtosis

Skewness and Kurtosis

A fundamental task in many statistical analyses is to characterize
the location and variability of a data set. A further characterization
of the data includes skewness and kurtosis.

Skewness is a measure of symmetry, or more precisely, the lack of


symmetry. A distribution, or data set, is symmetric if it looks the
same to the left and right of the center point.

Kurtosis is a measure of whether the data are peaked or flat


relative to a normal distribution. That is, data sets with high kurtosis
tend to have a distinct peak near the mean, decline rather rapidly,
and have heavy tails. Data sets with low kurtosis tend to have a flat
top near the mean rather than a sharp peak. A uniform distribution
would be the extreme case.

The histogram is an effective graphical technique for showing both


the skewness and kurtosis of data set.

Definition of skewness

For univariate data Y1, Y2, ..., YN, the formula for skewness is:

   skewness = Σ (Yi - Ybar)^3 / ((N - 1) s^3)

where Ybar is the mean, s is the standard deviation, and N is the
number of data points. The skewness for a normal distribution is
zero, and any symmetric data should have a skewness near zero.
Negative values for the skewness indicate data that are skewed left
and positive values for the skewness indicate data that are skewed
right. By skewed left, we mean that the left tail is long relative to the
right tail. Similarly, skewed right means that the right tail is long
relative to the left tail. Some measurements have a lower bound
and are skewed right. For example, in reliability studies, failure
times cannot be negative.

Definition of kurtosis

For univariate data Y1, Y2, ..., YN, the formula for kurtosis is:

   kurtosis = Σ (Yi - Ybar)^4 / ((N - 1) s^4)

where Ybar is the mean, s is the standard deviation, and N is the
number of data points.

The kurtosis for a standard normal distribution is three. For this
reason, some sources use the following definition of kurtosis:

   kurtosis = Σ (Yi - Ybar)^4 / ((N - 1) s^4) - 3

This definition is used so that the standard normal distribution has a
kurtosis of zero. In addition, with the second definition positive
kurtosis indicates a "peaked" distribution and negative kurtosis
indicates a "flat" distribution.
Which definition of kurtosis is used is a matter of convention. When
using software to compute the sample kurtosis, you need to be
aware of which convention is being followed.

Examples
The following example shows histograms for 10,000 random
numbers generated from a normal, a double exponential, a Cauchy,
and a Weibull distribution.
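The sample skewness and kurtosis can be sketched directly from the formulas above, using simulated normal data for which skewness should be near 0 and kurtosis near 3 (the sample size and seed are arbitrary):

```python
import math
import random

random.seed(1)
y = [random.gauss(0, 1) for _ in range(10_000)]  # simulated normal data

n = len(y)
ybar = sum(y) / n
s = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))  # std deviation

# Third and fourth standardized moments, as in the formulas above.
skewness = sum((v - ybar) ** 3 for v in y) / ((n - 1) * s ** 3)
kurtosis = sum((v - ybar) ** 4 for v in y) / ((n - 1) * s ** 4)

print(round(skewness, 2))  # near 0 for symmetric data
print(round(kurtosis, 2))  # near 3 for normal data
```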

Probability distribution:

A probability distribution is a total listing of the various
values the random variable can take, along with the
corresponding probability for each value. A real life example would
be the pattern of distribution of the machine breakdowns in a
manufacturing unit. The random variable in this example would be
the various values the machine breakdown could assume. The
probability corresponding to each value of the breakdown is the
relative frequency of occurrence of the breakdown. The probability
distribution for this example is constructed from the actual breakdown
pattern observed over a period of time.

1. A multinational bank is concerned about the waiting time of its
customers for using their ATMs. A study of a random sample of 500
customers reveals the following probability distribution:

x (waiting time/customer in minutes): 0    1    2    3    4    5    6    7    8
p(x):                                 .20  .18  .16  .12  .10  .09  .08  .04  .03

a) What is the probability that a customer will have to wait for more
than 5 minutes?
b) What is the probability that a customer need not wait?
c) What is the probability that a customer will have to wait for less
than 4 minutes?

Solution:
a) p(x > 5) = p(6) + p(7) + p(8) = .08 + .04 + .03 = .15
b) p(x = 0) = .20
c) p(x < 4) = p(0) + p(1) + p(2) + p(3) = .20 + .18 + .16 + .12 = .66
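The solution can be verified directly:

```python
# The ATM waiting-time distribution from the worked example.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
p = [0.20, 0.18, 0.16, 0.12, 0.10, 0.09, 0.08, 0.04, 0.03]
dist = dict(zip(x, p))

p_more_than_5 = sum(pr for xi, pr in dist.items() if xi > 5)  # a) P(X > 5)
p_no_wait = dist[0]                                           # b) P(X = 0)
p_less_than_4 = sum(pr for xi, pr in dist.items() if xi < 4)  # c) P(X < 4)

print(round(p_more_than_5, 2))  # 0.15
print(round(p_no_wait, 2))      # 0.2
print(round(p_less_than_4, 2))  # 0.66
```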

Types of probability distribution:

There are two types of probability distribution:
1. Discrete probability distribution
2. Continuous probability distribution

Discrete probability distribution
We have taken the above examples to explain the concept of a
discrete probability distribution.
1. Discrete Distributions

Discrete Densities

Suppose that we have a random experiment with sample space R,


and probability measure P. A random variable X for the experiment
that takes values in a countable set S is said to have a discrete
distribution. The (discrete) probability density function of X is
the function f from S to R defined by

f(x) = P(X = x) for x in S.

1. Show that f satisfies the following properties:

a. f(x) ≥ 0 for x in S.
b. Σx in S f(x) = 1
c. Σx in A f(x) = P(X ∈ A) for A ⊆ S.

Property (c) is particularly important since it shows that the


probability distribution of a discrete random variable is completely
determined by its density function. Conversely, any function that
satisfies properties (a) and (b) is a (discrete) density, and then
property (c) can be used to construct a discrete probability
distribution on S. Technically, f is the density of X relative to
counting measure on S.

Typically, S is a countable subset of some larger set, such as R^n for
some n. We can always extend f, if we want, to the larger set by
defining f(x) = 0 for x not in S. Sometimes this extension simplifies
formulas and notation.

An element x in S that maximizes the density f is called a mode of


the distribution. When there is only one mode, it is sometimes used
as a measure of the center of the distribution.

Interpretation

A discrete probability distribution is equivalent to a discrete mass


distribution, with total mass 1. In this analogy, S is the (countable)
set of point masses, and f(x) is the mass of the point at x in S.
Property (c) in Exercise 1 simply means that the mass of a set A
can be found by adding the masses of the points in A.

For a probabilistic interpretation, suppose that we create a new,


compound experiment by repeating the original experiment
indefinitely. In the compound experiment, we have independent
random variables X1, X2, ..., each with the same distribution as X
(these are "independent copies" of X). For each x in S, let

fn(x) = #{i ∈ {1, 2, ..., n}: Xi = x} / n,

the relative frequency of x in the first n runs (the number of times


that x occurred, divided by n). Note that for each x, fn(x) is a
random variable for the compound experiment. By the law of large
numbers, fn(x) should converge to f(x) as n increases. The function
fn is called the empirical density function; these functions are
displayed in most of the simulation applets that deal with discrete
variables.

Examples

2. Suppose that two fair dice are tossed and the sequence of
scores (X1, X2) recorded. Find the density function of

a. (X1, X2)
b. Y = X1 + X2, the sum of the scores
c. U = min{X1, X2}, the minimum score
d. V = max{X1, X2}, the maximum score
e. (U, V)

3. In the dice experiment, select n = 2 fair dice. Select the


following random variables and note the shape and location of the
density function. Run the experiment 1000 times, updating every 10
runs. For each variables, note the apparent convergence of the
empirical density function to the density function.

a. Sum of the scores.


b. Minimum score.
c. Maximum score.

4. An element X is chosen at random from a finite set S.

a. Show that X has probability density function f(x) = 1 / #(S) for


x in S.

b. Show that P(X ∈ A) = #(A) / #(S) for A ⊆ S.

The distribution in the last exercise is called the discrete uniform


distribution on S. Many random variables that arise in sampling or
combinatorial experiments are transformations of uniformly
distributed variables.

5. Suppose that n elements are chosen at random, without


replacement from a set D with N elements. Let X denote the
ordered sequence of elements chosen. Argue that X is uniformly
distributed on the set S of permutations of size n chosen from D:

P(X = x) = 1 / (N)n for each x in S, where (N)n = N(N - 1)···(N - n + 1).

6. Suppose that n elements are chosen at random, without


replacement, from a set D with N elements. Let W denote the
unordered set of elements chosen. Show that W is uniformly
distributed on the set T of combinations of size n chosen from D:

P(W = w) = 1 / C(N, n) for w in T.

7. An urn contains N balls; R are red and N - R are green. A


sample of n balls is chosen at random (without replacement). Let Y
denote the number of red balls in the sample. Show that Y has
probability density function.

P(Y = k) = C(R, k) C(N - R, n - k) / C(N, n) for k = 0, 1, ..., n.

The distribution defined by the density function in the last exercise


is the hypergeometric distribution with parameters N, R, and n.
The hypergeometric distribution is studied in detail in the chapter on
Finite Sampling Models, which contains a rich variety of
distributions that are based on discrete uniform distributions.

8. An urn contains 30 red and 20 green balls. A sample of 5 balls


is selected at random. Let Y denote the number of red balls in the
sample.

a. Compute the density function of Y explicitly.


b. Graph the density function and identify the mode(s).
c. Find P(Y > 3).
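Exercise 8 can be checked numerically with the hypergeometric density from Exercise 7 (a sketch; `hypergeom_pmf` is an ad-hoc helper, not a library call):

```python
from math import comb

def hypergeom_pmf(k, N, R, n):
    """P(Y = k): probability of k red balls in a sample of size n drawn
    without replacement from N balls, R of them red."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

# Exercise 8: N = 50 balls (30 red, 20 green), sample of n = 5
density = {k: hypergeom_pmf(k, 50, 30, 5) for k in range(6)}
mode = max(density, key=density.get)       # k with the largest probability
p_more_than_3 = density[4] + density[5]    # P(Y > 3)
```

The mode turns out to be 3, and P(Y > 3) is about 0.326.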

9. In the ball and urn experiment, select sampling without


replacement. Run the experiment 1000 times, updating every 10
runs, and note the apparent convergence of the empirical density
function of Y to the theoretical density function.

10. A coin with probability of heads p is tossed n times. For j = 1,


..., n, let Ij = 1 if the toss j is heads and Ij = 0 if toss j is tails. Show
that (I1, I2, ..., In) has probability density function

f(i1, i2, ..., in) = p^k (1 - p)^(n - k) for ij in {0, 1} for each j, where k = i1 +
i2 + ··· + in.

11. A coin with probability of heads p is tossed n times. Let X


denote the number of heads. Show that X has probability density
function

P(X = k) = C(n, k) p^k (1 - p)^(n - k) for k = 0, 1, ..., n.

The distribution defined by the density in the previous exercise is


called the binomial distribution with parameters n and p. The
binomial distribution is studied in detail in the chapter on Bernoulli
Trials.

12. Suppose that a coin with probability of heads p = 0.4 is


tossed 5 times. Let X denote the number of heads.

a. Compute the density function of X explicitly.


b. Graph the density function and identify the mode.
c. Find P(X > 3).
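Exercise 12 is a direct application of the binomial density from Exercise 11; a short numeric check (sketch code, helper name illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exercise 12: n = 5 tosses, probability of heads p = 0.4
density = {k: binom_pmf(k, 5, 0.4) for k in range(6)}
mode = max(density, key=density.get)
p_more_than_3 = density[4] + density[5]   # P(X > 3)
```

The mode is 2, and P(X > 3) = 0.08704.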

13. In the coin experiment, set n = 5 and p = 0.4. Run the


experiment 1000 times, updating every 10 runs, and note the
apparent convergence of the empirical density function of X to the
density function.

14. Let ft(n) = exp(-t) t^n / n! for n = 0, 1, 2, ..., where t > 0 is a
parameter.

a. Show that ft is a probability density function for each t > 0.


b. Show that ft(n) > ft(n - 1) if and only if n < t.
c. Show that the mode occurs at floor(t) if t is not an integer,
and at t -1 and t if t is an integer.

The distribution defined by the density in the previous exercise is


the Poisson distribution with parameter t, named after Simeon
Poisson. The Poisson distribution is studied in detail in the Chapter
on Poisson Processes, and is used to model the number of
"random points" in a region of time or space. The parameter t is
proportional to the size of the region of time or space.

15. Suppose that the number of misprints N on a web page has


the Poisson distribution with parameter 2.5.

a. Find the mode.



b. Find P(N > 4).
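Exercise 15 can be answered with the Poisson density from Exercise 14; a quick numeric sketch:

```python
from math import exp, factorial, floor

def poisson_pmf(n, t):
    """f_t(n) = exp(-t) t^n / n! for N ~ Poisson(t)."""
    return exp(-t) * t**n / factorial(n)

t = 2.5
# t is not an integer, so by Exercise 14(c) the mode is floor(t) = 2
mode = floor(t)
# P(N > 4) via the complement of P(N <= 4)
p_more_than_4 = 1 - sum(poisson_pmf(n, t) for n in range(5))
```

This gives a mode of 2 and P(N > 4) ≈ 0.109.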

16. In the Poisson process, select parameter 2.5. Run the


simulation 1000 times updating every 10 runs. Note the apparent
convergence of the empirical density function to the true density
function.

17. In the die-coin experiment, a fair die is rolled and then a fair
coin is tossed the number of times shown on the die. Let I denote
the sequence of coin results (0 for tails, 1 for heads). Find the
density of I (note that I takes values in a set of sequences of
varying lengths).

Constructing Densities

18. Suppose that g is a nonnegative function defined on a


countable set S and that

c = Σx∈S g(x).

Show that if c is positive and finite, then f(x) = g(x) / c for x in S


defines a discrete density function on S.

The constant c in the last exercise is sometimes called the


normalizing constant. This result is useful for constructing density
functions with desired functional properties (domain, shape,
symmetry, and so on).

19. Let g(x) = x^2 for x in {-2, -1, 0, 1, 2}.

a. Find the probability density function f that is proportional to g.


b. Graph the density function and identify the modes.
c. Find P(X ∈ {-1, 1, 2}) where X is a random variable with the
density in (a).
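Exercises 18 and 19 together suggest a small helper (a sketch; `normalize` is an illustrative name):

```python
def normalize(g, support):
    """Density f(x) = g(x) / c on the support, where c = sum of g over the support."""
    c = sum(g(x) for x in support)
    if not (0 < c < float("inf")):
        raise ValueError("normalizing constant must be positive and finite")
    return {x: g(x) / c for x in support}

# Exercise 19: g(x) = x^2 on {-2, -1, 0, 1, 2}, so c = 4 + 1 + 0 + 1 + 4 = 10
f = normalize(lambda x: x * x, [-2, -1, 0, 1, 2])
p = f[-1] + f[1] + f[2]    # P(X in {-1, 1, 2}) = (1 + 1 + 4) / 10
```

The modes are -2 and 2, each with probability 0.4, and P(X ∈ {-1, 1, 2}) = 0.6.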

20. Let g(n) = q^n for n = 0, 1, 2, ..., where q is a parameter in (0,1).

a. Find the probability density function f that is proportional to g.


b. Find P(X < 2) where X is a random variable with the density
in (a).
c. Find the probability that X is even.

The distribution constructed in the last exercise is a version of the


geometric distribution, and is studied in detail in the chapter on
Bernoulli Trials.
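For Exercise 20, the normalizing constant is the geometric series c = 1/(1 - q), so f(n) = (1 - q) q^n. A numeric check with an assumed illustrative value q = 0.5:

```python
q = 0.5   # illustrative choice of the parameter in (0, 1)
f = lambda n: (1 - q) * q**n               # geometric density from Exercise 20(a)

p_less_than_2 = f(0) + f(1)                # P(X < 2) = 1 - q^2
p_even = sum(f(2 * k) for k in range(60))  # P(X even); converges to 1 / (1 + q)
```

With q = 0.5 this gives P(X < 2) = 0.75 and P(X even) = 2/3, matching the closed forms 1 - q² and 1/(1 + q).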

21. Let g(x, y) = x + y for (x, y) ∈ {0, 1, 2}^2.



a. Find the probability density function f that is proportional to g.


b. Find the mode of the distribution.
c. Find P(X > Y) where (X, Y) is a random vector with the
density in (a).

22. Let g(x, y) = xy for (x, y) ∈ {(1, 1), (1, 2), (1, 3), (2, 2), (2, 3),
(3, 3)}.

a. Find the probability density function f that is proportional to g.


b. Find the mode of the distribution.
c. Find P((X, Y) ∈ {(1, 2), (1, 3), (2, 2), (2, 3)}) where (X, Y) is a
random vector with the density in (a).

Conditional Densities

The density function of a random variable X is based, of course, on


the underlying probability measure P on the sample space R for the
experiment. This measure could be a conditional probability
measure, conditioned on a given event E (with P(E) > 0). The usual
notation is

f(x | E) = P(X = x | E) for x in S.

The following exercise shows that, except for notation, no new


concepts are involved. Therefore, all results that hold for densities
in general have analogues for conditional densities.

23. Show that as a function of x for fixed E, f(x | E) is a discrete


density function. That is, show that it satisfies properties (a) and (b)
of Exercise 1, and show that property (c) becomes

P(X ∈ A | E) = Σx∈A f(x | E) for A ⊆ S.

24. Suppose that B ⊆ S and P(X ∈ B) > 0. Show that the
conditional density of X given X ∈ B is

a. f(x | X ∈ B) = f(x) / P(X ∈ B) for x ∈ B.

b. f(x | X ∈ B) = 0 if x ∈ B^c.

25. Suppose that X is uniformly distributed on a finite set S and


that B is a nonempty subset of S. Show that the conditional
distribution of X given X ∈ B is uniform on B.

26. Suppose that X has probability density function f(x) = x^2 / 10


for x = -2, -1, 0, 1, 2. Find the conditional density of X given that X >
0.

27. A pair of fair dice are rolled. Let Y denote the sum of the
scores and U the minimum score. Find the conditional density of U
given Y = 8.
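Exercise 27 can be solved by brute-force enumeration of the 36 equally likely outcomes (a sketch; variable names are ours):

```python
from itertools import product
from collections import Counter

outcomes = list(product(range(1, 7), repeat=2))       # two fair dice
event = [(a, b) for a, b in outcomes if a + b == 8]   # the event Y = 8

# Conditional density of U = min score given Y = 8
counts = Counter(min(a, b) for a, b in event)
f_U_given_Y8 = {u: c / len(event) for u, c in counts.items()}
```

The five outcomes with sum 8 give P(U = 2 | Y = 8) = P(U = 3 | Y = 8) = 2/5 and P(U = 4 | Y = 8) = 1/5.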

28. Run the dice experiment 200 times, updating after every run.
Compute the empirical conditional density of U given Y = 8 and
compare with the conditional density in the last exercise.

Law of Total Probability and Bayes' Theorem

Suppose that X is a discrete random variable taking values in a


countable set S, and that B be an event in the experiment (that is, a
subset of the underlying sample space R).

29. Prove the law of total probability:

P(B) = Σx∈S P(X = x) P(B | X = x).

This result is useful, naturally, when the distribution of X and the


conditional probability of B given the values of X are known. We
sometimes say that we are conditioning on X.

30. Prove Bayes' Theorem, named after Thomas Bayes:

P(X = x | B) = P(X = x) P(B | X = x) / Σy∈S P(X = y) P(B | X = y)
for x in S.

Bayes' theorem is a formula for the conditional density of X given B.


As with the law of total probability, it is useful, when the quantities
on the right are known. The (unconditional) distribution of X is
referred to as the prior distribution and the conditional density as
the posterior density.

31. In the die-coin experiment, a fair die is rolled and then a fair
coin is tossed the number of times showing on the die.

a. Find the probability that there will be exactly two heads.


b. Given that there were 2 heads, find the conditional density of
the die score.
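Exercise 31 combines the law of total probability with Bayes' theorem; a numeric sketch:

```python
from math import comb

prior = {d: 1 / 6 for d in range(1, 7)}   # fair die score

def p_two_heads(d):
    """P(exactly 2 heads | die shows d): fair coin tossed d times."""
    return comb(d, 2) * 0.5**d            # comb(1, 2) == 0, so d = 1 contributes nothing

# a. Law of total probability
p_B = sum(prior[d] * p_two_heads(d) for d in prior)
# b. Bayes' theorem: posterior density of the die score given 2 heads
posterior = {d: prior[d] * p_two_heads(d) / p_B for d in prior}
```

The probability of exactly two heads works out to 33/128 ≈ 0.258, and the posterior puts equal (largest) weight on die scores 3 and 4.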

32. Run the die-coin experiment 200 times, updating after each
run.

a. Compute the empirical probability of exactly two heads and


compare with the probability in the last exercise.

b. Compute the empirical conditional density of the die score


given exactly two heads and compare with the theoretical
conditional density in the last exercise.

33. Suppose that a bag contains 12 coins: 5 are fair, 4 are biased
with probability of heads 1/3; and 3 are two-headed. A coin is
chosen at random from the bag and tossed twice.

a. Find the probability that there will be exactly 2 heads.


b. Given that there were 2 heads, find the conditional density of
the type of coin

Compare Exercises 31 and 33. In Exercise 31, we toss a coin with


a fixed probability of heads a random number of times. In Exercise
33, we effectively toss a coin with a random probability of heads a
fixed number of times.

34. In the coin-die experiment, a fair coin is tossed. If the coin


lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat
die is tossed (1 and 6 have probability 1/4 each, while 2, 3, 4, 5
have probability 1/8 each). Find the density function of the die
score.
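Exercise 34 is a conditioning argument: condition on the coin and average the two die densities. A sketch:

```python
fair = {k: 1 / 6 for k in range(1, 7)}
flat = {1: 1/4, 2: 1/8, 3: 1/8, 4: 1/8, 5: 1/8, 6: 1/4}   # ace-six flat die

# Law of total probability, conditioning on the fair coin
density = {k: 0.5 * fair[k] + 0.5 * flat[k] for k in range(1, 7)}
```

Each of the scores 1 and 6 gets probability 5/24, and each of 2, 3, 4, 5 gets 7/48.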

35. Run the coin-die experiment 1000 times, updating every 10


runs. Compare the empirical density of the die score with the
theoretical density in the last exercise.

36. A plant has 3 assembly lines that produces memory chips.


Line 1 produces 50% of the chips and has a defective rate of 4%;
line 2 produces 30% of the chips and has a defective rate of
5%; line 3 produces 20% of the chips and has a defective rate of
1%. A chip is chosen at random from the plant.

a. Find the probability that the chip is defective.


b. Given that the chip is defective, find the conditional density
of the line that produced the chip.
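Exercise 36 follows the same prior/posterior pattern; in code (a sketch):

```python
prior = {1: 0.50, 2: 0.30, 3: 0.20}    # P(line i produced the chip)
defect = {1: 0.04, 2: 0.05, 3: 0.01}   # P(defective | line i)

# a. Law of total probability
p_defective = sum(prior[i] * defect[i] for i in prior)
# b. Bayes' theorem: conditional density of the line given a defective chip
posterior = {i: prior[i] * defect[i] / p_defective for i in prior}
```

P(defective) = 0.037, and given a defective chip the posterior over lines 1, 2, 3 is roughly (0.541, 0.405, 0.054).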

Data Analysis Exercises

37. In the M&M data, let R denote the number of red candies and
N the total number of candies. Compute and graph the empirical
density of

a. R
b. N
c. R given N > 57.

38. In the Cicada data, let G denotes gender, S denotes species


type, and W denotes body weight (in grams). Compute the
empirical density of

a. G
b. S
c. (G, S)
d. G given W > 0.20 grams.

Discrete probability distribution

Introduction:

In lecture number two, we said a Random Variable is a quantity


resulting from a random experiment that, by chance, can assume
different values. Such as, number of defective light bulbs produced
during a week. Also, we said a Discrete Random Variable is a
variable which can assume only integer values, such as 7, 9, and
so on. In other words, a discrete random variable cannot take
fractional values. Things such as people, cars, or defectives are
things we can count and are discrete items.
In this lecture note, we would like to discuss three types of Discrete
Probability Distribution: Binomial Distribution, Poisson
Distribution, and Hypergeometric Distribution.

Probability Distribution:

A probability distribution is similar to the frequency distribution of a


quantitative population because both provide a long-run frequency
for outcomes. In other words, a probability distribution is a listing of all
the possible values that a random variable can take along with their
probabilities. For example, suppose we want to find out the
probability distribution for the number of heads on three tosses of a
coin:

First toss.........T T T T H H H H
Second toss.....T T H H T T H H
Third toss........T H T H T H T H

The probability distribution of the above experiment is as follows
(columns 1 and 2 in the following table).

(Column 1)                (Column 2)           (Column 3)
Number of heads X         Probability P(X)     (1)(2) = X.P(X)
0                         1/8                  0.000
1                         3/8                  0.375
2                         3/8                  0.750
3                         1/8                  0.375
Total                                          1.5 = E(X)

Mean, and Variance of Discrete Random Variables:

The equation for computing the mean, or expected value of


discrete random variables is as follows:

Mean = E(X) = Summation[X.P(X)]


where: E(X) = expected value, X = an event, and P(X) = probability
of the event

Note that in the above equation, the probability of each event is


used as the weight. For example, going back to the problem of
tossing a coin three times, the expected value is: E(X) =
0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5 (column 3 in the above table).
Thus, on the average, the number of heads showing face up in a
large number of tossing a coin is 1.5. The expected value has many
uses in gambling, for example, it tells us what our long-run average
losses per play will be.

The equations for computing the expected value, variance, and
standard deviation of discrete random variables are as follows:

Mean = E(X) = Summation[X.P(X)]
Variance = Summation[(X - E(X))^2 . P(X)]
Standard deviation = square root of the variance

Example:

Suppose a charity organization is mailing printed return-address


stickers to over one million homes in the U.S. Each recipient is
asked to donate either $1, $2, $5, $10, $15, or $20. Based on past
experience, the amount a person donates is believed to follow the
following probability distribution:

X:       $1      $2      $5      $10     $15     $20
P(X):    0.1     0.2     0.3     0.2     0.15    0.05

The question is: what is an average donor expected to
contribute, and what is the standard deviation? The solution is as
follows.

(1)     (2)      (3)        (4)          (5)               (6)
X       P(X)     X.P(X)     X - mean     (X - mean)^2      (5)x(2)
1       0.1      0.10       -6.25         39.06             3.906
2       0.2      0.40       -5.25         27.56             5.512
5       0.3      1.50       -2.25          5.06             1.518
10      0.2      2.00        2.75          7.56             1.512
15      0.15     2.25        7.75         60.06             9.009
20      0.05     1.00       12.75        162.56             8.128
Total            7.25 = E(X)                               29.585

Thus, the expected value is $7.25, and the standard deviation is the
square root of 29.585, which is equal to $5.44. In other words, an
average donor is expected to donate $7.25 with a standard
deviation of $5.44.
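The charity computation generalizes to any discrete distribution; a sketch (function and variable names are illustrative):

```python
def mean_var(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in dist.items())
    var = sum((x - mean) ** 2 * p for x, p in dist.items())
    return mean, var

# The donation distribution from the example above
donation = {1: 0.1, 2: 0.2, 5: 0.3, 10: 0.2, 15: 0.15, 20: 0.05}
mean, var = mean_var(donation)
sd = var ** 0.5    # about $5.44
```

This reproduces E(X) = 7.25 and a variance of about 29.59 (the table's 29.585 reflects rounding in the intermediate columns).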

Binomial Distribution:

One of the most widely known of all discrete probability distributions


is the binomial distribution. Several characteristics underlie the use
of the binomial distribution.

Characteristics of the Binomial Distribution:

1. The experiment consists of n identical trials.


2. Each trial results in one of two possible mutually exclusive
outcomes, a success or a failure.
3. The probability of each outcome does not change from trial to
trial, and
4. The trials are independent, thus we must sample with
replacement.

Note that if the sample size, n, is less than 5% of the population,


the independence assumption is not of great concern. Therefore

the acceptable sample size for using the binomial distribution with
samples taken without replacement is [n < 5% of N], where n is equal to
the sample size, and N stands for the size of the population. The
birth of children (male or female), true-false or multiple-choice
questions (correct or incorrect answers) are some examples of the
binomial distribution.

Binomial Equation:

When using the binomial formula to solve problems, all that is


necessary is that we be able to identify three things: the number of
trials (n), the probability of a success on any one trial (p), and the
number of successes desired (X). The formulas used to compute
the probability, the mean, and the standard deviation of a binomial
distribution are as follows:

P(X) = C(n, X) p^X q^(n - X),  mean = np,  standard deviation = sqrt(npq)

where: n = the sample size or the number of trials, X = the number


of successes desired, p = probability of getting a success in one
trial, and q = (1 - p) = the probability of getting a failure in one trial.

Example:

Let's go back to lecture number four and solve the probability


problem of defective TVs by applying the binomial equation once
again. We said, suppose that 4% of all TVs made by W&B
Company in 1995 are defective. If eight of these TVs are randomly
selected from across the country and tested, what is the probability
that exactly three of them are defective? Assume that each TV is
made independently of the others.

In this problem, n=8, X=3, p=0.04, and q=(1-p)=0.96. Plugging


these numbers into the binomial formula (see the above equation)
we get: P(X) = P(3) = C(8, 3)(0.04)^3(0.96)^5 = 0.0029, or about
0.29%. The mean is equal to (n) x (p) =
(8)(0.04) = 0.32, the variance is equal to np(1 - p) = (0.32)(0.96) =
0.31, and the standard deviation is the square root of 0.31, which is
approximately 0.55.
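Carrying out the W&B arithmetic in code (a sketch):

```python
from math import comb

# 8 TVs sampled, 4% defective rate, exactly 3 defective
n, p, x = 8, 0.04, 3
prob = comb(n, x) * p**x * (1 - p)**(n - x)   # about 0.0029
mean = n * p                                  # 0.32
sd = (n * p * (1 - p)) ** 0.5                 # about 0.55
```

Note how small the probability is: with a mean of only 0.32 defectives per sample, three defectives is a rare event.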

The Binomial Table: Mathematicians constructed a set of binomial


tables containing presolved probabilities. Binomial distributions are
a family of distributions. In other words, every different value of n
and/or every different value of p gives a different binomial
distribution. Tables are available for different combinations of n and
p values. For the tables, refer to the text. Each table is headed by a
value of n, and values of p are presented in the top row of each
table of size n. In the column below each value of p is the binomial
distribution for that value of n and p. The binomial tables are easy
to use. Simply look up n and p, then find X (located in the first
column of each table), and read the corresponding probability. The
following table is the binomial probabilities for n = 6. Note that the
probabilities in each column of the binomial table must add up to
1.0.

Binomial Probability Distribution Table
(n = 6)
----------------------------------------------------------------------------
                        Probability
X       p = 0.1         p = 0.3         p = 0.9
----------------------------------------------------------------------------
0       0.531           0.118           0.000
1       0.354           0.303           0.000
2       0.098           0.324           0.001
3       0.015           0.185           0.015
4       0.001           0.060           0.098
5       0.000           0.010           0.354
6       0.000           0.001           0.531
----------------------------------------------------------------------------

Example:

Suppose that an examination consists of six true and false


questions, and assume that a student has no knowledge of the
subject matter. The probability that the student will guess the
correct answer to the first question is 30%. Likewise, the probability
of guessing each of the remaining questions correctly is also 30%.
What is the probability of getting more than three correct answers?
For the above problem, n = 6, p = 0.30, and X >3. In the above
table, search along the row of p values for 0.30. The problem is to
locate the P(X > 3). Thus, the answer involves summing the

probabilities for X = 4, 5, and 6. These values appear in the X


column at the intersection of each X value and p = 0.30, as follows:
P(X > 3) = P(X = 4) + P(X = 5) + P(X = 6) =
(0.060) + (0.010) + (0.001) = 0.071 or 7.1%
Thus, we may conclude that if the student guesses every answer
with a 30% chance of being correct, the probability is 0.071 (or
7.1%) that more than three of the questions are answered correctly.

Graphing the Binomial Distribution:

The graph of a binomial distribution can be constructed by using all


the possible X values of a distribution and their associated
probabilities. The X values are graphed along the X axis, and the
probabilities are graphed along the Y axis. Note that the graph of
the binomial distribution has three shapes: If p<0.5, the graph is
positively skewed, if p>0.5, the graph is negatively skewed, and if
p=0.5, the graph is symmetrical. The skewness is eliminated as n
gets large. In other words, if n remains constant but p becomes
larger and larger up to 0.50, the shape of the binomial probability
distribution becomes more symmetrical. If p remains the same but
n becomes larger and larger, the shape of the binomial probability
distribution becomes more symmetrical.

Example:

In a large consignment of electric bulbs 10% are defective. A random

sample of 20 is taken for inspection. Find the probability that a) all
are good bulbs, b) there are at most 3 defective bulbs, c) there are
exactly 3 defective bulbs.

Solution:

Here n = 20; p = 10/100 = 0.1; q = 0.9.

By the binomial distribution, the probability of getting x defective
bulbs is p(x) = C(20, x) (0.1)^x (0.9)^(20 - x).

a) Probability of getting all good bulbs = probability of getting zero
defective bulbs:

p(x = 0) = (0.9)^20 = 0.1216

b) p(x <= 3) = p(x = 0) + p(x = 1) + p(x = 2) + p(x = 3) = 0.8671

c) p(x = 3) = 0.1901

Binomial Distribution

To understand binomial distributions and binomial probability, it


helps to understand binomial experiments and some associated
notation; so we cover those topics first.

Binomial Experiment
A binomial experiment (also known as a Bernoulli trial) is a
statistical experiment that has the following properties:

- The experiment consists of n repeated trials.
- Each trial can result in just two possible outcomes. We call
one of these outcomes a success and the other, a failure.
- The probability of success, denoted by P, is the same on
every trial.
- The trials are independent; that is, the outcome on one trial
does not affect the outcome on other trials.

Consider the following statistical experiment. You flip a coin 2 times


and count the number of times the coin lands on heads. This is a
binomial experiment because:

- The experiment consists of repeated trials. We flip a coin 2
times.
- Each trial can result in just two possible outcomes - heads or
tails.
- The probability of success is constant - 0.5 on every trial.
- The trials are independent; that is, getting heads on one trial
does not affect whether we get heads on other trials.

Notation
The following notation is helpful, when we talk about binomial
probability.

- x: The number of successes that result from the binomial
experiment.
- n: The number of trials in the binomial experiment.
- P: The probability of success on an individual trial.
- Q: The probability of failure on an individual trial. (This is
equal to 1 - P.)
- b(x; n, P): Binomial probability - the probability that an n-trial
binomial experiment results in exactly x successes, when the
probability of success on an individual trial is P.

- nCr: The number of combinations of n things, taken r at a
time.

Binomial Distribution
A binomial random variable is the number of successes x in n
repeated trials of a binomial experiment. The probability distribution
of a binomial random variable is called a binomial distribution
(also known as a Bernoulli distribution).

Suppose we flip a coin two times and count the number of heads
(successes). The binomial random variable is the number of heads,
which can take on values of 0, 1, or 2. The binomial distribution is
presented below.

Number of heads Probability


0 0.25
1 0.50
2 0.25

The binomial distribution has the following properties:

- The mean of the distribution (μ_x) is equal to n * P.
- The variance (σ²_x) is n * P * (1 - P).
- The standard deviation (σ_x) is sqrt[ n * P * (1 - P) ].

Binomial Probability
The binomial probability refers to the probability that a binomial
experiment results in exactly x successes. For example, in the
above table, we see that the binomial probability of getting exactly
one head in two coin flips is 0.50.

Given x, n, and P, we can compute the binomial probability based


on the following formula:

Binomial Formula. Suppose a binomial experiment consists of n


trials and results in x successes. If the probability of success on an
individual trial is P, then the binomial probability is:

b(x; n, P) = nCx * P^x * (1 - P)^(n - x)



Example 1

Suppose a die is tossed 5 times. What is the probability of getting


exactly 2 fours?

Solution: This is a binomial experiment in which the number of trials


is equal to 5, the number of successes is equal to 2, and the
probability of success on a single trial is 1/6 or about 0.167.
Therefore, the binomial probability is:

b(2; 5, 0.167) = 5C2 * (0.167)^2 * (0.833)^3


b(2; 5, 0.167) = 0.161

Cumulative Binomial Probability


A cumulative binomial probability refers to the probability that
the binomial random variable falls within a specified range (e.g., is
greater than or equal to a stated lower limit and less than or equal
to a stated upper limit).

For example, we might be interested in the cumulative binomial


probability of obtaining 45 or fewer heads in 100 tosses of a coin
(see Example 1 below). This would be the sum of all these
individual binomial probabilities.

b(x < 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + ... +
b(x = 44; 100, 0.5) + b(x = 45; 100, 0.5)

Binomial Calculator
As you may have noticed, the binomial formula requires many time-
consuming computations. The Binomial Calculator can do this work
for you - quickly, easily, and error-free. Use the Binomial Calculator
to compute binomial probabilities and cumulative binomial
probabilities. The calculator is free. It can be found under the Stat
Tables menu item, which appears in the header of every Stat Trek
web page.


Example 1

What is the probability of obtaining 45 or fewer heads in 100 tosses


of a coin?

Solution: To solve this problem, we compute 46 individual


probabilities, using the binomial formula. The sum of all these
probabilities is the answer we seek. Thus,

b(x < 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + . . . +
b(x = 45; 100, 0.5)
b(x < 45; 100, 0.5) = 0.184

Example 2

The probability that a student is accepted to a prestigious college


is 0.3. If 5 students from the same school apply, what is the
probability that at most 2 are accepted?

Solution: To solve this problem, we compute 3 individual


probabilities, using the binomial formula. The sum of all these
probabilities is the answer we seek. Thus,

b(x < 2; 5, 0.3) = b(x = 0; 5, 0.3) + b(x = 1; 5, 0.3) + b(x = 2; 5, 0.3)


b(x < 2; 5, 0.3) = 0.1681 + 0.3601 + 0.3087
b(x < 2; 5, 0.3) = 0.8369
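Both examples reduce to summing individual binomial probabilities; a sketch of the cumulative computation (`binom_cdf` is an ad-hoc helper, not the Binomial Calculator mentioned above):

```python
from math import comb

def binom_cdf(x, n, p):
    """Cumulative binomial probability P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

p_at_most_2 = binom_cdf(2, 5, 0.3)       # Example 2: about 0.8369
p_at_most_45 = binom_cdf(45, 100, 0.5)   # Example 1: about 0.184
```

This reproduces both answers without any table lookups.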

Example 3

What is the probability that the world series will last 4 games? 5
games? 6 games? 7 games? Assume that the teams are evenly
matched.

Solution: This is a very tricky application of the binomial distribution.


If you can follow the logic of this solution, you have a good
understanding of the material covered in the tutorial, to this point.

In the world series, there are two baseball teams. The series ends
when the winning team wins 4 games. Therefore, we define a
success as a win by the team that ultimately becomes the world
series champion.

For the purpose of this analysis, we assume that the teams are
evenly matched. Therefore, the probability that a particular team
wins a particular game is 0.5.

Let's look first at the simplest case. What is the probability that the
series lasts only 4 games. This can occur if one team wins the first
4 games. The probability of the National League team winning 4
games in a row is:

b(4; 4, 0.5) = 4C4 * (0.5)^4 * (0.5)^0 = 0.0625



Similarly, when we compute the probability of the American League


team winning 4 games in a row, we find that it is also 0.0625.
Therefore, probability that the series ends in four games would be
0.0625 + 0.0625 = 0.125; since the series would end if either the
American or National League team won 4 games in a row.

Now let's tackle the question of finding probability that the world
series ends in 5 games. The trick in finding this solution is to
recognize that the series can only end in 5 games, if one team has
won 3 out of the first 4 games. So let's first find the probability that
the American League team wins exactly 3 of the first 4 games.

b(3; 4, 0.5) = 4C3 * (0.5)^3 * (0.5)^1 = 0.25

Okay, here comes some more tricky stuff, so listen up. Given that
the American League team has won 3 of the first 4 games, the
American League team has a 50/50 chance of winning the fifth
game to end the series. Therefore, the probability of the American
League team winning the series in 5 games is 0.25 * 0.50 = 0.125.
Since the National League team could also win the series in 5
games, the probability that the series ends in 5 games would be
0.125 + 0.125 = 0.25.

The rest of the problem would be solved in the same way. You
should find that the probability of the series ending in 6 games is
0.3125; and the probability of the series ending in 7 games is also
0.3125.
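The case-by-case reasoning above collapses into one formula: the series lasts exactly g games when the eventual champion wins game g and exactly 3 of the first g - 1 games, and either team can be the champion. In code (a sketch):

```python
from math import comb

def p_lasts(g):
    """P(an evenly matched best-of-seven series lasts exactly g games):
    2 teams * C(g - 1, 3) ways to place the champion's first 3 wins * 0.5^g."""
    return 2 * comb(g - 1, 3) * 0.5 ** g

probs = {g: p_lasts(g) for g in (4, 5, 6, 7)}
```

This reproduces 0.125, 0.25, 0.3125, and 0.3125 for series of 4, 5, 6, and 7 games.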

While this is statistically correct in theory, over the years the actual
world series has turned out differently, with more series than
expected lasting 7 games. For an interesting discussion of why
world series reality differs from theory, see Ben Stein's explanation
of why 7-game world series are more common than expected.

Negative Binomial and Geometric Distributions

In this lesson, we cover the negative binomial distribution and the


geometric distribution. As we will see, the geometric distribution is a
special case of the negative binomial distribution.

Negative Binomial Experiment


A negative binomial experiment is a statistical experiment that
has the following properties:

- The experiment consists of x repeated trials.
- Each trial can result in just two possible outcomes. We call
one of these outcomes a success and the other, a failure.

- The probability of success, denoted by P, is the same on
every trial.
- The trials are independent; that is, the outcome on one trial
does not affect the outcome on other trials.
- The experiment continues until r successes are observed,
where r is specified in advance.

Consider the following statistical experiment. You flip a coin


repeatedly and count the number of times the coin lands on heads.
You continue flipping the coin until it has landed 5 times on heads.
This is a negative binomial experiment because:

 The experiment consists of repeated trials. We flip a coin


repeatedly until it has landed 5 times on heads.
 Each trial can result in just two possible outcomes - heads or
tails.
 The probability of success is constant - 0.5 on every trial.
 The trials are independent; that is, getting heads on one trial
does not affect whether we get heads on other trials.
 The experiment continues until a fixed number of successes
have occurred; in this case, 5 heads.

Notation
The following notation is helpful, when we talk about negative
binomial probability.

 x: The number of trials required to produce r successes in a


negative binomial experiment.
 r: The number of successes in the negative binomial
experiment.
 P: The probability of success on an individual trial.
 Q: The probability of failure on an individual trial. (This is
equal to 1 - P.)
 b*(x; r, P): Negative binomial probability - the probability that
an x-trial negative binomial experiment results in the rth
success on the xth trial, when the probability of success on
an individual trial is P.
 nCr: The number of combinations of n things, taken r at a
time.

Negative Binomial Distribution


A negative binomial random variable is the number X of
repeated trials to produce r successes in a negative binomial
experiment. The probability distribution of a negative binomial
random variable is called a negative binomial distribution. The

negative binomial distribution is also known as the Pascal


distribution.

Suppose we flip a coin repeatedly and count the number of heads


(successes). If we continue flipping the coin until it has landed 2
times on heads, we are conducting a negative binomial experiment.
The negative binomial random variable is the number of coin flips
required to achieve 2 heads. In this example, the number of coin
flips is a random variable that can take on any integer value
between 2 and plus infinity. The negative binomial probability
distribution for this example is presented below.

Number of coin flips Probability

2 0.25

3 0.25

4 0.1875

5 0.125

6 0.078125

7 or more 0.109375

Negative Binomial Probability


The negative binomial probability refers to the probability that a
negative binomial experiment results in r - 1 successes after trial x -
1 and r successes after trial x. For example, in the above table, we
see that the negative binomial probability of getting the second
head on the sixth flip of the coin is 0.078125.

Given x, r, and P, we can compute the negative binomial probability


based on the following formula:

Negative Binomial Formula. Suppose a negative binomial
experiment consists of x trials and results in r successes. If the
probability of success on an individual trial is P, then the negative
binomial probability is:

b*(x; r, P) = x-1Cr-1 * P^r * (1 - P)^(x - r)

The negative binomial distribution has the following properties:

 The mean of the distribution (the expected number of trials) is: μ = r / P .

 The variance is: σ² = rQ / P² .
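The negative binomial formula translates directly into code. This sketch (Python; `math.comb` supplies the nCr combinations, and the function name is ours) reproduces the coin-flip table shown earlier in this lesson:

```python
from math import comb

def neg_binomial(x, r, p):
    """b*(x; r, P): probability that the r-th success occurs on trial x."""
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

# Reproduce the coin-flip table (r = 2 heads, P = 0.5) shown earlier:
for x in range(2, 7):
    print(x, neg_binomial(x, 2, 0.5))
print("7 or more:", 1 - sum(neg_binomial(x, 2, 0.5) for x in range(2, 7)))
```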

Geometric Distribution
The geometric distribution is a special case of the negative
binomial distribution. It deals with the number of trials required for a
single success. Thus, the geometric distribution is negative
binomial distribution where the number of successes (r) is equal to
1.

An example of a geometric distribution would be tossing a coin until


it lands on heads. We might ask: What is the probability that the
first head occurs on the third flip? That probability is referred to as a
geometric probability and is denoted by g(x; P). The formula for
geometric probability is given below.

Geometric Probability Formula. Suppose a negative binomial


experiment consists of x trials and results in one success. If the
probability of success on an individual trial is P, then the geometric
probability is:

g(x; P) = P * Q^(x - 1)

The geometric distribution has the following properties:

 The mean of the distribution (the expected number of trials) is: μ = 1 / P .

 The variance is: σ² = Q / P² .
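As a quick check of the geometric formula, this sketch (Python; the function name is ours) answers the question posed above, the probability that the first head occurs on the third flip of a fair coin:

```python
def geometric(x, p):
    """g(x; P) = P * Q^(x-1): probability that the first success occurs on trial x."""
    return p * (1 - p)**(x - 1)

# Probability that the first head occurs on the third flip of a fair coin:
print(geometric(3, 0.5))  # 0.5 * 0.5^2 = 0.125
```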

Sample Problems
The problems below show how to apply your new-found knowledge
of the negative binomial distribution (see Example 1) and the
geometric distribution (see Example 2).

Negative Binomial Calculator



As you may have noticed, the negative binomial formula requires


some potentially time-consuming computations. The Negative
Binomial Calculator can do this work for you - quickly, easily, and
error-free. Use the Negative Binomial Calculator to compute
negative binomial probabilities and geometric probabilities. The
calculator is free. It can be found under the Stat Tables menu item,
which appears in the header of every Stat Trek web page.


Example 1

Bob is a high school basketball player. He is a 70% free throw


shooter. That means his probability of making a free throw is 0.70.
During the season, what is the probability that Bob makes his third
free throw on his fifth shot?

Solution: This is an example of a negative binomial experiment.


The probability of success (P) is 0.70, the number of trials (x) is 5,
and the number of successes (r) is 3.

To solve this problem, we enter these values into the negative


binomial formula.

b*(x; r, P) = x-1Cr-1 * P^r * Q^(x - r)

b*(5; 3, 0.7) = 4C2 * 0.7^3 * 0.3^2
b*(5; 3, 0.7) = 6 * 0.343 * 0.09 = 0.18522

Thus, the probability that Bob will make his third successful free
throw on his fifth shot is 0.18522.

Example 2

Let's reconsider the above problem from Example 1. This time, we'll
ask a slightly different question: What is the probability that Bob
makes his first free throw on his fifth shot?

Solution: This is an example of a geometric distribution, which is a


special case of a negative binomial distribution. Therefore, this
problem can be solved using the negative binomial formula or the
geometric formula. We demonstrate each approach below,
beginning with the negative binomial formula.

The probability of success (P) is 0.70, the number of trials (x) is 5,


and the number of successes (r) is 1. We enter these values into
the negative binomial formula.

b*(x; r, P) = x-1Cr-1 * P^r * Q^(x - r)

b*(5; 1, 0.7) = 4C0 * 0.7^1 * 0.3^4
b*(5; 1, 0.7) = 0.00567

Now, we demonstate a solution based on the geometric formula.

g(x; P) = P * Q^(x - 1)

g(5; 0.7) = 0.7 * 0.3^4 = 0.00567

Notice that each approach yields the same answer.

Hypergeometric Distribution

This lesson covers hypergeometric experiments, hypergeometric


distributions, and hypergeometric probability.

Hypergeometric Experiments
A hypergeometric experiment is a statistical experiment that has
the following properties:

 A sample of size n is randomly selected without replacement


from a population of N items.
 In the population, k items can be classified as successes,
and N - k items can be classified as failures.

Consider the following statistical experiment. You have an urn of 10


marbles - 5 red and 5 green. You randomly select 2 marbles
without replacement and count the number of red marbles you have
selected. This would be a hypergeometric experiment.

Note that it would not be a binomial experiment. A binomial


experiment requires that the probability of success be constant on
every trial. With the above experiment, the probability of a success
changes on every trial. In the beginning, the probability of selecting
a red marble is 5/10. If you select a red marble on the first trial, the
probability of selecting a red marble on the second trial is 4/9. And
if you select a green marble on the first trial, the probability of
selecting a red marble on the second trial is 5/9.

Note further that if you selected the marbles with replacement, the
probability of success would not change. It would be 5/10 on every
trial. Then, this would be a binomial experiment.

Notation

The following notation is helpful, when we talk about


hypergeometric distributions and hypergeometric probability.

 N: The number of items in the population.


 k: The number of items in the population that are classified
as successes.
 n: The number of items in the sample.
 x: The number of items in the sample that are classified as
successes.
 kCx: The number of combinations of k things, taken x at a
time.
 h(x; N, n, k): hypergeometric probability - the probability
that an n-trial hypergeometric experiment results in exactly x
successes, when the population consists of N items, k of
which are classified as successes.

Hypergeometric Distribution
A hypergeometric random variable is the number of successes
that result from a hypergeometric experiment. The probability
distribution of a hypergeometric random variable is called a
hypergeometric distribution.

Given x, N, n, and k, we can compute the hypergeometric


probability based on the following formula:

Hypergeometric Formula. Suppose a population consists of N


items, k of which are successes. And a random sample drawn from
that population consists on n items, x of which are successes. Then
the hypergeometric probability is:

h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]

The hypergeometric distribution has the following properties:

 The mean of the distribution is equal to n * k / N .


 The variance is n * k * ( N - k ) * ( N - n ) / [ N² * ( N - 1 ) ] .

Example 1

Suppose we randomly select 5 cards without replacement from an


ordinary deck of playing cards. What is the probability of getting
exactly 2 red cards (i.e., hearts or diamonds)?

Solution: This is a hypergeometric experiment in which we know


the following:

 N = 52; since there are 52 cards in a deck.


 k = 26; since there are 26 red cards in a deck.
 n = 5; since we randomly select 5 cards from the deck.
 x = 2; since 2 of the cards we select are red.

We plug these values into the hypergeometric formula as follows:

h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]


h(2; 52, 5, 26) = [ 26C2 ] [ 26C3 ] / [ 52C5 ]
h(2; 52, 5, 26) = [ 325 ] [ 2600 ] / [ 2,598,960 ] = 0.32513

Thus, the probability of randomly selecting 2 red cards is 0.32513.
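The hypergeometric computation in Example 1 can be verified with a few lines of code (Python; `math.comb` computes the combinations, and the function name is ours):

```python
from math import comb

def hypergeometric(x, N, n, k):
    """h(x; N, n, k): probability of exactly x successes in the sample."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Exactly 2 red cards in a 5-card hand (N = 52, k = 26 red cards):
print(hypergeometric(2, 52, 5, 26))  # ≈ 0.32513
```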

Hypergeometric Calculator
As you surely noticed, the hypergeometric formula requires many
time-consuming computations. The Stat Trek Hypergeometric
Calculator can do this work for you - quickly, easily, and error-free.
Use the Hypergeometric Calculator to compute hypergeometric
probabilities and cumulative hypergeometric probabilities. The
calculator is free. It can be found under the Stat Tables menu item,
which appears in the header of every Stat Trek web page.


Cumulative Hypergeometric Probability


A cumulative hypergeometric probability refers to the probability
that the hypergeometric random variable is greater than or equal to
some specified lower limit and less than or equal to some specified
upper limit.

For example, suppose we randomly select five cards from an


ordinary deck of playing cards. We might be interested in the
cumulative hypergeometric probability of obtaining 2 or fewer
hearts. This would be the probability of obtaining 0 hearts plus the
probability of obtaining 1 heart plus the probability of obtaining 2
hearts, as shown in the example below.

Example 1

Suppose we select 5 cards from an ordinary deck of playing cards.


What is the probability of obtaining 2 or fewer hearts?

Solution: This is a hypergeometric experiment in which we know


the following:

 N = 52; since there are 52 cards in a deck.


 k = 13; since there are 13 hearts in a deck.
 n = 5; since we randomly select 5 cards from the deck.
 x = 0 to 2; since our selection includes 0, 1, or 2 hearts.

We plug these values into the hypergeometric formula as follows:

h(x ≤ 2; 52, 5, 13) = h(x = 0; 52, 5, 13) + h(x = 1; 52, 5, 13) + h(x = 2; 52, 5, 13)
h(x ≤ 2; 52, 5, 13) = [ (13C0) (39C5) / (52C5) ] + [ (13C1) (39C4) / (52C5) ] + [ (13C2) (39C3) / (52C5) ]
h(x ≤ 2; 52, 5, 13) = [ (1)(575,757) / (2,598,960) ] + [ (13)(82,251) / (2,598,960) ] + [ (78)(9,139) / (2,598,960) ]
h(x ≤ 2; 52, 5, 13) = [ 0.2215 ] + [ 0.4114 ] + [ 0.2743 ]
h(x ≤ 2; 52, 5, 13) = 0.9072

Thus, the probability of randomly selecting at most 2 hearts is


0.9072.
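The cumulative probability is just a sum of individual hypergeometric probabilities, which this sketch (Python, standard library only; the function name is ours) confirms:

```python
from math import comb

def hypergeometric(x, N, n, k):
    # h(x; N, n, k) = kCx * (N-k)C(n-x) / NCn
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# 2 or fewer hearts (k = 13) in a 5-card hand:
cumulative = sum(hypergeometric(x, 52, 5, 13) for x in range(3))
print(cumulative)  # ≈ 0.9072
```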

Multinomial Distribution

Multinomial Experiment
A multinomial experiment is a statistical experiment that has the
following properties:

 The experiment consists of n repeated trials.


 Each trial has a discrete number of possible outcomes.
 On any given trial, the probability that a particular outcome
will occur is constant.
 The trials are independent; that is, the outcome on one trial
does not affect the outcome on other trials.

Consider the following statistical experiment. You toss two dice


three times, and record the outcome on each toss. This is a
multinomial experiment because:

 The experiment consists of repeated trials. We toss the dice


three times.
 Each trial can result in a discrete number of outcomes - 2
through 12.
 The probability of any outcome is constant; it does not
change from one toss to the next.

 The trials are independent; that is, getting a particular


outcome on one trial does not affect the outcome on other
trials.

Note: A binomial experiment is a special case of a multinomial


experiment. Here is the main difference. With a binomial
experiment, each trial can result in two - and only two - possible
outcomes. With a multinomial experiment, each trial can have two
or more possible outcomes.

Multinomial Distribution
A multinomial distribution is the probability distribution of the
outcomes from a multinomial experiment. The multinomial formula
defines the probability of any outcome from a multinomial
experiment.

Multinomial Formula. Suppose a multinomial experiment consists


of n trials, and each trial can result in any of k possible outcomes:
E1, E2, . . . , Ek. Suppose, further, that each possible outcome can
occur with probabilities p1, p2, . . . , pk. Then, the probability (P) that
E1 occurs n1 times, E2 occurs n2 times, . . . , and Ek occurs nk times
is

P = [ n! / ( n1! * n2! * ... * nk! ) ] * ( p1^n1 * p2^n2 * ... * pk^nk )

where n = n1 + n2 + ... + nk.

The examples below illustrate how to use the multinomial formula


to compute the probability of an outcome from a multinomial
experiment.
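The multinomial formula is easy to implement. This sketch (Python, standard library only; the helper name is ours) can be used to check the worked examples that appear later in this lesson:

```python
from math import factorial, prod

def multinomial(counts, probs):
    """P that outcome i occurs counts[i] times, given probabilities probs[i]."""
    n = sum(counts)
    coefficient = factorial(n) // prod(factorial(c) for c in counts)
    return coefficient * prod(p**c for c, p in zip(counts, probs))

# 5 card draws with replacement: 1 spade, 1 heart, 1 diamond, 2 clubs.
print(multinomial([1, 1, 1, 2], [0.25, 0.25, 0.25, 0.25]))  # ≈ 0.0586
```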

Multinomial Calculator
As you may have noticed, the multinomial formula requires many
time-consuming computations. The Multinomial Calculator can do
this work for you - quickly, easily, and error-free. Use the
Multinomial Calculator to compute the probability of outcomes from
multinomial experiments. The calculator is free. It can be found
under the Stat Tables menu item, which appears in the header of
every Stat Trek web page.


Multinomial Probability: Sample


Problems
Example 1

Suppose a card is drawn randomly from an ordinary deck of playing


cards, and then put back in the deck. This exercise is repeated five
times. What is the probability of drawing 1 spade, 1 heart, 1
diamond, and 2 clubs?

Solution: To solve this problem, we apply the multinomial formula.


We know the following:

 The experiment consists of 5 trials, so n = 5.


 The 5 trials produce 1 spade, 1 heart, 1 diamond, and 2
clubs; so n1 = 1, n2 = 1, n3 = 1, and n4 = 2.
 On any particular trial, the probability of drawing a spade,
heart, diamond, or club is 0.25, 0.25, 0.25, and 0.25,
respectively. Thus, p1 = 0.25, p2 = 0.25, p3 = 0.25, and p4 =
0.25.

We plug these inputs into the multinomial formula, as shown below:

P = [ n! / ( n1! * n2! * ... * nk! ) ] * ( p1^n1 * p2^n2 * ... * pk^nk )

P = [ 5! / ( 1! * 1! * 1! * 2! ) ] * [ (0.25)^1 * (0.25)^1 * (0.25)^1 * (0.25)^2 ]
P = 0.05859

Thus, if we draw five cards with replacement from an ordinary deck


of playing cards, the probability of drawing 1 spade, 1 heart, 1
diamond, and 2 clubs is 0.05859.

Example 2

Suppose we have a bowl with 10 marbles - 2 red marbles, 3 green


marbles, and 5 blue marbles. We randomly select 4 marbles from
the bowl, with replacement. What is the probability of selecting 2
green marbles and 2 blue marbles?

Solution: To solve this problem, we apply the multinomial formula.


We know the following:

 The experiment consists of 4 trials, so n = 4.


 The 4 trials produce 0 red marbles, 2 green marbles, and 2
blue marbles; so nred = 0, ngreen = 2, and nblue = 2.

 On any particular trial, the probability of drawing a red,


green, or blue marble is 0.2, 0.3, and 0.5, respectively. Thus,
pred = 0.2, pgreen = 0.3, and pblue = 0.5

We plug these inputs into the multinomial formula, as shown below:

P = [ n! / ( n1! * n2! * ... * nk! ) ] * ( p1^n1 * p2^n2 * ... * pk^nk )

P = [ 4! / ( 0! * 2! * 2! ) ] * [ (0.2)^0 * (0.3)^2 * (0.5)^2 ]
P = 0.135

Thus, if we draw 4 marbles with replacement from the bowl, the


probability of drawing 0 red marbles, 2 green marbles, and 2 blue
marbles is 0.135.

The Poisson Distribution:

The Poisson distribution is another discrete probability distribution.
It is named after Simeon-Denis Poisson (1781-1840), a French
mathematician. The Poisson distribution depends only on the
average number of occurrences per unit of time or space; there is no
n, and no p. The Poisson probability distribution provides a close
approximation to the binomial probability distribution when n is
large and p is quite small or quite large. In other words, if n > 20 and
np <= 5 [or n(1 - p) <= 5], then we may use the Poisson distribution as an
approximation to the binomial distribution. For a detailed discussion of the
Poisson probability distribution, refer to the text.

Poisson Distribution

Attributes of a Poisson Experiment


A Poisson experiment is a statistical experiment that has the
following properties:

 The experiment results in outcomes that can be classified as


successes or failures.
 The average number of successes (μ) that occurs in a
specified region is known.
 The probability that a success will occur is proportional to the
size of the region.
 The probability that a success will occur in an extremely
small region is virtually zero.

Note that the specified region could take many forms. For instance,
it could be a length, an area, a volume, a period of time, etc.

Notation

The following notation is helpful, when we talk about the Poisson


distribution.

 e: A constant equal to approximately 2.71828. (Actually, e is


the base of the natural logarithm system.)
 μ: The mean number of successes that occur in a specified
region.
 x: The actual number of successes that occur in a specified
region.
 P(x; μ): The Poisson probability that exactly x successes
occur in a Poisson experiment, when the mean number of
successes is μ.

Poisson Distribution
A Poisson random variable is the number of successes that result
from a Poisson experiment. The probability distribution of a Poisson
random variable is called a Poisson distribution.

Given the mean number of successes (μ) that occur in a specified


region, we can compute the Poisson probability based on the
following formula:

Poisson Formula. Suppose we conduct a Poisson experiment, in


which the average number of successes within a given region is μ.
Then, the Poisson probability is:

P(x; μ) = (e^-μ) (μ^x) / x!

where x is the actual number of successes that result from the


experiment, and e is approximately equal to 2.71828.

The Poisson distribution has the following properties:

 The mean of the distribution is equal to μ .


 The variance is also equal to μ .

Example 1

The average number of homes sold by the Acme Realty company


is 2 homes per day. What is the probability that exactly 3 homes will
be sold tomorrow?

Solution: This is a Poisson experiment in which we know the


following:

 μ = 2; since 2 homes are sold per day, on average.


 x = 3; since we want to find the likelihood that 3 homes will
be sold tomorrow.
 e = 2.71828; since e is a constant equal to approximately
2.71828.

We plug these values into the Poisson formula as follows:

P(x; μ) = (e^-μ) (μ^x) / x!

P(3; 2) = (2.71828^-2) (2^3) / 3!
P(3; 2) = (0.13534) (8) / 6
P(3; 2) = 0.180

Thus, the probability of selling 3 homes tomorrow is 0.180 .
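The Poisson computation in Example 1 can be checked in code (Python, standard library only; the function name is ours):

```python
from math import exp, factorial

def poisson(x, mu):
    """P(x; μ) = e^(-μ) * μ^x / x!"""
    return exp(-mu) * mu**x / factorial(x)

# Exactly 3 homes sold when the daily average is 2:
print(poisson(3, 2))  # ≈ 0.180
```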

Poisson Calculator
Clearly, the Poisson formula requires many time-consuming
computations. The Stat Trek Poisson Calculator can do this work
for you - quickly, easily, and error-free. Use the Poisson Calculator
to compute Poisson probabilities and cumulative Poisson
probabilities. The calculator is free. It can be found under the Stat
Tables menu item, which appears in the header of every Stat Trek
web page.


Cumulative Poisson Probability


A cumulative Poisson probability refers to the probability that the
Poisson random variable is greater than or equal to some specified lower
limit and less than or equal to some specified upper limit.

Example 1

Suppose the average number of lions seen on a 1-day safari is 5.


What is the probability that tourists will see fewer than four lions on
the next 1-day safari?

Solution: This is a Poisson experiment in which we know the


following:

 μ = 5; since 5 lions are seen per safari, on average.


 x = 0, 1, 2, or 3; since we want to find the likelihood that
tourists will see fewer than 4 lions; that is, we want the
probability that they will see 0, 1, 2, or 3 lions.
 e = 2.71828; since e is a constant equal to approximately
2.71828.

To solve this problem, we need to find the probability that tourists


will see 0, 1, 2, or 3 lions. Thus, we need to calculate the sum of
four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5). To compute
this sum, we use the Poisson formula:

P(x ≤ 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5)
P(x ≤ 3; 5) = [ (e^-5)(5^0) / 0! ] + [ (e^-5)(5^1) / 1! ] + [ (e^-5)(5^2) / 2! ] + [ (e^-5)(5^3) / 3! ]
P(x ≤ 3; 5) = [ (0.006738)(1) / 1 ] + [ (0.006738)(5) / 1 ] + [ (0.006738)(25) / 2 ] + [ (0.006738)(125) / 6 ]
P(x ≤ 3; 5) = [ 0.0067 ] + [ 0.03369 ] + [ 0.084224 ] + [ 0.140375 ]
P(x ≤ 3; 5) = 0.2650

Thus, the probability of seeing no more than 3 lions is 0.2650.
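The cumulative Poisson probability is a sum of individual Poisson terms, as this sketch (Python, standard library only; the function name is ours) confirms:

```python
from math import exp, factorial

def poisson(x, mu):
    # P(x; μ) = e^(-μ) * μ^x / x!
    return exp(-mu) * mu**x / factorial(x)

# Fewer than 4 lions (0, 1, 2, or 3) when the average is 5 per safari:
cumulative = sum(poisson(x, 5) for x in range(4))
print(cumulative)  # ≈ 0.2650
```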

Standard Normal Distribution


The standard normal distribution is a special case of the normal
distribution. It is the distribution that occurs when a normal random
variable has a mean of zero and a standard deviation of one.

The normal random variable of a standard normal distribution is


called a standard score or a z-score. Every normal random
variable X can be transformed into a z score via the following
equation:

z = (X - μ) / σ

where X is a normal random variable, μ is the mean of X, and
σ is the standard deviation of X.

Standard Normal Distribution Table


A standard normal distribution table shows a cumulative
probability associated with a particular z-score. Table rows show
the whole number and tenths place of the z-score. Table columns
show the hundredths place. The cumulative probability (often from
minus infinity to the z-score) appears in the cell of the table.

For example, a section of the standard normal table is reproduced


below. To find the cumulative probability of a z-score equal to -1.31,
cross-reference the row of the table containing -1.3 with the column
containing 0.01. The table shows that the probability that a
standard normal random variable will be less than -1.31 is 0.0951;
that is, P(Z < -1.31) = 0.0951.

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
...    ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
-1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0722  0.0708  0.0694  0.0681
-1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
-1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
...    ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
3.0    0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990

Of course, you may not be interested in the probability that a


standard normal random variable falls between minus infinity and a
given value. You may want to know the probability that it lies
between a given value and plus infinity. Or you may want to know
the probability that a standard normal random variable lies between
two given values. These probabilities are easy to compute from a
normal distribution table. Here's how.

 Find P(Z > a). The probability that a standard normal random
variable (z) is greater than a given value (a) is easy to find.
The table shows the P(Z < a). The P(Z > a) = 1 - P(Z < a).

Suppose, for example, that we want to know the probability


that a z-score will be greater than 3.00. From the table (see
above), we find that P(Z < 3.00) = 0.9987. Therefore, P(Z >
3.00) = 1 - P(Z < 3.00) = 1 - 0.9987 = 0.0013.

 Find P(a < Z < b). The probability that a standard normal
random variable lies between two values is also easy to
find. The P(a < Z < b) = P(Z < b) - P(Z < a).

For example, suppose we want to know the probability that a


z-score will be greater than -1.40 and less than -1.20. From
the table (see above), we find that P(Z < -1.20) = 0.1151;
and P(Z < -1.40) = 0.0808. Therefore, P(-1.40 < Z < -1.20) =
P(Z < -1.20) - P(Z < -1.40) = 0.1151 - 0.0808 = 0.0343.
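Instead of reading a printed table, the cumulative probabilities above can be computed directly. This sketch uses Python's standard-library `statistics.NormalDist` (no third-party packages needed):

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # the standard normal distribution

print(Z.cdf(-1.31))                 # P(Z < -1.31) ≈ 0.0951
print(1 - Z.cdf(3.00))              # P(Z > 3.00) ≈ 0.0013
print(Z.cdf(-1.20) - Z.cdf(-1.40))  # P(-1.40 < Z < -1.20) ≈ 0.0343
```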

In school or on the Advanced Placement Statistics Exam, you may


be called upon to use or interpret standard normal distribution
tables. Standard normal tables are commonly found in appendices
of most statistics texts.

The Normal Distribution as a Model for


Measurements
Often, phenomena in the real world follow a normal (or near-
normal) distribution. This allows researchers to use the normal
distribution as a model for assessing probabilities associated with
real-world phenomena. Typically, the analysis involves two steps.

 Transform raw data. Usually, the raw data are not in the form
of z-scores. They need to be transformed into z-scores,
using the transformation equation presented earlier: z = (X -
μ) / σ.

 Find probability. Once the data have been transformed into


z-scores, you can use standard normal distribution tables,
online calculators (e.g., Stat Trek's free normal distribution
calculator), or handheld graphing calculators to find
probabilities associated with the z-scores.

The problem in the next section demonstrates the use of the normal
distribution as a model for measurement.

Test Your Understanding of This Lesson


Problem 1

Molly earned a score of 940 on a national achievement test. The


mean test score was 850 with a standard deviation of 100. What
proportion of students had a higher score than Molly? (Assume that
test scores are normally distributed.)

(A) 0.10
(B) 0.18
(C) 0.50
(D) 0.82
(E) 0.90

Solution

The correct answer is B. As part of the solution to this problem, we


assume that test scores are normally distributed. In this way, we
use the normal distribution as a model for measurement. Given an
assumption of normality, the solution involves three steps.

 First, we transform Molly's test score into a z-score, using


the z-score transformation equation.

z = (X - μ) / σ = (940 - 850) / 100 = 0.90

 Then, using an online calculator (e.g., Stat Trek's free


normal distribution calculator), a handheld graphing
calculator, or the standard normal distribution table, we find
the cumulative probability associated with the z-score. In this
case, we find P(Z < 0.90) = 0.8159.

 Therefore, the P(Z > 0.90) = 1 - P(Z < 0.90) = 1 - 0.8159 =


0.1841.

Thus, we estimate that 18.41 percent of the students tested had a


higher score than Molly.
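The three solution steps can be collapsed into a couple of lines, since Python's `statistics.NormalDist` accepts the raw mean and standard deviation and handles the z-transformation internally:

```python
from statistics import NormalDist

scores = NormalDist(mu=850, sigma=100)  # national test scores

p_higher = 1 - scores.cdf(940)  # proportion scoring above Molly
print(p_higher)  # ≈ 0.1841
```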

The Hypergeometric Distribution:

Another discrete probability distribution is the hypergeometric


distribution. The binomial probability distribution assumes that the
population from which the sample is selected is very large. For this
reason, the probability of success does not change with each trial.
The hypergeometric distribution is used to determine the probability
of a specified number of successes and/or failures when (1) a
sample is selected from a finite population without replacement
and/or (2) the sample size, n, is greater than or equal to 5%
of the population size, N (i.e., n >= 0.05N).
Note that by a finite population we mean a population which consists
of a fixed number of known individuals, objects, or measurements.
For example, there were 489 applications for the nursing school at
Clayton State College in 1994. For a detailed discussion of the
hypergeometric probability distribution, refer to the text.

Introduction:

In lecture number four we said that a continuous random variable is


a variable which can take on any value over a given interval.
Continuous variables are measured, not counted. Items such as
height, weight and time are continuous and can take on fractional
values. For example, a basketball player may be 6.8432 feet tall.
There are many continuous probability distributions, such as,
uniform distribution, normal distribution, the t distribution, the chi-
square distribution, exponential distribution, and F distribution. In

this lecture note, we will concentrate on the uniform distribution,


and normal distribution.

Uniform (or Rectangular) Distribution:

Among the continuous probability distributions, the uniform


distribution is the simplest one of all. The following figure shows an
example of a uniform distribution. In a uniform distribution, the area
under the curve is equal to the product of the length and the height
of the rectangle and equals to one.

Figure 1

where: a=lower limit of the range or interval, and b=upper limit of


the range or interval.

Note that in the above graph, since the area of the rectangle =
(length)(height) = 1, and since length = (b - a), we can write:
(b - a)(height) = 1, or height = f(X) = 1/(b - a).
The following equations are used to find the mean and standard
deviation of a uniform distribution:

Mean = (a + b)/2 and Variance = (b - a)²/12, so that the standard
deviation is the square root of (b - a)²/12.

Example:

There are many cases in which we may be able to apply the


uniform distribution. As an example, suppose that the research
department of a steel factory believes that one of the company's
rolling machines is producing sheets of steel of different thickness.
The thickness is a uniform random variable with values between
150 and 200 millimeters. Any sheets less than 160 millimeters thick
must be scrapped because they are unacceptable to the buyers.
We want to calculate the mean and the standard deviation of X
(the thickness of the sheet produced by this machine), and the
fraction of steel sheet produced by this machine that have to be
scrapped. The following figure displays the uniform distribution for
this example.

Figure 2

Note that for continuous distribution, probability is calculated by


finding the area under the function over a specific interval. In other
words, for continuous distributions, there is no probability at any
one point. The probability of X>= b or of X<= a is zero because
there is no area above b or below a, and area between a and b is
equal to one, see figure 1.

The probability of the variable falling between any two points, such
as c and d in figure 2, is calculated as follows:

P(c <= x <= d) = (d - c)/(b - a)

In this example c = a = 150, d = 160, and b = 200, therefore:

Mean = (a + b)/2 = (150 + 200)/2 = 175 millimeters, the standard
deviation is the square root of 208.3, which is equal to 14.43
millimeters, and

P(150 <= x <= 160) = (160 - 150)/(200 - 150) = 1/5 = 0.20

Thus, of all the sheets made by this machine, 20% of the production
must be scrapped.
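The numbers in this example can be reproduced with a short script (Python, standard library only; the variable and function names are ours):

```python
from math import sqrt

a, b = 150.0, 200.0  # thickness limits in millimeters

mean = (a + b) / 2            # mean of a uniform distribution on [a, b]
variance = (b - a) ** 2 / 12  # variance of a uniform distribution
std_dev = sqrt(variance)

def p_between(c, d):
    """P(c <= X <= d) for a uniform random variable on [a, b]."""
    return (d - c) / (b - a)

print(mean, std_dev, p_between(150, 160))
```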

Normal Distribution or Normal Curve:



Normal distribution is probably one of the most important and


widely used continuous distribution. It is known as a normal random
variable, and its probability distribution is called a normal
distribution. The following are the characteristics of the normal
distribution:

Characteristics of the Normal Distribution:

1. It is bell shaped and is symmetrical about its mean.


2. It is asymptotic to the horizontal axis, i.e., it extends indefinitely
in either direction from the mean.
3. It is a continuous distribution.
4. It is a family of curves, i.e., every unique pair of mean and
standard deviation defines a different normal distribution. Thus, the
normal distribution is completely described by two parameters:
mean and standard deviation. See the following figure.
5. Total area under the curve sums to 1, i.e., the area of the
distribution on each side of the mean is 0.5.
6. It is unimodal, i.e., values mound up only in the center of the
curve.
7. The probability that a random variable will have a value between
any two points is equal to the area under the curve between those
points.

Figure 3

Note that the integral calculus is used to find the area under the
normal distribution curve. However, this can be avoided by
transforming all normal distributions to fit the standard normal
distribution. This conversion is done by rescaling the normal
distribution axis from its true units (time, weight, dollars, and so on) to a
standard measure called a Z score or Z value. A Z score is the
number of standard deviations that a value, X, is away from the
mean. If the value of X is greater than the mean, the Z score is
positive; if the value of X is less than the mean, the Z score is
negative. The Z score or equation is as follows:

Z = (X - Mean) /Standard deviation

A standard Z table can be used to find probabilities for any normal


curve problem that has been converted to Z scores. For the table,
refer to the text. The Z distribution is a normal distribution with a
mean of 0 and a standard deviation of 1.
The following steps are helpful when working with normal curve
problems:
1. Graph the normal distribution, and shade the area related to the
probability you want to find.
2. Convert the boundaries of the shaded area from X values to the
standard normal random variable Z values using the Z formula
above.
3. Use the standard Z table to find the probabilities or the areas
related to the Z values in step 2.
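These three steps can be carried out in Python, with the standard library's `statistics.NormalDist` playing the role of the printed Z table. The helper names below are ours, and the numbers are illustrative rather than taken from the text:

```python
from statistics import NormalDist

def z_score(x, mean, std_dev):
    """Step 2: convert an X value to a Z value."""
    return (x - mean) / std_dev

def table_area(z):
    """Step 3: area between the mean (Z = 0) and z, which is the
    quantity a printed standard normal table reports for |z|."""
    return abs(NormalDist().cdf(z) - 0.5)

# Illustrative numbers: a value of 130 from a N(100, 15) population.
z = z_score(130, 100, 15)   # 2.0 standard deviations above the mean
print(z, round(table_area(z), 4))
```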

Example One:

Graduate Management Aptitude Test (GMAT) scores are widely


used by graduate schools of business as an entrance requirement.
Suppose that in one particular year, the mean score for the GMAT
was 476, with a standard deviation of 107. Assuming that the
GMAT scores are normally distributed, answer the following
questions:

Question 1. What is the probability that a randomly selected score


from this GMAT falls between 476 and 650? i.e., P(476 <= X <= 650) = ?
The following figure shows a graphic representation of this problem.

Figure 4

Applying the Z equation, we get: Z = (650 - 476)/107 = 1.62. The Z


value of 1.62 indicates that the GMAT score of 650 is 1.62 standard
deviation above the mean. The standard normal table gives the
probability of a value falling between 650 and the mean. The whole
number and tenths place portion of the Z score appear in the first
column of the table. Across the top of the table are the values of the
hundredths place portion of the Z score. Thus the answer is that
0.4474 or 44.74% of the scores on the GMAT fall between a score
of 650 and 476.

Question 2. What is the probability of receiving a score greater than


750 on a GMAT test that has a mean of 476 and a standard
deviation of 107? i.e., P(X >= 750) = ?. This problem is asking for
determining the area of the upper tail of the distribution. The Z
score is: Z = ( 750 - 476)/107 = 2.56. From the table, the probability
for this Z score is 0.4948. This is the probability of a GMAT with a
score between 476 and 750. The rule is that when we want to find
the probability in either tail, we must subtract the table value from
0.50. Thus, the answer to this problem is: 0.5 - 0.4948 = 0.0052 or
0.52%. Note that P(X >= 750) is the same as P(X >750), because,
in continuous distribution, the area under an exact number such as
X=750 is zero. The following figure shows a graphic representation
of this problem.

Figure 5

Question 3. What is the probability of receiving a score of 540 or


less on a GMAT test that has a mean of 476 and a standard
deviation of 107? i.e., P(X <= 540) = ? We are asked to determine
the area under the curve for all values less than or equal to 540. The
Z score is: Z = (540 - 476)/107 = 0.6. From the table, the probability
for this Z score is 0.2257, which is the probability of getting a score
between the mean (476) and 540. The rule is that when we want to
find the probability between two values of X on either side of the
mean, we just add the two areas together. Thus, the answer to this
problem is: 0.5 + 0.2257 = 0.7257, or about 73%. The following figure
shows a graphic representation of this problem.

Figure 6

Question 4. What is the probability of receiving a score between


440 and 330 on a GMAT test that has a mean of 476 and a
standard deviation of 107? i.e., P(330 <= X <= 440) = ?

Figure 7

In this problem, the two values fall on the same side of the mean.
The Z scores are: Z1 = (330 - 476)/107 = -1.36, and Z2 = (440 -
476)/107 = -0.34. The probability associated with Z = -1.36 is
0.4131, and the probability associated with Z = -0.34 is 0.1331. The
rule is that when we want to find the probability between two values
of X on one side of the mean, we just subtract the smaller area

from the larger area to get the probability between the two values.
Thus, the answer to this problem is: 0.4131 - 0.1331 = 0.28 or 28%.
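The four GMAT answers can be reproduced without a Z table using `statistics.NormalDist` from the Python standard library (a sketch; small differences from the table-based answers are rounding, not errors):

```python
from statistics import NormalDist

gmat = NormalDist(mu=476, sigma=107)    # distribution assumed in the text

q1 = gmat.cdf(650) - gmat.cdf(476)      # P(476 <= X <= 650), table: 0.4474
q2 = 1 - gmat.cdf(750)                  # P(X >= 750),        table: 0.0052
q3 = gmat.cdf(540)                      # P(X <= 540),        table: 0.7257
q4 = gmat.cdf(440) - gmat.cdf(330)      # P(330 <= X <= 440), table: 0.28

print(round(q1, 4), round(q2, 4), round(q3, 4), round(q4, 4))
```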

Example Two:

Suppose that a tire factory wants to set a mileage guarantee on its


new model called LA 50 tire. Life tests indicated that the mean
mileage is 47,900, and standard deviation of the normally
distributed distribution of mileage is 2,050 miles. The factory wants
to set the guaranteed mileage so that no more than 5% of the tires
will have to be replaced. What guaranteed mileage should the
factory announce? i.e., find the x for which P(X <= x) = 5%. In this problem, the
mean and standard deviation are given, but X and Z are unknown.
The problem is to solve for an X value that has 5% or 0.05 of the X
values less than that value. If 0.05 of the values are less than X,
then 0.45 lie between X and the mean (0.5 - 0.05), see the
following graph.

Figure 8

Refer to the standard normal distribution table and search the body
of the table for 0.45. Since the exact number is not found in the
table, search for the closest number to 0.45. There are two values
equidistant from 0.45: 0.4505 and 0.4495. Move to the left from
these values, and read the Z scores in the margin, which are 1.65
and 1.64. Take the average of these two Z scores, i.e., (1.65 +
1.64)/2 = 1.645. Since the unknown X lies below the mean, the Z
value is -1.645. Plugging this number and the values of the mean and
the standard deviation into the Z equation gives:
-1.645 = (X - 47,900)/2,050, so X = 47,900 - 1.645(2,050) = 44,528 miles
(approximately).
Thus, the factory should set the guaranteed mileage at 44,528
miles if the objective is not to replace more than 5% of the tires.
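The same answer drops out of the inverse normal CDF, avoiding the table search entirely (a sketch using the Python standard library; `inv_cdf(0.05)` returns the 5th percentile directly):

```python
from statistics import NormalDist

life = NormalDist(mu=47_900, sigma=2_050)   # tire mileage distribution

# The x with P(X <= x) = 0.05, i.e. the 5th percentile of tire life.
guarantee = life.inv_cdf(0.05)
print(round(guarantee))   # -> 44528
```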

The Normal Approximation to the Binomial Distribution:

In lecture note number 5 we talked about the binomial probability


distribution, which is a discrete distribution. You remember that we

said that as sample sizes get larger, the binomial distribution
approaches the normal distribution in shape regardless of the value
of p (the probability of success). For large sample values, the binomial
distribution is
cumbersome to analyze without a computer. Fortunately, the
normal distribution is a good approximation for binomial distribution
problems for large values of n. The commonly accepted guideline
for using the normal approximation to the binomial probability
distribution is that (n x p) and [n(1 - p)] should both be greater than 5.

Example:

Suppose that the management of a restaurant claimed that 70% of


their customers returned for another meal. In a week in which 80
new (first-time) customers dined at the restaurant, what is the
probability that 60 or more of the customers will return for another
meal?, ie., P(X >= 60) =?.

The solution to this problem can be illustrated as follows:


First, the two guidelines that (n x p) and [n(1 - p)] should be greater
than 5 are satisfied: (n x p) = (80 x 0.70) = 56 > 5, and [n(1 - p)] =
80(1 - 0.70) = 24 > 5. Second, we need to find the mean and the
standard deviation of the binomial distribution. The mean is equal to
(n x p) = (80 x 0.70) = 56 and standard deviation is square root of
[(n x p)(1 - p)], i.e., square root of 16.8, which is equal to 4.0988.
Using the Z equation we get, Z = (X - mean)/standard deviation =
(59.5 - 56)/4.0988 = 0.85. From the table, the probability for this Z
score is 0.3023 which is the probability between the mean (56) and
60. We must subtract this table value 0.3023 from 0.5 in order to
get the answer, i.e., P(X >= 60) = 0.5 -0.3023 = 0.1977. Therefore,
the probability is 19.77% that 60 or more of the 80 first-time
customers will return to the restaurant for another meal. See the
following graph.

Figure 9
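Because the underlying distribution is really binomial, the approximation can be checked against the exact tail sum (a sketch; the continuity-corrected normal answer should land close to the exact one):

```python
from math import comb
from statistics import NormalDist

n, p = 80, 0.70

# Exact binomial tail: P(X >= 60)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(60, n + 1))

# Normal approximation with the continuity correction (X >= 60 -> 59.5)
mu = n * p                        # 56
sigma = (n * p * (1 - p)) ** 0.5  # sqrt(16.8), about 4.0988
approx = 1 - NormalDist(mu, sigma).cdf(59.5)

print(round(exact, 4), round(approx, 4))
```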

Correction Factor:

The value 0.5 is added or subtracted, depending on the problem, to


the value of X when a binomial probability distribution is being
approximated by a normal distribution. This correction ensures that
most of the binomial problem's information is correctly transferred
to the normal curve analysis. This correction is called the correction
for continuity. The decision as to how to correct for continuity
depends on the equality sign and the direction of the desired
outcomes of the binomial distribution. The following table shows
some rules of thumb that can help in the application of the
correction for continuity, see the above example.

Value Being Determined          Correction
X >                             +0.50
X >=                            -0.50
X <                             -0.50
X <=                            +0.50
a <= X <= b                     -0.50 & +0.50
X =                             -0.50 & +0.50

Expectation value

The expectation value of a function f(x) in a variable x is denoted
E[f(x)] or <f(x)>. For a single discrete variable, it is defined by

E[f(x)] = Σx f(x) P(x)

where P(x) is the probability function.

For a single continuous variable it is defined by

E[f(x)] = ∫ f(x) P(x) dx

The expectation value satisfies

E[ax + by] = a E[x] + b E[y]

For multiple discrete variables

E[f(x, y)] = Σx Σy f(x, y) P(x, y)

For multiple continuous variables

E[f(x, y)] = ∫∫ f(x, y) P(x, y) dx dy

The (multiple) expectation value satisfies

E[(x - μx)(y - μy)] = E[xy] - μx μy

where μx is the mean for the variable x.

Uniform distribution

Trains arrive at a station at 15-minute intervals starting at 4 a.m. A
passenger arrives at the station at a time that is uniformly
distributed between 9:00 and 9:30. Find the probability that he has to
wait for the train for

a) less than 6 minutes

b) more than 10 minutes

Let X be the random variable representing the number of minutes


past 9 that the passenger arrives at the station.

a) He has to wait for less than 6 minutes if he arrives between 9:09
and 9:15 or between 9:24 and 9:30. So the required probability
= P(9 < X < 15) + P(24 < X < 30) = 6/30 + 6/30 = 2/5.

b) He has to wait for more than 10 minutes if he arrives between
9:00 and 9:05 or between 9:15 and 9:20. Hence the required
probability = P(0 < X < 5) + P(15 < X < 20) = 5/30 + 5/30 = 1/3.
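A quick simulation confirms both answers (a sketch; we measure minutes past 9:00, with trains at 9:00, 9:15, and 9:30):

```python
import random

random.seed(42)
trials = 200_000
under_6 = over_10 = 0

for _ in range(trials):
    x = random.uniform(0, 30)   # arrival time, minutes past 9:00
    wait = (-x) % 15            # minutes until the next train
    under_6 += wait < 6
    over_10 += wait > 10

# Should come out near 2/5 = 0.40 and 1/3 = 0.33.
print(round(under_6 / trials, 2), round(over_10 / trials, 2))
```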

Have you understood?

1. A production process manufactures computer chips on an


average of 2% non-conforming. Every day a random sample of size
50 is taken from the process. If the sample contains more than two
non-conforming chips, the process will be stopped. Determine the
probability that the process is stopped by the sampling scheme.

Solution:
Here n = 50 Bernoulli trials with p = 0.02.
Let X be the total number of non-conforming chips. Then

P(X > 2) = 1 - P(X <= 2)
         = 1 - [P(X = 0) + P(X = 1) + P(X = 2)]
         = 1 - [(0.98)^50 + 50(0.02)(0.98)^49 + 1225(0.02)^2(0.98)^48]
         = 1 - 0.922 = 0.078

Thus the probability that the process is stopped on any day, based
on this sampling scheme, is approximately 0.078.
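The arithmetic can be verified exactly with the binomial pmf (a sketch; `math.comb` supplies the binomial coefficients):

```python
from math import comb

n, p = 50, 0.02   # sample size and non-conforming rate

def pmf(k):
    """Binomial probability of exactly k non-conforming chips."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_stop = 1 - (pmf(0) + pmf(1) + pmf(2))   # P(X > 2)
print(round(p_stop, 3))   # -> 0.078
```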

2. A computer terminal repair person is beeped each time there is a
call for service. The number of beeps per hour is known to occur in
accordance with a Poisson distribution with a mean of alpha = 2 per
hour. Determine the probability of two or more beeps in a 1-hour
period.

3. A bus arrives every 20 minutes at a specified stop beginning at


6:40 am and continues until 8:40 am. A certain passenger does not
know the schedule, but arrives randomly (uniformly
distributed) between 7:00 am and 7:30 am every morning. What is
the probability that the passenger waits for more than 5 minutes for
a bus?

4. Harley Davidson, Director of Quality Control for the Kyoto Motor
Company, is conducting his monthly spot check of automatic
transmissions. In this procedure, 10 transmissions are removed from
the pool of components and are checked for manufacturing defects.
Historically only 2% of the transmissions have such flaws. (Assume
that flaws occur independently in different transmissions.)
a)What is the probability that Harley’s sample contains more than
two transmissions with manufacturing flaws?
b) What is the probability that none of the selected transmissions
has any manufacturing flaws?

5. The customer accounts of a certain departmental store have an


average balance of Rs 120 and a standard deviation of Rs 40.
Assuming that the account balances are normally distributed, find
out:
a) What proportion of accounts is more than Rs 150?
b) What proportion of accounts is between Rs 100 and Rs 150?
c) What proportion of accounts is between Rs 60 and Rs 90?

6. A book contains 100 misprints distributed at random throughout


its 100 pages. What is the probability that a page observed at
random contains at least two misprints?(Assume Poisson
distribution)

7. The technical team says that, on average, 3 hits out of 10 million
hits made on the software fail. The marketing department requires
a service level agreement on the Q.S stating that the probability of
occurrence of 4 request hits failing amidst 10 million requests is
less than 15.

A) Can the agreement be signed?


b) A technical upgradation at a higher cost can bring down the hit
failure rate from a mean of 3 per 10 million to 1 per 10 million. Is it
required?
8. A server crash brought a downtime of 19 minutes in a particular
environment. A user does an operation in it that gives a request
once in 3 minutes. Find the probability that the number of requests
that fail is greater than 3, assuming that the problem is uniformly
distributed.

9. The response time for an application to send a request to


another application and get back a response in an enterprise
application interconnection was monitored by a tool for 3 months
continuously. The mean response time was found to be 600 milli
seconds with a standard deviation of 200 milli seconds for the
normally distributed random variable.
a) A response time of greater than 1.0 second is flagged as a
severity. Find the probability of occurrence of a severity.
b) Find the probability of a response time 800ms.

10. Suppose the probability that a person who logs onto a particular
site in a shopping mall on the World Wide Web purchases an item is
0.20. If the site has 10 people accessing it in the next minute, what is
the probability that
a) none of the individuals purchases an item?
b) exactly 2 individuals purchase an item?
c) at least 2 individuals purchase an item?
d) at most 2 individuals purchase an item?

Summary
This unit is extremely important from the point of view of many
fascinating aspects of statistical inference that would follow in the
subsequent units. Certainly, it is expected that you master
the nitty-gritty of this unit. This unit specifically focuses on
 The definition, meaning and concepts of a probability
distribution.
 The related terms: discrete random variable and continuous
random variable.
 Discrete probability distribution and continuous probability
distribution.
 The binomial distribution and its role in business problems.
 The Poisson distribution and its uses
 The normal distribution and its role in statistical inference.
 The concept of the standard normal distribution and its role.















3
JOINT PROBABILITY
A joint probability table is a table in which all possible events
(or outcomes)for one variable are listed as row headings, all
possible events for a second variable are listed as column
headings, and the value entered in each cell of the table is the
probability of each joint occurrence. Often the probabilities in such
a table are based on observed frequencies of occurrence for the
various joint events, rather than being a priori in nature. The table
of joint occurrence frequencies which can serve as the basis for
constructing a joint probability table is called a contingency table.
Table 1a is a contingency table which describes 200 people who
entered a clothing store according to sex and age, while table 1b is
the associated joint probability table. The frequency reported in
each cell of the contingency table is converted into a probability
value by dividing by the total number of observations, in this case,
200.

1a contingency table for clothing store customers

Age            Male   Female   Total
Under 30        60      50      110
30 and over     80      10       90
Total          140      60      200

1b joint probability table for clothing store customers



Age                     Male   Female   Total
Under 30                0.30    0.25    0.55
30 and over             0.40    0.05    0.45
Marginal probability    0.70    0.30    1.00
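Table 1b can be rebuilt mechanically from the counts in table 1a (a sketch; the dictionary layout is ours):

```python
# Contingency counts from table 1a: (age, sex) -> frequency.
counts = {("under 30", "male"): 60, ("under 30", "female"): 50,
          ("30 and over", "male"): 80, ("30 and over", "female"): 10}
total = sum(counts.values())   # 200

# Joint probabilities: divide each cell by the grand total.
joint = {cell: n / total for cell, n in counts.items()}

# Marginal probabilities: sum the joint probabilities along a row/column.
p_male = sum(p for (age, sex), p in joint.items() if sex == "male")
p_under30 = sum(p for (age, sex), p in joint.items() if age == "under 30")

print(round(joint[("under 30", "male")], 2),
      round(p_male, 2), round(p_under30, 2))
```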

In the context of joint probability tables, a marginal probability is so


named because it is a marginal total of a row or a column. Where
the probability values in the cells are probabilities of joint
occurrence, the marginal probabilities are the unconditional or
simple probabilities of particular events.

Table 2a is the contingency table which presents voter reactions to


a new property tax plan according to party affiliation. a) Prepare the
joint probability table for these data. b) Determine the marginal
probabilities and indicate what they mean.

Contingency table for voter reactions to a new property tax plan

Party affiliation    In favour   Neutral   Opposed   Total
Democratic (D)          120         20        20      160
Republican (R)           50         30        60      140
Independent (I)          50         10        40      100
Total                   220         60       120      400

See table 2b
Joint probability table for voter reactions to a new property tax plan
Party affiliation    In favour   Neutral   Opposed   Marginal probability
Democratic (D)          .30         .05       .05          .40
Republican (R)          .125        .075      .15          .35
Independent (I)         .125        .025      .10          .25
Total                   .55         .15       .30         1.00

b) Each marginal probability value indicates the unconditional


probability of the event identified as the column or row heading. For
example, if a person is chosen randomly from this group of 400
voters, the probability that the person will be in favor of the tax plan
is P(F) = .55. If a voter is chosen randomly, the probability that the
voter is a Republican is P(R) = .35.
Referring to the table, determine the following probabilities:
a) P(O)
b) P(R and O)
c) P(I)
d) P(I and F)
e) P(O|R)
f) P(R|O)
g) P(R or D)
h) P(D or F)

Solution:
a) P(O) = .30 (the marginal probability)
b) P(R and O) = .15 (joint probability)
c) P(I) = .25 (marginal probability)
d) P(I and F) = .125 (joint probability)
e) P(O|R) = P(R and O)/P(R) = .15/.35 = .429
f) P(R|O) = P(R and O)/P(O) = .15/.30 = .50
g) P(R or D) = P(R) + P(D) = .35 + .40 = .75 (mutually exclusive events)
h) P(D or F) = P(D) + P(F) - P(D and F) = .40 + .55 - .30 = .65
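All of these probabilities follow from one small helper that sums the relevant cells of table 2a (a sketch; the function and key names are ours):

```python
counts = {  # contingency table 2a: (party, reaction) -> count
    ("d", "favour"): 120, ("d", "neutral"): 20, ("d", "opposed"): 20,
    ("r", "favour"): 50,  ("r", "neutral"): 30, ("r", "opposed"): 60,
    ("i", "favour"): 50,  ("i", "neutral"): 10, ("i", "opposed"): 40,
}
total = sum(counts.values())   # 400

def prob(party=None, reaction=None):
    """Joint or marginal probability read off the table."""
    return sum(n for (pa, rx), n in counts.items()
               if (party is None or pa == party)
               and (reaction is None or rx == reaction)) / total

p_o = prob(reaction="opposed")                 # (a) marginal
p_r_and_o = prob("r", "opposed")               # (b) joint
p_o_given_r = p_r_and_o / prob("r")            # (e) conditional
p_r_or_d = prob("r") + prob("d")               # (g) mutually exclusive
p_d_or_f = (prob("d") + prob(reaction="favour")
            - prob("d", "favour"))             # (h) general addition rule

print(round(p_o, 3), round(p_r_and_o, 3), round(p_o_given_r, 3),
      round(p_r_or_d, 3), round(p_d_or_f, 3))
```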

Joint probability is the probability of two or more things happening


together. f(x, y | q) where f is the probability of x and y together as
a pair, given the distribution parameters, q. Often these events are
not independent, and sadly this is often ignored. Furthermore, the
correlation coefficient itself does NOT adequately describe these
interrelationships.

Consider first the idea of a probability density or distribution:


f(x | q) where f is the probability density of x, given the distribution
parameters, q. For a normal distribution, q = (m, s²)ᵀ, where m is
the mean and s is the standard deviation. This is sometimes called
a pdf, probability density function. The integral of a pdf, the area
under the curve (corresponding to the probability) between
specified values of x, is a cdf, cumulative distribution function, F(x |
q). For discrete f, F is the corresponding summation.

A joint probability density two or more variables is called a


multivariate distribution. It is often summarized by a vector of
parameters, which may or may not be sufficient to characterize the
distribution completely. Example, the normal is summarized
(sufficiently) by a mean vector and covariance matrix.

marginal probability: f(x | q) where f is the probability density of x,


for all possible values of y, given the distribution parameters, q. The
marginal probability is determined from the joint distribution of x and
y by integrating over all values of y, called "integrating out" the
variable y. In applications of Bayes's Theorem, y is often a matrix of
possible parameter values. The figure illustrates Joint, marginal,
and conditional probability.

 Schematic showing joint, marginal, and conditional densities



conditional probability: f(x | y; q) where f is the probability of x by


itself, given specific value of variable y, and the distribution
parameters, q. (See Figure) If x and y represent events A and B,
then P(A|B) = nAB/nB , where nAB is the number of times both A and
B occur, and nB is the number of times B occurs. P(A|B) =
P(AB)/P(B), since P(AB) = nAB/N and P(B) = nB/N, so that

P(A|B) = (nAB/N) / (nB/N) = nAB/nB

Note that in general the conditional probability of A given B is not


the same as B given A. The probability of both A and B together is
P(AB), and P(A|B) X P(B) = P(AB) = P(B|A) X P(A), if both P(A) and
P(B) are non-zero. This leads to a statement of Bayes's Theorem:

P(B|A) = P(A|B) X P(B)/P(A). Conditional probability is also the


basis for statistical dependence and independence.

independence: Two variables, A and B, are independent if their


conditional probability is equal to their unconditional probability. In
other words, A and B are independent if, and only if, P(A|B)=P(A),
and P(B|A)=P(B). In engineering terms, A and B are independent if
knowing something about one tells nothing about the other. This is
the origin of the familiar, but often misused, formula P(AB) = P(A) X
P(B), which is true only when A and B are independent.

conditional independence: A and B are conditionally independent,


given C, if
Prob(A=a, B=b | C=c) = Prob(A=a | C=c) X Prob(B=b | C=c)
whenever Prob(C=c) > 0. So the joint probability of ABC, when A
and B are conditionally independent, given C, is then Prob(C) X
Prob(A | C) X Prob(B | C). A directed graph illustrating this
conditional independence is A <- C -> B.

Correlation, Regression

Introduction

At this point, you know the basics, how to look at data, compute
and interpret probabilities draw a random sample, and to do
statistical inference. Now it’s a question of applying these concepts
to see the relationships hidden within the more complex situations
of real life. This unit shows you how statistics can summarize the
relationships between two factors based on a bivariate data set with
two columns of numbers. The correlation will tell you how strong
the relationship is, and regression will help you predict one factor
from the other.

Learning objectives
After reading this unit,you will be able to:
Define correlation coefficient with its properties
Calculate correlation coefficient and interpret
Appreciate the role of regression
Formulate the regression equation and use it for estimation and
prediction.

Correlation analysis

Pearson's product-moment coefficient

Mathematical properties

The correlation coefficient ρX,Y between two random variables X
and Y with expected values μX and μY and standard deviations σX
and σY is defined as:

ρX,Y = cov(X, Y) / (σX σY) = E[(X - μX)(Y - μY)] / (σX σY)

where E is the expected value operator and cov means covariance.


Since μX = E(X), σX² = E(X²) - E²(X), and likewise for Y, we may
also write

ρX,Y = (E(XY) - E(X)E(Y)) / (√(E(X²) - E²(X)) √(E(Y²) - E²(Y)))

The correlation is defined only if both of the standard deviations are


finite and both of them are nonzero. It is a corollary of the Cauchy-
Schwarz inequality that the correlation cannot exceed 1 in absolute
value.

The correlation is 1 in the case of an increasing linear relationship,


−1 in the case of a decreasing linear relationship, and some value
in between in all other cases, indicating the degree of linear
dependence between the variables. The closer the coefficient is to
either −1 or 1, the stronger the correlation between the variables.

If the variables are independent then the correlation is 0, but the


converse is not true because the correlation coefficient detects only
linear dependencies between two variables. Here is an example:
Suppose the random variable X is uniformly distributed on the
interval from −1 to 1, and Y = X². Then Y is completely determined
by X, so that X and Y are dependent, but their correlation is zero;
they are uncorrelated. However, in the special case when X and Y
are jointly normal, uncorrelatedness is equivalent to independence.
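The X, Y = X² example is easy to check numerically: on a symmetric grid the sample covariance vanishes even though Y is a function of X (a sketch; the grid spacing is our choice):

```python
# X on a symmetric grid around 0 and Y = X^2: dependent but uncorrelated.
xs = [i / 100 for i in range(-100, 101)]   # -1.00, -0.99, ..., 1.00
ys = [x * x for x in xs]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

print(abs(cov) < 1e-12)   # prints True: zero covariance, hence zero correlation
```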

A correlation between two variables is diluted in the presence of


measurement error around estimates of one or both variables, in
which case disattenuation provides a more accurate coefficient.

The sample correlation

If we have a series of n measurements of X and Y written as xi


and yi where i = 1, 2, ..., n, then the Pearson product-moment
correlation coefficient can be used to estimate the correlation of X
and Y . The Pearson coefficient is also known as the "sample
correlation coefficient". The Pearson correlation coefficient is then
the best estimate of the correlation of X and Y . The Pearson
correlation coefficient is written:

r = Σ (xi - x̄)(yi - ȳ) / ((n - 1) sx sy)

where x̄ and ȳ are the sample means of X and Y, sx and sy are the
sample standard deviations of X and Y, and the sum runs from i = 1
to n. As with the population correlation, we may rewrite this as

r = (Σ xi yi - n x̄ ȳ) / ((n - 1) sx sy)

Again, as is true with the population correlation, the absolute value


of the sample correlation must be less than or equal to 1. Though
the above formula conveniently suggests a single-pass algorithm
for calculating sample correlations, it is notorious for its numerical
instability (see below for something more accurate).
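A numerically safer alternative is the two-pass form: compute the means first, then the centered sums (a sketch; the perfectly linear data from the example later in this section makes r come out as 1):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Two-pass sample correlation: more stable than the single-pass
    sum-of-products formula criticized above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 5, 8]
ys = [0.11, 0.12, 0.13, 0.15, 0.18]   # y = 0.10 + 0.01 x, exactly linear
print(round(pearson_r(xs, ys), 6))    # -> 1.0
```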

The square of the sample correlation coefficient, which is also


known as the coefficient of determination, is the fraction of the
variance in yi that is accounted for by a linear fit of xi to yi . This is
written

r² = 1 - s²y|x / s²y

where s²y|x is the mean squared error of a linear regression of yi on
xi by the equation y = a + bx:

s²y|x = (1/n) Σ (yi - a - bxi)²

and s²y is just the variance of y:

s²y = (1/n) Σ (yi - ȳ)²



Note that since the sample correlation coefficient is symmetric in xi


and yi, we will get the same value for a fit of yi to xi:

r² = 1 - s²x|y / s²x

This equation also gives an intuitive idea of the correlation


coefficient for higher dimensions. Just as the above described
sample correlation coefficient is the fraction of variance accounted
for by the fit of a 1-dimensional linear submanifold to a set of 2-
dimensional vectors (xi , yi ), so we can define a correlation
coefficient for a fit of an m-dimensional linear submanifold to a set
of n-dimensional vectors. For example, if we fit a plane z = a + bx +
cy to a set of data (xi , yi , zi ) then the correlation coefficient of z to
x and y is

r² = 1 - s²z|xy / s²z

The distribution of the correlation coefficient has been examined by


R. A. Fisher[1][2] and A. K. Gayen.[3]

Geometric Interpretation of correlation

The correlation coefficient can also be viewed as the cosine of the


angle between the two vectors of samples drawn from the two
random variables.

Caution: This method only works with centered data, i.e., data
which have been shifted by the sample mean so as to have an
average of zero. Some practitioners prefer an uncentered (non-
Pearson-compliant) correlation coefficient. See the example below
for a comparison.

As an example, suppose five countries are found to have gross


national products of 1, 2, 3, 5, and 8 billion dollars, respectively.
Suppose these same five countries (in the same order) are found to
have 11%, 12%, 13%, 15%, and 18% poverty. Then let x and y be
ordered 5-element vectors containing the above data: x = (1, 2, 3,
5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).

By the usual procedure for finding the angle between two vectors
(see dot product), the uncentered correlation coefficient is:

cos θ = (x · y) / (‖x‖ ‖y‖) = 2.93 / (√103 √0.0983) ≈ 0.920

Note that the above data were deliberately chosen to be perfectly


correlated: y = 0.10 + 0.01 x. The Pearson correlation coefficient
must therefore be exactly one. Centering the data (shifting x by
E(x) = 3.8 and y by E(y) = 0.138) yields x = (-2.8, -1.8, -0.8, 1.2,
4.2) and y = (-0.028, -0.018, -0.008, 0.012, 0.042), from which

cos θ = 0.308 / (√30.8 √0.00308) = 1

as expected.
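Both cosines can be verified directly (a sketch; `cosine` is our helper function):

```python
from math import sqrt

x = [1, 2, 3, 5, 8]                  # GNP, billions of dollars
y = [0.11, 0.12, 0.13, 0.15, 0.18]   # poverty rates

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

uncentered = cosine(x, y)            # not 1, despite perfect linearity

mx, my = sum(x) / len(x), sum(y) / len(y)   # 3.8 and 0.138
centered = cosine([a - mx for a in x], [b - my for b in y])

print(round(uncentered, 3), round(centered, 6))   # -> 0.921 1.0
```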

Motivation for the form of the coefficient of correlation



Another motivation for correlation comes from inspecting the


method of simple linear regression. As above, X is the vector of
independent variables, xi, and Y of the dependent variables, yi, and
a simple linear relationship between X and Y is sought, through a
least-squares method on the estimate of Y:

Ŷ = a + bX

Then, the equation of the least-squares line can be derived to be of
the form:

ŷ - ȳ = (r sy / sx)(x - x̄)

which can be rearranged in the form:

(ŷ - ȳ) / sy = r (x - x̄) / sx

where r has the familiar form mentioned above:

r = (1/(n - 1)) Σ ((xi - x̄)/sx)((yi - ȳ)/sy)

Interpretation of the size of a correlation

Several authors have offered guidelines for the interpretation of a
correlation coefficient. Cohen (1988),[4] for example, has suggested
the following interpretations for correlations in psychological
research:

Correlation   Negative           Positive
Small         −0.29 to −0.10     0.10 to 0.29
Medium        −0.49 to −0.30     0.30 to 0.49
Large         −1.00 to −0.50     0.50 to 1.00

As Cohen himself has observed, however, all such criteria are in


some ways arbitrary and should not be observed too strictly. This is
because the interpretation of a correlation coefficient depends on
the context and purposes. A correlation of 0.9 may be very low if
one is verifying a physical law using high-quality instruments, but
may be regarded as very high in the social sciences where there
may be a greater contribution from complicating factors.

Along this vein, it is important to remember that "large" and "small"


should not be taken as synonyms for "good" and "bad" in terms of
determining that a correlation is of a certain size. For example, a
correlation of 1.00 or −1.00 indicates that the two variables
analyzed are equivalent modulo scaling. Scientifically, this more
frequently indicates a trivial result than an earth-shattering one. For
example, consider discovering a correlation of 1.00 between how
many feet tall a group of people are and the number of inches from
the bottom of their feet to the top of their heads.

Non-parametric correlation coefficients

Pearson's correlation coefficient is a parametric statistic and when


distributions are not normal it may be less useful than non-
parametric correlation methods, such as Chi-square, Point biserial
correlation, Spearman's ρ and Kendall's τ. They are a little less

powerful than parametric methods if the assumptions underlying


the latter are met, but are less likely to give distorted results when
the assumptions fail.

Other measures of dependence among random variables

To get a measure for more general dependencies in the data (also


nonlinear) it is better to use the correlation ratio which is able to
detect almost any functional dependency, or mutual
information/total correlation which is capable of detecting even
more general dependencies.

The polychoric correlation is another correlation applied to ordinal


data that aims to estimate the correlation between theorised latent
variables.

Copulas and correlation

The information given by a correlation coefficient is not enough to


define the dependence structure between random variables; to fully
capture it we must consider a copula between them. The
correlation coefficient completely defines the dependence structure
only in very particular cases, for example when the cumulative
distribution functions are the multivariate normal distributions. In the
case of elliptic distributions it characterizes the (hyper-)ellipses of
equal density, however, it does not completely characterize the
dependence structure (for example, a multivariate t-distribution's
degrees of freedom determine the level of tail
dependence).

Correlation matrices

The correlation matrix of n random variables X1, ..., Xn is the n × n


matrix whose i,j entry is corr(Xi, Xj). If the measures of correlation
used are product-moment coefficients, the correlation matrix is the
same as the covariance matrix of the standardized random
variables Xi /SD(Xi) for i = 1, ..., n. Consequently it is necessarily a
positive-semidefinite matrix.

The correlation matrix is symmetric because the correlation


between Xi and Xj is the same as the correlation between Xj and Xi.
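These properties are easy to check numerically. Below is a minimal pure-Python sketch (the data and all names are made up for illustration):

```python
from statistics import fmean, pstdev

def corr(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    mx, my = fmean(xs), fmean(ys)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

def corr_matrix(variables):
    """n x n matrix whose (i, j) entry is corr(X_i, X_j)."""
    return [[corr(a, b) for b in variables] for a in variables]

# Three made-up variables observed five times each.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.1, 9.8],   # roughly twice the first
    [5.0, 4.0, 3.5, 2.0, 1.0],   # decreasing in the first
]
R = corr_matrix(data)
```

R has ones on the diagonal, is symmetric, and, being a correlation matrix, is positive semidefinite.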

Removing correlation

It is always possible to remove the correlation between zero-mean
random variables with a linear transform, even if the relationship
between the variables is nonlinear. Suppose a vector of n random
variables is sampled m times. Let X be a matrix where Xi,j is the jth
variable of sample i, and let Zr,c be an r by c matrix with every
element 1. Then

D = X - (1/m) Zm,m X
T = D ((1/m) DᵀD)^(-1/2)

where an exponent of -1/2 represents the matrix square root of the
inverse of a matrix. D is the data transformed so every random
variable has zero mean, and T is the data transformed so all
variables have zero mean, unit variance, and zero correlation with
all other variables; the covariance matrix of T will be the identity
matrix. The transformed variables will be uncorrelated, even though
they may not be independent.

If a new data sample x is a row vector of n elements, then the same
transform can be applied to x to get the transformed vectors d and t:

d = x - (1/m) Z1,m X
t = d ((1/m) DᵀD)^(-1/2)
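The matrix square root is awkward without a linear algebra library, but for the two-variable case the same decorrelation can be achieved with a Gram–Schmidt step. A minimal pure-Python sketch (function and variable names are mine):

```python
from statistics import fmean, pstdev
from math import sqrt

def decorrelate(x, y):
    """Transform two samples so each has zero mean, unit variance,
    and zero sample correlation (a Gram-Schmidt whitening step)."""
    mx, my, sx, sy = fmean(x), fmean(y), pstdev(x), pstdev(y)
    zx = [(a - mx) / sx for a in x]
    zy = [(b - my) / sy for b in y]
    r = fmean([a * b for a, b in zip(zx, zy)])
    # Remove the component of zy lying along zx, then rescale to unit variance.
    t2 = [(b - r * a) / sqrt(1 - r * r) for a, b in zip(zx, zy)]
    return zx, t2

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]   # a nonlinear function of x
t1, t2 = decorrelate(x, y)
```

After the transform, t1 and t2 are uncorrelated, but since y is a deterministic function of x they are certainly not independent, which is exactly the caveat in the text.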

Common misconceptions about correlation

Correlation and causality

The conventional dictum that "correlation does not imply causation"


means that correlation cannot be validly used to infer a causal
relationship between the variables. This dictum should not be taken
to mean that correlations cannot indicate causal relations.
However, the causes underlying the correlation, if any, may be
indirect and unknown. Consequently, establishing a correlation
between two variables is not a sufficient condition to establish a
causal relationship (in either direction).

Here is a simple example: hot weather may cause both crime and
ice-cream purchases. Therefore crime is correlated with ice-cream
purchases. But crime does not cause ice-cream purchases and ice-
cream purchases do not cause crime.

A correlation between age and height in children is fairly causally


transparent, but a correlation between mood and health in people is
less so. Does improved mood lead to improved health? Or does
good health lead to good mood? Or does some other factor
underlie both? Or is it pure coincidence? In other words, a
correlation can be taken as evidence for a possible causal
relationship, but cannot indicate what the causal relationship, if any,
might be.

Correlation and linearity

Four sets of data with the same correlation of 0.81

While Pearson correlation indicates the strength of a linear


relationship between two variables, its value alone may not be
sufficient to evaluate this relationship, especially in the case where
the assumption of normality is incorrect.

The image on the right shows scatterplots of Anscombe's quartet, a


set of four different pairs of variables created by Francis
Anscombe.[5] The four y variables have the same mean (7.5),
variance (4.12), correlation (0.81) and regression line (y =
3 + 0.5x). However, as can be seen on the plots, the distribution of
the variables is very different. The first one (top left) seems to be
distributed normally, and corresponds to what one would expect
when considering two variables correlated and following the
assumption of normality. The second one (top right) is not
distributed normally; while an obvious relationship between the two
variables can be observed, it is not linear, and the Pearson
correlation coefficient is not relevant. In the third case (bottom left),
the linear relationship is perfect, except for one outlier which exerts
enough influence to lower the correlation coefficient from 1 to 0.81.
Finally, the fourth example (bottom right) shows another example
when one outlier is enough to produce a high correlation coefficient,
even though the relationship between the two variables is not
linear.

These examples indicate that the correlation coefficient, as a


summary statistic, cannot replace the individual examination of the
data.

Computing correlation accurately in a single pass

The following algorithm (in pseudocode) will estimate correlation


with good numerical stability

sum_sq_x = 0
sum_sq_y = 0
sum_coproduct = 0
mean_x = x[1]
mean_y = y[1]
for i in 2 to N:
    sweep = (i - 1.0) / i
    delta_x = x[i] - mean_x
    delta_y = y[i] - mean_y
    sum_sq_x += delta_x * delta_x * sweep
    sum_sq_y += delta_y * delta_y * sweep
    sum_coproduct += delta_x * delta_y * sweep
    mean_x += delta_x / i
    mean_y += delta_y / i
pop_sd_x = sqrt( sum_sq_x / N )
pop_sd_y = sqrt( sum_sq_y / N )
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)

For an enlightening experiment, check the correlation of


{900,000,000 + i for i=1...100} with {900,000,000 - i for i=1...100},
perhaps with a few values modified. Poor algorithms will fail.
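The pseudocode translates almost line-for-line into Python (the 1-based indices become 0-based); here is a sketch, including the large-offset data suggested above as a check:

```python
from math import sqrt

def stable_correlation(x, y):
    """One-pass, numerically stable Pearson correlation (Welford-style)."""
    n = len(x)
    mean_x, mean_y = x[0], y[0]
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    for i in range(2, n + 1):
        sweep = (i - 1.0) / i
        delta_x = x[i - 1] - mean_x
        delta_y = y[i - 1] - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / i
        mean_y += delta_y / i
    pop_sd_x = sqrt(sum_sq_x / n)
    pop_sd_y = sqrt(sum_sq_y / n)
    return (sum_coproduct / n) / (pop_sd_x * pop_sd_y)

# The experiment from the text: huge offsets, perfect negative relation.
xs = [900_000_000 + i for i in range(1, 101)]
ys = [900_000_000 - i for i in range(1, 101)]
```

A naive two-pass formula on raw sums can lose all precision on such data; the sweep-based update keeps the result at -1 as expected.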

Autocorrelation

A plot showing 100 random numbers with a "hidden" sine function,


and an autocorrelation (correlogram) of the series on the bottom.

Autocorrelation is a mathematical tool used frequently in signal


processing for analyzing functions or series of values, such as time
domain signals. Informally, it is the strength of a relationship
between observations as a function of the time separation between
them. More precisely, it is the cross-correlation of a signal with
itself. Autocorrelation is useful for finding repeating patterns in a
signal, such as determining the presence of a periodic signal which
has been buried under noise, or identifying the missing
fundamental frequency in a signal implied by its harmonic
frequencies.
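To make the idea concrete, the sketch below computes a normalized sample autocorrelation and uses it to recover the period of a synthetic sine wave (noise is omitted for clarity; all names are mine):

```python
from math import sin, pi
from statistics import fmean

def autocorr(x, lag):
    """Normalized sample autocorrelation of the sequence x at a given lag."""
    m = fmean(x)
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag))
    return cov / var

# A sine wave with a period of 20 samples.
signal = [sin(2 * pi * t / 20) for t in range(200)]

# Search for the lag with the strongest autocorrelation, away from
# lag 0 where the value is trivially 1.
period = max(range(10, 31), key=lambda k: autocorr(signal, k))
```

The recovered period matches the one built into the signal; with real, noisy data one would scan all lags and look for the first pronounced peak after lag 0.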

Uses of correlation

The uses of correlation are as follows:

Economic theory and business studies show relationships between
variables such as price and quantity demanded, advertising
expenditure and sales promotion measures, etc.

Correlation analysis helps in deriving precisely the degree and the


direction of such relationships.

The effect of correlation is to reduce the range of uncertainty of our
prediction. The prediction based on correlation analysis will be more
reliable and nearer to reality.

Regression analysis

In regression analysis using time series data, autocorrelation of the


residuals ("error terms", in econometrics) is a problem.

Autocorrelation violates the OLS assumption that the error terms


are uncorrelated. While it does not bias the OLS coefficient
estimates, the standard errors tend to be underestimated (and the
t-scores overestimated).

The traditional test for the presence of first-order autocorrelation is


the Durbin–Watson statistic or, if the explanatory variables include
a lagged dependent variable, Durbin's h statistic. A more flexible
test, covering autocorrelation of higher orders and applicable
whether or not the regressors include lags of the dependent
variable, is the Breusch–Godfrey test. This involves an auxiliary
regression, wherein the residuals obtained from estimating the
model of interest are regressed on (a) the original regressors and
(b) k lags of the residuals, where k is the order of the test. The
simplest version of the test statistic from this auxiliary regression is
TR², where T is the sample size and R² is the coefficient of
determination. Under the null hypothesis of no autocorrelation, this
statistic is asymptotically distributed as χ² with k degrees of
freedom.
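The Durbin–Watson statistic itself is straightforward to compute once residuals are in hand. A sketch (the residual series are made up to show the two extremes; this is not a full test with significance tables):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    d is near 2 when residuals are uncorrelated, well below 2 under
    positive first-order autocorrelation, and well above 2 under
    negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Made-up residual series illustrating the two extremes.
positively_correlated = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
```

For the smoothly declining residuals d is far below 2; for the sign-alternating residuals it is far above 2.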

Responses to nonzero autocorrelation include generalized least


squares and Newey–West standard errors.

Applications

 One application of autocorrelation is the measurement of


optical spectra and the measurement of very-short-duration
light pulses produced by lasers, both using optical
autocorrelators.

 In optics, normalized autocorrelations and cross-correlations


give the degree of coherence of an electromagnetic field.

 In signal processing, autocorrelation can give information


about repeating events like musical beats or pulsar
frequencies, though it cannot tell the position in time of the
beat. It can also be used to estimate the pitch of a musical
tone.

Correlation Coefficient

How well does your regression equation truly represent


your set of data?
One of the ways to determine the answer to this question is to
examine the correlation coefficient and the coefficient of
determination.

The correlation coefficient, r, and the coefficient of determination,
r², will appear on the screen that shows the regression equation
information (be sure the Diagnostics are turned on --- 2nd Catalog
(above 0), arrow down to DiagnosticOn, press ENTER twice.)
In addition to appearing with the regression


information, the values r and r 2 can be found under
VARS, #5 Statistics → EQ #7 r and #8 r 2 .
Correlation Coefficient, r :
The quantity r, called the linear correlation coefficient, measures
the strength and
the direction of a linear relationship between two variables. The
linear correlation
coefficient is sometimes referred to as the Pearson product
moment correlation coefficient in
honor of its developer Karl Pearson.
The mathematical formula for computing r is:

r = [n Σxy - (Σx)(Σy)] / √{[n Σx² - (Σx)²][n Σy² - (Σy)²]}

where n is the number of pairs of data.


(Aren't you glad you have a graphing calculator that
computes this formula?)
The value of r is such that -1 ≤ r ≤ +1. The + and – signs are
used for positive
linear correlations and negative linear correlations,
respectively.
Positive correlation: If x and y have a strong positive linear
correlation, r is close
to +1. An r value of exactly +1 indicates a perfect positive fit.
Positive values
indicate a relationship between x and y variables such that as
values for x increases,
values for y also increase.
Negative correlation: If x and y have a strong negative linear
correlation, r is close
to -1. An r value of exactly -1 indicates a perfect negative fit.
Negative values
indicate a relationship between x and y such that as values for x
increase, values
for y decrease.
No correlation: If there is no linear correlation or a weak linear
correlation, r is

close to 0. A value near zero means that there is a random,


nonlinear relationship
between the two variables
Note that r is a dimensionless quantity; that is, it does not
depend on the units
employed.
A perfect correlation of ± 1 occurs only when the data points all
lie exactly on a
straight line. If r = +1, the slope of this line is positive. If r = -1,
the slope of this
line is negative.
A correlation greater than 0.8 is generally described as strong,
whereas a correlation
less than 0.5 is generally described as weak. These values can
vary based upon the
"type" of data being examined. A study utilizing scientific data
may require a stronger
correlation than a study using social science data.

Coefficient of Determination, r 2 or R2 :

The coefficient of determination, r 2, is useful because it gives


the proportion of
the variance (fluctuation) of one variable that is predictable from
the other variable.
It is a measure that allows us to determine how certain one can
be in making
predictions from a certain model/graph.
The coefficient of determination is the ratio of the explained
variation to the total
variation.
The coefficient of determination is such that 0 ≤ r² ≤ 1, and
denotes the strength
of the linear association between x and y.
The coefficient of determination represents the percent of the
data that is the closest
to the line of best fit. For example, if r = 0.922, then r 2 = 0.850,
which means that
85% of the total variation in y can be explained by the linear
relationship between x
and y (as described by the regression equation). The other
15% of the total variation
in y remains unexplained.
The coefficient of determination is a measure of how well the
regression line
represents the data. If the regression line passes exactly
through every point on the
scatter plot, it would be able to explain all of the variation. The

further the line is


away from the points, the less it is able to explain.
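The "ratio of explained to total variation" description can be verified directly: fit the least squares line, compute the variation of the fitted values, and compare the ratio with r². A sketch on made-up data:

```python
from math import sqrt

# Made-up (x, y) data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

r = sxy / sqrt(sxx * syy)              # correlation coefficient

# Least squares line: y-hat = b0 + b1 * x
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]

explained = sum((v - ybar) ** 2 for v in yhat)   # explained variation
total = syy                                      # total variation
ratio = explained / total
```

The ratio of explained to total variation coincides with the square of the correlation coefficient, as the text states.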

Correlation Coefficient

A correlation coefficient is a number between -1 and 1 which


measures the degree to which two variables are linearly related. If
there is perfect linear relationship with positive slope between the
two variables, we have a correlation coefficient of 1; if there is
positive correlation, whenever one variable has a high (low) value,
so does the other. If there is a perfect linear relationship with
negative slope between the two variables, we have a correlation
coefficient of -1; if there is negative correlation, whenever one
variable has a high (low) value, the other has a low (high) value. A
correlation coefficient of 0 means that there is no linear relationship
between the variables.

There are a number of different correlation coefficients that might


be appropriate depending on the kinds of variables being studied.

Pearson's Product Moment Correlation Coefficient

Pearson's product moment correlation coefficient, usually denoted


by r, is one example of a correlation coefficient. It is a measure of
the linear association between two variables that have been
measured on interval or ratio scales, such as the relationship
between height in inches and weight in pounds. However, it can be
misleadingly small when there is a relationship between the
variables but it is a non-linear one.

There are procedures, based on r, for making inferences about the


population correlation coefficient. However, these make the implicit
assumption that the two variables are jointly normally distributed.
When this assumption is not justified, a non-parametric measure
such as the Spearman Rank Correlation Coefficient might be more
appropriate.

Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient is one example of a


correlation coefficient. It is usually calculated on occasions when it
is not convenient, economic, or even possible to give actual values
to variables, but only to assign a rank order to instances of each
variable. It may also be a better indicator that a relationship exists
between two variables when the relationship is non-linear.

Commonly used procedures, based on the Pearson's Product


Moment Correlation Coefficient, for making inferences about the
population correlation coefficient make the implicit assumption that

the two variables are jointly normally distributed. When this


assumption is not justified, a non-parametric measure such as the
Spearman Rank Correlation Coefficient might be more appropriate.

If the data have outliers, Pearson's correlation coefficient will be
greatly affected. Also, Pearson's correlation coefficient only measures
linear relationships between variables. There is another alternative.
Spearman's rank correlation coefficient does not use the actual
observed data, but the ranks of the data, to compute a correlation
coefficient. That is, replace the smallest X value with a 1, the next
smallest with a 2, and so on. Repeat the same procedure for the Y
values. Then instead of having our data (X,Y) in the form

(X1, Y1), (X2, Y2), ..., (Xn, Yn)

they will be as follows

(R(X1), R(Y1)), (R(X2), R(Y2)), ..., (R(Xn), R(Yn))

where R(Xi) denotes the rank of Xi among the X values. The formula
for the Spearman coefficient is then just Pearson's formula applied to
the ranks.

Correlation Coefficients: Examples

Spearman's Rank Correlation Coefficient

In calculating this coefficient, we use the Greek letter ρ ('rho'),
written below simply as r.

The formula used to calculate this coefficient is:

r = 1 - (6 Σd²) / n(n² - 1)

To illustrate this, consider the following worked example:


Researchers at the European Centre for Road Safety Testing are
trying to find out how the age of cars affects their braking capability.
They test a group of ten cars of differing ages and find out the
minimum stopping distances that the cars can achieve. The results
are set out in the table below:

Table 1: Car ages and stopping distances

Car   Age (months)   Minimum stopping at 40 kph (metres)
A     9              28.4
B     15             29.3
C     24             37.6
D     30             36.2
E     38             36.5
F     46             35.3
G     53             36.2
H     60             44.1
I     64             44.8
J     76             47.2

These figures form the basis for the scatter diagram, below, which
shows a reasonably strong positive correlation - the older the car,
the longer the stopping distance.

Graph 1: Car age and Stopping distance (data from Table 1


above)

To process this information we must, firstly, place the ten pieces of


data into order, or rank them according to their age and ability to
stop. It is then possible to process these ranks.

Table 2: Ranked data from Table 1 above

Car   Age (months)   Stopping at 40 kph (metres)   Age rank   Stopping rank
A     9              28.4                          1          1
B     15             29.3                          2          2
C     24             37.6                          3          7
D     30             36.2                          4          4.5
E     38             36.5                          5          6
F     46             35.3                          6          3
G     53             36.2                          7          4.5
H     60             44.1                          8          8
I     64             44.8                          9          9
J     76             47.2                          10         10

Notice that the ranking is done here in such a way that the youngest
car and the best stopping performance are rated top and vice versa.
There is no strict rule here other than the need to be consistent in
your rankings. Notice also that there were two values the same in
terms of the stopping performance of the cars tested. They occupy
'tied ranks' and must share, in this case, ranks 4 and 5. This means
they are each ranked as 4.5, which is the mean average of the two
ranking places. It is important to remember that this works despite
the number of items sharing tied ranks. For instance, if five items
shared ranks 5, 6, 7, 8 and 9, then they would each be ranked 7 -
the mean of the tied ranks.

Now we can start to process these ranks to produce the following


table:

Table 3: Differential analysis of data from Table 2

Car   Age (mths)   Stopping distance   Age rank   Stopping rank   d      d²
A     9            28.4                1          1               0      0
B     15           29.3                2          2               0      0
C     24           37.6                3          7               4      16
D     30           36.2                4          4.5             0.5    0.25
E     38           36.5                5          6               1      1
F     46           35.3                6          3               -3     9
G     53           36.2                7          4.5             -2.5   6.25
H     60           44.1                8          8               0      0
I     64           44.8                9          9               0      0
J     76           47.2                10         10              0      0
                                                          Σd² = 32.5

Note that the two extra columns introduced into the new table are
Column 6, 'd', the difference between stopping distance rank and
age rank, and Column 7, 'd²', the square of that difference. The
squared figures are summed at the foot of Column 7.


Calculation of Spearman Rank Correlation Coefficient (r) is:
r = 1 - (6 Σd²) / n(n² - 1)
Number in sample (n) = 10
r = 1 - (6 x 32.5) / (10 x 99)
r = 1 - (195 / 990)
r = 1 - 0.197
r = 0.803
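The hand computation above can be verified in a few lines of Python; the ranking helper averages tied ranks exactly as in Table 2 (a sketch, fine for small samples, with names of my choosing):

```python
def average_ranks(values):
    """Rank from 1 upward; tied values share the mean of their rank places."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    """Spearman rank correlation via r = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n * n - 1))

# The car data from Table 1.
age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]          # months
stopping = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2,  # metres
            44.1, 44.8, 47.2]
```

spearman(age, stopping) reproduces the 0.803 computed above, with Σd² = 32.5.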

What does this tell us? When interpreting the Spearman Rank
Correlation Coefficient, it is usually enough to say that:

 for values of r of 0.9 to 1, the correlation is very strong.


 for values between 0.7 and 0.9, correlation is strong.
 and for values between 0.5 and 0.7, correlation is moderate.

This is the case whether r is positive or negative.


In our case of car ages and stopping distance performance, we can
say that there is a strong correlation between the two variables.

Pearson's or Product-Moment Correlation Coefficient

The Pearson Correlation Coefficient is denoted by the symbol r. In
computational form, its formula is:

r = [n Σxy - (Σx)(Σy)] / √{[n Σx² - (Σx)²][n Σy² - (Σy)²]}

Going back to the original data we recorded from the European


Centre for Road Safety Testing, the calculation needed for us to
work out the Product-Moment Correlation Coefficient is best set out
as in the table that follows.

Note that in the table below,


x = age of car
y = stopping distance

From this, the other notation should be obvious.

x        y       x²       y²          xy
9        28.4    81       806.56      255.6
15       29.3    225      858.49      439.5
24       37.6    576      1413.76     902.4
30       36.2    900      1310.44     1086
38       36.5    1444     1332.25     1387
46       35.3    2116     1246.09     1623.8
53       36.2    2809     1310.44     1918.6
60       44.1    3600     1944.81     2646
64       44.8    4096     2007.04     2867.2
76       47.2    5776     2227.84     3587.2
Totals   415     375.6    21623       14457.72    16713.3

x-bar = 415/10 = 41.5
y-bar = 375.6/10 = 37.56

r = (10 x 16713.3 - 415 x 375.6) / √{(10 x 21623 - 415²)(10 x 14457.72 - 375.6²)}
r = 11259 / √(44005 x 3501.84)
r = 11259 / 12413.6
r = 0.91
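The arithmetic above is easy to check mechanically; this sketch applies the same raw-score formula to the car data (the function name is mine):

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r via the raw-score (computational) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# The car data from Table 1.
age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
stopping = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]
```

pearson(age, stopping) reproduces the r of about 0.91 obtained by hand.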

What does this tell us?

To interpret the value of r you need to follow these guidelines:

 r always lies in the range -1 to +1. If it lies close to either of


these two values, then the dispersion of the scattergram points
is small and therefore a strong correlation exists between the
two variables.
 If r equals exactly -1 or +1, then the correlation is perfect and all
the points on the scattergram lie on the line of best fit (otherwise
known as the regression line.) If r is close to
0, the dispersion is large and the variables are uncorrelated.
The positive or negative sign on the value of r indicates
positive or negative correlation.

So in the above case, there is evidence of strong positive correlation


between stopping distance and age of car; in other words, the older
the car, the longer the distance we could expect it to take to stop.

Illustration:

Let's say that we want to track the progress of a group of new


employees of a large service organisation. We think we can judge
the effectiveness of our induction and initial training scheme by
analysing employee competence in weeks one, four and at the end
of the six months.

Let's say that Human Resource managers in their organisation have



been urging the company to commit more resources to induction


and basic training. The company now wishes to know which of the
two assessments - the new employee's skills on entry or after week
four - provides a better guide to the employee's performance after
six months. Although there is a small sample here, let's assume that
it is accurate.

The raw data is given in the table below:

Name Skills on entry Skills at week Skills at 6 mths


% score 4 % score % score
ab 75 75 75
bc 72 69 76
cd 82 76 83
de 78 77 65
ef 86 79 85
fg 76 65 79
gh 86 82 65
hi 89 78 75
ij 83 70 80
jk 65 71 70

Copy this information onto a fresh Excel worksheet, putting the


names in Column A, the entry test results in Column B, the week
four test Marks in Column D, and the six month test scores in
Column F.

When you have entered the information, select the three number
columns (do not include any cells with words in them). Go to the
Data Analysis option on the Tools menu, select from that Data
Analysis menu the item Correlation (note that if the Data Analysis
option is not on the Tools menu you have to add it in).

When you get the Correlation menu, enter in the first Input Range
box the column of cells containing the dependent variables you wish
to analyze (D3 to D12 if your spreadsheet looks like TimeWeb's).
Next, enter into the second input box the column of cells that contain
the independent variables (B3 to B12, again if your sheet resembles
TimeWeb's).

Then click the mouse pointer in the circle to the left of the Output
Range label (unless there is a black dot in it already), and click the
left mouse button in the Output Range box. Then enter the name of
cell where you want the top left corner of the correlation table to
appear (e.g., $A$14). Then click OK.

After a second or two, the Correlation Table should appear giving


you the correlation between all the different pairs of data. We are
interested in the correlation between Column B (the first column in
the Table) and Column D (the third column in the table). The
correlation between Column C (the second column in the Table) and
Column D, can be approached in the same way.

Which of these two is the better predictor of success according to
this study? How reliable is it?

Expected Answer:

The correlation between the Entry Mark and the Final Mark is 0.23;
the correlation between the four week test and the Final Mark is
0.28. Thus, both of the tests have a positive correlation to the Final
(6 month) Test; the entry test has a slightly weaker positive
correlation with the Final Mark, than the Four Week Test. However,
both figures are so low, that the correlation is minimal. The skills
measured by the Entry test account for about 5 per cent of the skills
measured by the Six Month Mark. This figure is obtained by using
the R-Squared result and expressing it as a percentage.

Beware!

It's vital to remember that a correlation, even a very strong one, does
not mean we can make a conclusion about causation. If, for
example, we find a very high correlation between the weight of a
baby at birth and educational achievement at age 25, we may make
some predictions about the numbers of people staying on at
university to study for post-graduate qualifications. Or we may urge
mothers-to-be to take steps to boost the weight of the unborn baby,
because the heavier their baby the higher their baby's educational
potential, but we should be aware that the correlation, in itself, is no
proof of these assertions.

This is a really important principle: correlation is not necessarily


proof of causation. It indicates a relationship which may be based on
cause and effect, but then again, it may not be. If weight at birth is a
major cause of academic achievement, then we can expect that
variations in birth weight will cause changes in achievement. The
reverse, however is not necessarily true. If any two variables are
correlated, we cannot automatically assume that one is the cause of
the other.

The point of causation is best illustrated perhaps, using the example


of AIDS.

A very high correlation exists between HIV infection and cases of



AIDS. This has caused many researchers to believe that HIV is the
principal cause of AIDS. This belief has led to most of the money for
AIDS research going into investigating HIV.

But the cause of AIDS is still not clear. Some people (especially, not
surprisingly, those suffering from AIDS) have argued vehemently
that investigating HIV instead of AIDS is a mistake. They say that
something else is the real cause. This is the area, they argue, that
requires greater research funding. More money should be going into
AIDS research rather than studies into HIV.

Correlation and Linearity

Correlation coefficients measure the strength of association


between two variables. The most common correlation coefficient,
called the Pearson product-moment correlation coefficient,
measures the strength of the linear association between variables.

In this tutorial, when we speak simply of a correlation coefficient,


we are referring to the Pearson product-moment correlation.

How to Calculate a Correlation Coefficient

A formula for computing a sample correlation coefficient (r) is given


below.

Sample correlation coefficient. The correlation r between two


variables is:

r = [ 1 / (n - 1) ] * Σ { [ (xi - x) / sx ] * [ (yi - y) / sy ] }

where n is the number of observations in the sample, Σ is the


summation symbol, xi is the x value for observation i, x is the mean
x value, yi is the y value for observation i, y is the mean y value, sx
is the sample standard deviation of x, and sy is the sample standard
deviation of y.

A formula for computing a population correlation coefficient (ρ) is


given below.

Population correlation coefficient. The correlation ρ between two


variables is:

ρ = [ 1 / N ] * Σ { [ (Xi - μX) / σx ] * [ (Yi - μY) / σy ] }

where N is the number of observations in the population, Σ is the


summation symbol, Xi is the X value for observation i, μX is the

population mean for variable X, Yi is the Y value for observation i,


μY is the population mean for variable Y, σx is the standard
deviation of X, and σy is the standard deviation of Y.

Fortunately, you will rarely have to compute a correlation coefficient


by hand. Many software packages (e.g., Excel) and most graphing
calculators have a correlation function that will do the job for you.

Note: Sometimes, it is not clear whether a software package or a


graphing calculator uses a population correlation coefficient or a
sample correlation coefficient. For example, a casual user might not
realize that Microsoft uses a population correlation coefficient (ρ)
for the Pearson() function in its Excel software.
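The sample formula above can also be implemented directly, even without a statistics package; the sketch below includes the self-correlation sanity check (the data are made up):

```python
from math import sqrt

def sample_r(x, y):
    """r = [1/(n-1)] * sum of standardized-score products, using
    sample (n-1) standard deviations."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
    return sum(((a - xbar) / sx) * ((b - ybar) / sy)
               for a, b in zip(x, y)) / (n - 1)

# Made-up observations.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 4.0, 7.0, 8.0]
```

A variable correlated with itself gives r = 1, and these nearly collinear data give an r just below 1.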

How to Interpret a Correlation Coefficient

The sign and the absolute value of a correlation coefficient describe


the direction and the magnitude of the relationship between two
variables.

 The value of a correlation coefficient ranges between -1 and


1.
 The greater the absolute value of a correlation coefficient,
the stronger the linear relationship.
 The strongest linear relationship is indicated by a correlation
coefficient of -1 or 1.
 The weakest linear relationship is indicated by a correlation
coefficient equal to 0.
 A positive correlation means that if one variable gets bigger,
the other variable tends to get bigger.
 A negative correlation means that if one variable gets bigger,
the other variable tends to get smaller.

Keep in mind that the Pearson product-moment correlation


coefficient only measures linear relationships. Therefore, a
correlation of 0 does not mean zero relationship between two
variables; rather, it means zero linear relationship. (It is possible for
two variables to have zero linear relationship and a strong
curvilinear relationship at the same time.)

Scatterplots and Correlation Coefficients

The scatterplots below show how different patterns of data produce


different degrees of correlation.

[Scatterplots: maximum positive correlation (r = 1.0); strong positive
correlation (r = 0.80); zero correlation (r = 0); minimum negative
correlation (r = -1.0); moderate negative correlation (r = -0.43);
strong correlation with outlier (r = 0.71)]

Several points are evident from the scatterplots.

 When the slope of the line in the plot is negative, the


correlation is negative; and vice versa.
 The strongest correlations (r = 1.0 and r = -1.0 ) occur when
data points fall exactly on a straight line.
 The correlation becomes weaker as the data points become
more scattered.
 If the data points fall in a random pattern, the correlation is
equal to zero.
 Correlation is affected by outliers. Compare the first
scatterplot with the last scatterplot. The single outlier in the
last plot greatly reduces the correlation (from 1.00 to 0.71).

Test Your Understanding of This Lesson

Problem 1

A national consumer magazine reported the following correlations.

 The correlation between car weight and car reliability is -


0.30.
 The correlation between car weight and annual maintenance
cost is 0.20.

Which of the following statements are true?



I. Heavier cars tend to be less reliable.


II. Heavier cars tend to cost more to maintain.
III. Car weight is related more strongly to reliability than to
maintenance cost.

(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III

Solution

The correct answer is (E). The correlation between car weight and
reliability is negative. This means that reliability tends to decrease
as car weight increases. The correlation between car weight and
maintenance cost is positive. This means that maintenance costs
tend to increase as car weight increases.

The strength of a relationship between two variables is indicated by


the absolute value of the correlation coefficient. The correlation
between car weight and reliability has an absolute value of 0.30.
The correlation between car weight and maintenance cost has an
absolute value of 0.20. Therefore, the relationship between car
weight and reliability is stronger than the relationship between car
weight and maintenance cost.

Least Squares

The method of least squares is a criterion for fitting a specified


model to observed data. For example, it is the most commonly
used method of defining a straight line through a set of points on a
scatterplot.

Least Squares Line

Recall the equation of a line from algebra:

Y = b0 + b1X

(You may have seen Y = mX + b; we are going to change notation
slightly.) Above, b1 is called the slope of the line and b0 is the y-
intercept. The slope measures the amount Y increases when X
increases by one unit. The Y-intercept is the value of Y when X = 0.

Our objective is to fit a straight line to points on a scatterplot that do


not lie along a straight line (see the figure above). So we want to
find and such that the line fits the data as well as
mca-5 341

possible. First, we need to define what we mean by a "best" fit. We
want a line that is in some sense closest to all of the data points
simultaneously. In statistics, we define a residual, ei, as the vertical
distance between a point and the line,

ei = yi - ŷi

(see the vertical line in the figure). Since residuals can be positive or
negative, we will square them to remove the sign. By adding up all
of the squared residuals, we get a measure of how far away from
the data our line is. Thus, the "best" line will be one which has the
minimum sum of squared residuals, i.e., min Σei². This method
of finding a line is called least squares.

The formulas for the slope and intercept of the least squares line
are

b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)² ]
b0 = y - b1 * x

where x and y are the means of X and Y. Using algebra, we can
express the slope as

b1 = r * (sy / sx)

where r is the correlation between X and Y, and sx and sy are their
standard deviations.
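As a sketch of how these formulas work in practice (plain Python; the function names are illustrative, not part of the lesson), the two expressions for the slope can be computed and checked against each other:

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # b1 = sum((xi - x)(yi - y)) / sum((xi - x)^2)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # the line passes through (mean of X, mean of Y)
    return b0, b1

def slope_from_correlation(xs, ys):
    """The algebraically equivalent form b1 = r * (sy / sx)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sx = (sum((x - x_bar) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - y_bar) ** 2 for y in ys) / (n - 1)) ** 0.5
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / ((n - 1) * sx * sy)
    return r * sy / sx

print(least_squares([1, 2, 3], [2, 4, 6]))  # (0.0, 2.0): a perfect fit, y = 2x
```

Both functions return the same slope, because r * (sy / sx) simplifies algebraically to Σ(xi - x)(yi - y) / Σ(xi - x)².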

Least Squares Linear Regression

In a cause and effect relationship, the independent variable is the


cause, and the dependent variable is the effect. Least squares
linear regression is a method for predicting the value of a
dependent variable Y, based on the value of an independent
variable X.

In this tutorial, we focus on the case where there is only one


independent variable. This is called simple regression (as opposed
to multiple regression, which handles two or more independent
variables).

Tip: The next lesson presents a simple regression example that


shows how to apply the material covered in this lesson. Since this
lesson is a little dense, you may benefit by also reading the next
lesson.

Prediction

Given a least squares line, we can use it for prediction. The
equation for prediction is simply the equation for a line with b0 and
b1 replaced by their estimates. The predicted value of y is
traditionally denoted ŷ ("y-hat"). Thus, suppose we are given a
least squares equation

where x is the age of a child in months and y is the height of that
child, and let's further assume that the range of x is from 1 to 24
months. To predict the height of an 18 month old child, we just plug
in x = 18.

What if we wanted to know the height of a child at age 32 months?


From our least squares equation, we could get a prediction.
However, we're predicting outside of the range of our x values. This
is called extrapolation and is not "legal" in good statistics unless
you are very sure that the line is valid. When we predict within the
range of our x values, this is known as interpolation; this is the way
we want to predict.

Prerequisites for Regression

Simple linear regression is appropriate when the following


conditions are satisfied.

 The dependent variable Y has a linear relationship to the


independent variable X. To check this, make sure that the
XY scatterplot is linear and that the residual plot shows a
random pattern.

 For each value of X, the probability distribution of Y has the


same standard deviation σ. When this condition is satisfied,
the variability of the residuals will be relatively constant
across all values of X, which is easily checked in a residual
plot.

 For any given value of X,

 The Y values are independent, as indicated by a


random pattern on the residual plot.
 The Y values are roughly normally distributed (i.e.,
symmetric and unimodal). A little skewness is ok if the
sample size is large. A histogram or a dotplot will
show the shape of the distribution.

The Least Squares Regression Line



Linear regression finds the straight line, called the least squares
regression line or LSRL, that best represents observations in a
bivariate data set. Suppose Y is a dependent variable, and X is an
independent variable. The population regression line is:

Y = Β0 + Β1X

where Β0 is a constant, Β1 is the regression coefficient, X is the


value of the independent variable, and Y is the value of the
dependent variable.

Given a random sample of observations, the population regression


line is estimated by:

ŷ = b0 + b1x

where b0 is a constant, b1 is the regression coefficient, x is the


value of the independent variable, and ŷ is the predicted value of
the dependent variable.

How to Define a Regression Line

Normally, you will use a computational tool - a software package


(e.g., Excel) or a graphing calculator - to find b0 and b1. You enter
the X and Y values into your program or calculator, and the tool
solves for each parameter.

In the unlikely event that you find yourself on a desert island without
a computer or a graphing calculator, you can solve for b0 and b1 "by
hand". Here are the equations.

b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)² ]

b1 = r * (sy / sx)
b0 = y - b1 * x

where b0 is the constant in the regression equation, b1 is the


regression coefficient, r is the correlation between x and y, xi is the
X value of observation i, yi is the Y value of observation i, x is the
mean of X, y is the mean of Y, sx is the standard deviation of X, and
sy is the standard deviation of Y

Properties of the Regression Line

When the regression parameters (b0 and b1) are defined as


described above, the regression line has the following properties.

 The line minimizes the sum of squared differences between


observed values (the y values) and predicted values (the ŷ
values computed from the regression equation).
 The regression line passes through the mean of the X values
(x) and the mean of the Y values (y).
 The regression constant (b0) is equal to the y intercept of the
regression line.
 The regression coefficient (b1) is the average change in the
dependent variable (Y) for a 1-unit change in the
independent variable (X). It is the slope of the regression
line.

The least squares regression line is the only straight line that has
all of these properties.

The Coefficient of Determination

The coefficient of determination (denoted by R2) is a key output


of regression analysis. It is interpreted as the proportion of the
variance in the dependent variable that is predictable from the
independent variable.

 The coefficient of determination ranges from 0 to 1.


 An R2 of 0 means that the dependent variable cannot be
predicted from the independent variable.
 An R2 of 1 means the dependent variable can be predicted
without error from the independent variable.
 An R2 between 0 and 1 indicates the extent to which the
dependent variable is predictable. An R2 of 0.10 means that
10 percent of the variance in Y is predictable from X; an R2
of 0.20 means that 20 percent is predictable; and so on.

The formula for computing the coefficient of determination for a


linear regression model with one independent variable is given
below.

Coefficient of determination. The coefficient of determination (R2)


for a linear regression model with one independent variable is:

R2 = { ( 1 / N ) * Σ [ (xi - x) * (yi - y) ] / (σx * σy ) }²

where N is the number of observations used to fit the model, Σ is


the summation symbol, xi is the x value for observation i, x is the
mean x value, yi is the y value for observation i, y is the mean y
value, σx is the standard deviation of x, and σy is the standard
deviation of y.

Standard Error

The standard error about the regression line (often denoted by


SE) is a measure of the average amount that the regression
equation over- or under-predicts. The higher the coefficient of
determination, the lower the standard error; and the more accurate
predictions are likely to be.

Test Your Understanding of This Lesson

Problem 1

A researcher uses a regression equation to predict home heating


bills (dollar cost), based on home size (square feet). The correlation
between predicted bills and home size is 0.70. What is the correct
interpretation of this finding?

(A) 70% of the variability in home heating bills can be explained by


home size.
(B) 49% of the variability in home heating bills can be explained by
home size.
(C) For each added square foot of home size, heating bills
increased by 70 cents.
(D) For each added square foot of home size, heating bills
increased by 49 cents.
(E) None of the above.

Solution

The correct answer is (B). The coefficient of determination


measures the proportion of variation in the dependent variable that
is predictable from the independent variable. The coefficient of
determination is equal to R2; in this case, (0.70)2 or 0.49. Therefore,
49% of the variability in heating bills can be explained by home
size.

Regression Equation

A regression equation allows us to express the relationship


between two (or more) variables algebraically. It indicates the
nature of the relationship between two (or more) variables. In
particular, it indicates the extent to which you can predict some
variables by knowing others, or the extent to which some are
associated with others.

A linear regression equation is usually written


Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient

X is the independent variable (or covariate)


e is the error term

The equation will specify the average magnitude of the expected


change in Y given a change in X.

The regression equation is often represented on a scatterplot by a


regression line.

A Simple Regression Example

In this lesson, we show how to apply regression analysis to some


fictitious data, and we show how to interpret the results of our
analysis.

Note: Regression computations are usually handled by a software


package or a graphing calculator. For this example, however, we
will do the computations "manually", since the gory details have
educational value.

Problem Statement

Last year, five randomly selected students took a math aptitude test
before they began their statistics course. The Statistics Department
has three questions.

 What linear regression equation best predicts statistics


performance, based on math aptitude scores?
 If a student made an 80 on the aptitude test, what grade
would we expect her to make in statistics?
 How well does the regression equation fit the data?

How to Find the Regression Equation

In the table below, the xi column shows scores on the aptitude test.
Similarly, the yi column shows statistics grades. The last two rows
show sums and mean scores that we will use to conduct the
regression analysis.

Student   xi    yi   (xi - x)   (yi - y)   (xi - x)²   (yi - y)²   (xi - x)(yi - y)
1         95    85      17          8         289          64             136
2         85    95       7         18          49         324             126
3         80    70       2         -7           4          49             -14
4         70    65      -8        -12          64         144              96
5         60    70     -18         -7         324          49             126
Sum      390   385                            730         630             470
Mean      78    77

The regression equation is a linear equation of the form: ŷ = b0 +


b1x . To conduct a regression analysis, we need to solve for b0 and
b1. Computations are shown below.

b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)² ]
b1 = 470/730 = 0.644

b0 = y - b1 * x
b0 = 77 - (0.644)(78) = 26.768

Therefore, the regression equation is: ŷ = 26.768 + 0.644x .
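The table computations can be reproduced in a few lines of Python (a sketch; the variable names simply mirror the table columns):

```python
xs = [95, 85, 80, 70, 60]  # aptitude test scores
ys = [85, 95, 70, 65, 70]  # statistics grades

x_bar = sum(xs) / len(xs)  # 78
y_bar = sum(ys) / len(ys)  # 77

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 470
sxx = sum((x - x_bar) ** 2 for x in xs)                       # 730

b1 = sxy / sxx           # 470/730 ≈ 0.644
b0 = y_bar - b1 * x_bar  # ≈ 26.78
```

Carrying full precision gives b0 ≈ 26.781; the value 26.768 above comes from rounding b1 to 0.644 before computing b0.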

How to Use the Regression Equation

Once you have the regression equation, using it is a snap. Choose


a value for the independent variable (x), perform the computation,
and you have an estimated value (ŷ) for the dependent variable.

In our example, the independent variable is the student's score on


the aptitude test. The dependent variable is the student's statistics
grade. If a student made an 80 on the aptitude test, the estimated
statistics grade would be:

ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 =


78.288

Warning: When you use a regression equation, do not use values


for the independent variable that are outside the range of values
used to create the equation. That is called extrapolation, and it
can produce unreasonable estimates.

In this example, the aptitude test scores used to create the


regression equation ranged from 60 to 95. Therefore, only use
values inside that range to estimate statistics grades. Using values
outside that range (less than 60 or greater than 95) is problematic.
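A small sketch of prediction with an extrapolation guard (the `predict` helper and its range check are our own illustration, not part of the lesson):

```python
B0, B1 = 26.768, 0.644  # fitted coefficients from the example above
X_MIN, X_MAX = 60, 95   # range of aptitude scores used to fit the line

def predict(x):
    """Predict a statistics grade, refusing to extrapolate."""
    if not X_MIN <= x <= X_MAX:
        raise ValueError(f"x = {x} is outside the fitted range [{X_MIN}, {X_MAX}]")
    return B0 + B1 * x

grade = predict(80)  # ≈ 78.288, as computed above
# predict(100) would raise ValueError, flagging extrapolation
```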

How to Find the Coefficient of Determination

Whenever you use a regression equation, you should ask how well
the equation fits the data. One way to assess fit is to check the
coefficient of determination, which can be computed from the
following formula.

R2 = { ( 1 / N ) * Σ [ (xi - x) * (yi - y) ] / (σx * σy ) }²

where N is the number of observations used to fit the model, Σ is


the summation symbol, xi is the x value for observation i, x is the
mean x value, yi is the y value for observation i, y is the mean y
value, σx is the standard deviation of x, and σy is the standard
deviation of y. Computations for the sample problem of this lesson
are shown below.

σx = sqrt [ Σ ( xi - x )² / N ]
σx = sqrt( 730/5 ) = sqrt(146) = 12.083

σy = sqrt [ Σ ( yi - y )² / N ]
σy = sqrt( 630/5 ) = sqrt(126) = 11.225

R2 = { ( 1 / N ) * Σ [ (xi - x) * (yi - y) ] / (σx * σy ) }²
R2 = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]² = ( 94 / 135.632 )² = ( 0.693 )² = 0.48
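The same computation as a Python sketch (note the σ's here are population standard deviations, dividing by N as in the formula above):

```python
xs = [95, 85, 80, 70, 60]
ys = [85, 95, 70, 65, 70]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# population standard deviations (divide by N, matching the formula)
sigma_x = (sum((x - x_bar) ** 2 for x in xs) / n) ** 0.5  # sqrt(146) ≈ 12.083
sigma_y = (sum((y - y_bar) ** 2 for y in ys) / n) ** 0.5  # sqrt(126) ≈ 11.225

mean_cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n  # 470/5 = 94
r_squared = (mean_cross / (sigma_x * sigma_y)) ** 2                      # ≈ 0.48
```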

A coefficient of determination equal to 0.48 indicates that about


48% of the variation in statistics grades (the dependent variable)
can be explained by the relationship to math aptitude scores (the
independent variable). This would be considered a good fit to the
data, in the sense that it would substantially improve an educator's
ability to predict student performance in statistics class.

Regression Line

A regression line is a line drawn through the points on a scatterplot


to summarise the relationship between the variables being studied.
When it slopes down (from top left to bottom right), this indicates a
negative or inverse relationship between the variables; when it
slopes up (from bottom left to top right), a positive or direct
relationship is indicated.

The regression line often represents the regression equation on a


scatterplot.

Simple Linear Regression



Simple linear regression aims to find a linear relationship between a


response variable and a possible predictor variable by the method
of least squares.

Multiple Regression

Multiple linear regression aims to find a linear relationship


between a response variable and several possible predictor
variables.

Nonlinear Regression

Nonlinear regression aims to describe the relationship between a


response variable and one or more explanatory variables in a non-
linear fashion.

Residual

Residual (or error) represents unexplained (or residual) variation


after fitting a regression model. It is the difference (or left over)
between the observed value of the variable and the value
suggested by the regression model.

Residuals

Earlier, we defined the residuals as the vertical distance between


the fitted regression line and the data points. Another way to look at
this is, the residual is the difference between the observed value
and the predicted value,

ei = yi - ŷi

Note that the sum of residuals, Σei, always equals zero.
However, the quantity we minimized to obtain our least squares
equation was the residual sum of squares, SSRes,

SSRes = Σ ei² = Σ (yi - ŷi)²

An alternative formula for SSRes is

SSRes = Σ yi² - b0 Σ yi - b1 Σ xiyi

Notice that the residual sum of squares is never negative. The
larger SSRes is, the further away from the line data points fall.
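These properties are easy to check numerically. A sketch using the aptitude-test example from earlier in this chapter (the rounded coefficients are taken from that example):

```python
B0, B1 = 26.768, 0.644  # fitted line from the aptitude-test example
xs = [95, 85, 80, 70, 60]
ys = [85, 95, 70, 65, 70]

y_hat = [B0 + B1 * x for x in xs]                 # predicted values
residuals = [y - yh for y, yh in zip(ys, y_hat)]  # e_i = y_i - ŷ_i

sum_e = sum(residuals)                   # ≈ 0, as claimed above
ss_res = sum(e ** 2 for e in residuals)  # ≈ 327.4, never negative
```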

 Outlier: A data point which does not follow the pattern of the
majority of the data.
 Influential Point: A point whose removal makes a large difference
in the equation of the regression line.
 Lurking Variable: A variable which affects the response but is not
one of the explanatory variables.
 Residual Plot: A plot of the residuals versus x (or the fitted value).
We do not want any pattern in this plot.

Multiple Regression Correlation Coefficient

The multiple regression correlation coefficient, R², is a measure of


the proportion of variability explained by, or due to the regression
(linear relationship) in a sample of paired data. It is a number
between zero and one and a value close to zero suggests a poor
model.

A very high value of R² can arise even though the relationship


between the two variables is non-linear. The fit of a model should
never simply be judged from the R² value.

Stepwise Regression

A 'best' regression model is sometimes developed in stages. A list


of several potential explanatory variables are available and this list
is repeatedly searched for variables which should be included in the
model. The best explanatory variable is used first, then the second
best, and so on. This procedure is known as stepwise regression.

Dummy Variable (in regression)

In regression analysis we sometimes need to modify the form of


non-numeric variables, for example sex, or marital status, to allow
their effects to be included in the regression model. This can be
done through the creation of dummy variables whose role it is to
identify each level of the original variables separately.

Transformation to Linearity

Transformations allow us to change all the values of a variable by


using some mathematical operation, for example, we can change a
number, group of numbers, or an equation by multiplying or dividing
by a constant or taking the square root. A transformation to linearity
is a transformation of a response variable, or independent variable,
or both, which produces an approximate linear relationship between
the variables.

Residuals, Outliers, and Influential Points

A linear regression model is not always appropriate for the data.


You can assess the appropriateness of the model by examining
residuals, outliers, and influential points.

Residuals

The difference between the observed value of the dependent


variable (y) and the predicted value (ŷ) is called the residual (e).
Each data point has one residual.

Residual = Observed value - Predicted value


e=y-ŷ

Both the sum and the mean of the residuals are equal to zero. That
is, Σe = 0 and ē = 0.

Residual Plots

A residual plot is a graph that shows the residuals on the vertical


axis and the independent variable on the horizontal axis. If the
points in a residual plot are randomly dispersed around the
horizontal axis, a linear regression model is appropriate for the
data; otherwise, a non-linear model is more appropriate.

Below, the table on the left summarizes regression results from
the example presented in a previous lesson, and the chart on
the right displays those results as a residual plot. The residual plot
shows a non-random pattern - negative residuals on the low end of
the X axis and positive residuals on the high end. This indicates
that a non-linear model will provide a much better fit to the data. Or
it may be possible to "transform" the data to allow us to use a linear
model. We discuss linear transformations in the next lesson.

x     95      85      80      70      60
y     85      95      70      65      70
ŷ    87.95   81.51   78.29   71.84   65.41
e    -2.95   13.49   -8.29   -6.84    4.59

Below, the residual plots show three typical patterns. The first plot
shows a random pattern, indicating a good fit for a linear model.
The other plot patterns are non-random (U-shaped and inverted U),
suggesting a better fit for a non-linear model.

Random pattern | Non-random: U-shaped curve | Non-random: Inverted U

Outliers

Data points that diverge from the overall pattern and have large
residuals are called outliers.

Outliers limit the fit of the regression equation to the data. This is
illustrated in the scatterplots below. The coefficient of determination
is bigger when the outlier is not present.

Without Outlier:
Regression equation: ŷ = 104.78 - 4.10x
Coefficient of determination: R2 = 0.94

With Outlier:
Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R2 = 0.55

Influential Points

Influential points are data points with extreme values that greatly
affect the slope of the regression line.

The charts below compare regression statistics for a data set with
and without an influential point. The chart on the right has a single
influential point, located at the high end of the X axis (where x =
24). As a result of that single influential point, the slope of the
regression line increases dramatically, from -2.5 to -1.6.

Note that this influential point, unlike the outliers discussed above,
did not reduce the coefficient of determination. In fact, the
coefficient of determination was bigger when the influential point
was present.

Without Influential Point:
Regression equation: ŷ = 92.54 - 2.5x
Slope: b1 = -2.5
Coefficient of determination: R2 = 0.46

With Influential Point:
Regression equation: ŷ = 87.59 - 1.6x
Slope: b1 = -1.6
Coefficient of determination: R2 = 0.52

Test Your Understanding of This Lesson

In the context of regression analysis, which of the following


statements are true?

I. When the sum of the residuals is greater than zero, the model
is nonlinear.
II. Outliers reduce the coefficient of determination.
III. Influential points reduce the correlation coefficient.

(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III

Solution

The correct answer is (B). Outliers reduce the ability of a regression


model to fit the data, and thus reduce the coefficient of
determination. The sum of the residuals is always zero, whether the

regression model is linear or nonlinear. And influential points often


increase the correlation coefficient.

Transformations to Achieve Linearity

When a residual plot reveals a data set to be nonlinear, it is often


possible to "transform" the raw data to make it linear. This allows us
to use linear regression techniques appropriately with nonlinear
data.

What is a Transformation to Achieve Linearity?

Transforming a variable involves using a mathematical operation to


change its measurement scale. Broadly speaking, there are two
kinds of transformations.

 Linear transformation. A linear transformation preserves


linear relationships between variables. Therefore, the
correlation between x and y would be unchanged after a
linear transformation. Examples of a linear transformation to
variable x would be multiplying x by a constant, dividing x by
a constant, or adding a constant to x.
 Nonlinear transformation. A nonlinear transformation changes
(increases or decreases) linear relationships between
variables and, thus, changes the correlation between
variables. Examples of a nonlinear transformation of variable
x would be taking the square root of x or the reciprocal of x.

In regression, a transformation to achieve linearity is a special kind


of nonlinear transformation. It is a nonlinear transformation that
increases the linear relationship between two variables.

Methods of Transforming Variables to Achieve Linearity

There are many ways to transform variables to achieve linearity for


regression analysis. Some common methods are summarized
below.

Method                       Transformation(s)               Regression equation       Predicted value (ŷ)
Standard linear regression   None                            y = b0 + b1x              ŷ = b0 + b1x
Exponential model            Dependent variable = log(y)     log(y) = b0 + b1x         ŷ = 10^(b0 + b1x)
Quadratic model              Dependent variable = sqrt(y)    sqrt(y) = b0 + b1x        ŷ = (b0 + b1x)²
Reciprocal model             Dependent variable = 1/y        1/y = b0 + b1x            ŷ = 1 / (b0 + b1x)
Logarithmic model            Independent variable = log(x)   y = b0 + b1·log(x)        ŷ = b0 + b1·log(x)
Power model                  Dependent variable = log(y),    log(y) = b0 + b1·log(x)   ŷ = 10^(b0 + b1·log(x))
                             Independent variable = log(x)

Each row shows a different nonlinear transformation method. The


second column shows the specific transformation applied to
dependent and/or independent variables. The third column shows
the regression equation used in the analysis. And the last column
shows the "back transformation" equation used to restore the
dependent variable to its original, non-transformed measurement
scale.

In practice, these methods need to be tested on the data to which


they are applied to be sure that they increase rather than decrease
the linearity of the relationship. Testing the effect of a
transformation method involves looking at residual plots and
correlation coefficients, as described in the following sections.

Note: The logarithmic model and the power model require the
ability to work with logarithms. Use a graphing calculator to obtain
the log of a number or to transform back from the logarithm to the
original number. If you need it, the Stat Trek glossary has a brief
refresher on logarithms.

How to Perform a Transformation to Achieve Linearity

Transforming a data set to achieve linearity is a multi-step, trial-


and-error process.

 Choose a transformation method (see above table).


 Transform the independent variable, dependent variable, or
both.
 Plot the independent variable against the dependent
variable, using the transformed data.
 If the scatterplot is linear, proceed to the next step.
 If the plot is not linear, return to Step 1 and try a
different approach. Choose a different transformation
method and/or transform a different variable.
 Conduct a regression analysis, using the transformed
variables.
 Create a residual plot, based on regression results.

 If the residual plot shows a linear pattern, the


transformation was successful. Congratulations!
 If the plot pattern is nonlinear, return to Step 1 and try
a different approach.

The best transformation method (exponential model, quadratic


model, reciprocal model, etc.) will depend on nature of the original
data. The only way to determine which method is best is to try each
and compare the result (i.e., residual plots, correlation coefficients).

A Transformation Example

Below, the table on the left shows data for independent and
dependent variables - x and y, respectively. When we apply a linear
regression to the raw data, the residual plot shows a non-random
pattern (a U-shaped curve), which suggests that the data are
nonlinear.

x   1   2   3    4    5    6    7    8    9
y   2   1   6   14   15   30   40   74   75

Suppose we repeat the analysis, using a quadratic model to


transform the dependent variable. For a quadratic model, we use
the square root of y, rather than y, as the dependent variable. The
table below shows the data we analyzed.

x         1      2      3      4      5      6      7      8      9
sqrt(y)   1.41   1.00   2.45   3.74   3.87   5.48   6.32   8.60   8.66

The residual plot (above right) suggests that the transformation to


achieve linearity was successful. The pattern of residuals is
random, suggesting that the relationship between the independent
variable (x) and the transformed dependent variable (square root of
y) is linear. And the coefficient of determination was 0.96 with the
transformed data versus only 0.88 with the raw data. The
transformed data resulted in a better model.
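The claimed improvement can be verified numerically. A sketch (the `r_squared` helper is our own) computing R² for the raw and the square-root-transformed data:

```python
xs = list(range(1, 10))
ys = [2, 1, 6, 14, 15, 30, 40, 74, 75]

def r_squared(xs, ys):
    """Squared correlation between xs and ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

raw = r_squared(xs, ys)                              # ≈ 0.88
transformed = r_squared(xs, [y ** 0.5 for y in ys])  # ≈ 0.96
```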

Test Your Understanding of This Lesson

Problem

In the context of regression analysis, which of the following


statements are true?

I. A linear transformation increases the linear relationship


between variables.
II. A logarithmic model is the most effective transformation
method.
III. A residual plot reveals departures from linearity.

(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III

Solution

The correct answer is (C). A linear transformation neither increases


nor decreases the linear relationship between variables; it
preserves the relationship. A nonlinear transformation is used to
increase the relationship between variables. The most effective
transformation method depends on the data being transformed. In
some cases, a logarithmic model may be more effective than other
methods; but in other cases it may be less effective. Non-random
patterns in a residual plot suggest a departure from linearity in the
data being plotted.

Estimate Regression Slope

This lesson describes how to construct a confidence interval to


estimate the slope of a regression line

ŷ = b0 + b1x

where b0 is a constant, b1 is the slope (also called the regression


coefficient), x is the value of the independent variable, and ŷ is the
predicted value of the dependent variable.

Estimation Requirements

The approach described in this lesson is valid whenever the


standard requirements for simple linear regression are met.

 The dependent variable Y has a linear relationship to the


independent variable X.
 For each value of X, the probability distribution of Y has the
same standard deviation σ.
 For any given value of X,

 The Y values are independent.


 The Y values are roughly normally distributed (i.e.,
symmetric and unimodal). A little skewness is ok if the
sample size is large.

Previously, we described how to verify that regression requirements


are met.

The Variability of the Slope Estimate

To construct a confidence interval for the slope of the regression


line, we need to know the standard error of the sampling
distribution of the slope. Many statistical software packages and
some graphing calculators provide the standard error of the slope
as a regression analysis output. The table below shows
hypothetical output for the following regression equation: y = 76 +
35x .

Predictor Coef SE Coef T P


Constant 76 30 2.53 0.01
X 35 20 1.75 0.04

In the output above, the standard error of the slope is equal to 20
(the "SE Coef" value in the X row). However, other software
packages might use a different label for the standard error. It might
be "StDev", "SE", "Std Dev", or something else.

If you need to calculate the standard error of the slope (SE) by


hand, use the following formula:

SE = sb1 = sqrt [ Σ(yi - ŷi)² / (n - 2) ] / sqrt [ Σ(xi - x)² ]

where yi is the value of the dependent variable for observation i, ŷi


is estimated value of the dependent variable for observation i, xi is
the observed value of the independent variable for observation i, x
is the mean of the independent variable, and n is the number of
observations.
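Applying this formula to the five-student aptitude-test example from earlier lessons (a sketch; nothing here beyond the formula above and that data set):

```python
xs = [95, 85, 80, 70, 60]  # aptitude scores
ys = [85, 95, 70, 65, 70]  # statistics grades
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# fit the least squares line first
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

# SE = sqrt[ Σ(yi - ŷi)² / (n - 2) ] / sqrt[ Σ(xi - x)² ]
ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
se_slope = (ss_res / (n - 2)) ** 0.5 / sxx ** 0.5  # ≈ 0.387
```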

How to Find the Confidence Interval for the Slope of a


Regression Line

Previously, we described how to construct confidence intervals.


The confidence interval for the slope uses the same general
approach. Note, however, that the critical value is based on a t
score with n - 2 degrees of freedom.

 Identify a sample statistic. The sample statistic is the


regression slope b1 calculated from sample data. In the table
above, the regression slope is 35.

 Select a confidence level. The confidence level describes


the uncertainty of a sampling method. Often, researchers
choose 90%, 95%, or 99% confidence levels; but any
percentage can be used.

 Find the margin of error. Previously, we showed how to


compute the margin of error, based on the critical value and
standard error. When calculating the margin of error for a
regression slope, use a t score for the critical value, with
degrees of freedom (DF) equal to n - 2.

 Specify the confidence interval. The range of the confidence


interval is defined by the sample statistic ± margin of error.
And the uncertainty is denoted by the confidence level.

In the next section, we work through a problem that shows how to


use this approach to construct a confidence interval for the slope of
a regression line.

Test Your Understanding of This Lesson

Problem 1

The local utility company surveys 101 randomly selected


customers. For each survey participant, the company collects the
following: annual electric bill (in dollars) and home size (in square
feet). Output from a regression analysis appears below.

Regression equation: Annual bill = 0.55 * Home size + 15

Predictor Coef SE Coef T P


Constant 15 3 5.0 0.00
Home size 0.55 0.24 2.29 0.01

What is the 99% confidence interval for the slope of the regression
line?

(A) 0.25 to 0.85


(B) 0.02 to 1.08
(C) -0.08 to 1.18
(D) 0.20 to 1.30
(E) 0.30 to 1.40

Solution

The correct answer is (C). Use the following four-step approach to


construct a confidence interval.

 Identify a sample statistic. Since we are trying to estimate


the slope of the true regression line, we use the regression
coefficient for home size (i.e., the sample estimate of slope)
as the sample statistic. From the regression output, we see
that the slope coefficient is 0.55.

 Select a confidence level. In this analysis, the confidence


level is defined for us in the problem. We are working with a
99% confidence level.

 Find the margin of error. Elsewhere on this site, we show


how to compute the margin of error. The key steps applied to
this problem are shown below.

 Find standard deviation or standard error. The


standard error is given in the regression output. It is
0.24.
 Find critical value. The critical value is a factor used to
compute the margin of error. With simple linear
regression, to compute a confidence interval for the
slope, the critical value is a t score with degrees of
freedom equal to n - 2. To find the critical value, we
take these steps.

o Compute alpha (α): α = 1 - (confidence level /


100) = 1 - 99/100 = 0.01
o Find the critical probability (p*): p* = 1 - α/2 = 1
- 0.01/2 = 0.995
o Find the degrees of freedom (df): df = n - 2 =
101 - 2 = 99.
o The critical value is the t score having 99
degrees of freedom and a cumulative
probability equal to 0.995. From the t
Distribution Calculator, we find that the critical
value is 2.63.
 Compute margin of error (ME): ME = critical value *
standard error = 2.63 * 0.24 = 0.63
 Specify the confidence interval. The range of the confidence
interval is defined by the sample statistic ± margin of error.
And the uncertainty is denoted by the confidence level.

Therefore, the 99% confidence interval is -0.08 to 1.18. That is, we


are 99% confident that the true slope of the regression line is in the
range defined by 0.55 ± 0.63.
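The arithmetic in this solution is easy to check with a few lines of Python (scipy assumed):

```python
from scipy import stats

b1, se, n = 0.55, 0.24, 101        # slope, standard error, sample size
df = n - 2                         # 101 - 2 = 99 degrees of freedom
t_crit = stats.t.ppf(0.995, df)    # critical value for 99% confidence
me = t_crit * se                   # margin of error
lo, hi = b1 - me, b1 + me
print(round(t_crit, 2), round(lo, 2), round(hi, 2))  # 2.63 -0.08 1.18
```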

Hypothesis Test for Slope of Regression Line

This lesson describes how to conduct a hypothesis test to


determine whether there is a significant linear relationship between
an independent variable X and a dependent variable Y. The test
focuses on the slope of the regression line

Y = Β0 + Β1X

where Β0 is a constant, Β1 is the slope (also called the regression


coefficient), X is the value of the independent variable, and Y is the
value of the dependent variable.

Test Requirements

The approach described in this lesson is valid whenever the


standard requirements for simple linear regression are met.

 The dependent variable Y has a linear relationship to the


independent variable X.
 For each value of X, the probability distribution of Y has the
same standard deviation σ.
 For any given value of X,

 The Y values are independent.


 The Y values are roughly normally distributed (i.e.,
symmetric and unimodal). A little skewness is ok if the
sample size is large.

Previously, we described how to verify that regression requirements


are met.

The test procedure consists of four steps: (1) state the hypotheses,
(2) formulate an analysis plan, (3) analyze sample data, and (4)
interpret results.

State the Hypotheses

If there is a significant linear relationship between the independent


variable X and the dependent variable Y, the slope will not equal
zero.

H0: Β1 = 0
Ha: Β1 ≠ 0

The null hypothesis states that the slope is equal to zero, and the
alternative hypothesis states that the slope is not equal to zero.

Formulate an Analysis Plan



The analysis plan describes how to use sample data to accept or


reject the null hypothesis. The plan should specify the following
elements.

 Significance level. Often, researchers choose significance


levels equal to 0.01, 0.05, or 0.10; but any value between 0
and 1 can be used.

 Test method. Use a linear regression t-test (described in the


next section) to determine whether the slope of the
regression line differs significantly from zero.

Analyze Sample Data

Using sample data, find the standard error of the slope, the slope of
the regression line, the degrees of freedom, the test statistic, and
the P-value associated with the test statistic. The approach
described in this section is illustrated in the sample problem at the
end of this lesson.

 Standard error. Many statistical software packages and


some graphing calculators provide the standard error of the
slope as a regression analysis output. The table below
shows hypothetical output for the following regression
equation: y = 76 + 35x .

Predictor Coef SE Coef T P


Constant 76 30 2.53 0.01
X 35 20 1.75 0.04

 In the output above, the standard error of the slope (shaded


in gray) is equal to 20. In this example, the standard error is
referred to as "SE Coef". However, other software packages
might use a different label for the standard error. It might be
"StDev", "SE", "Std Dev", or something else.

If you need to calculate the standard error of the slope (SE)


by hand, use the following formula:
SE = sb1 = sqrt [ Σ(yi - ŷi)² / (n - 2) ] / sqrt [ Σ(xi - x̄)² ]
where yi is the value of the dependent variable for
observation i, ŷi is the estimated value of the dependent variable
for observation i, xi is the observed value of the independent
variable for observation i, x̄ is the mean of the independent
variable, and n is the number of observations.

 Slope. Like the standard error, the slope of the regression


line will be provided by most statistics software packages. In
the hypothetical output above, the slope is equal to 35.

 Degrees of freedom. The degrees of freedom (DF) is equal


to:

DF = n - 2

where n is the number of observations in the sample.

 Test statistic. The test statistic is a t-score (t) defined by the


following equation.

t = b1 / SE

where b1 is the slope of the sample regression line, and SE


is the standard error of the slope.

 P-value. The P-value is the probability of observing a sample


statistic as extreme as the test statistic. Since the test
statistic is a t-score, use the t Distribution Calculator to
assess the probability associated with the test statistic. Use
the degrees of freedom computed above.
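Putting these pieces together, here is a minimal sketch of the linear regression t-test: the hand-computed SE, the t score, and the two-tailed P-value (numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """Linear regression t-test of H0: slope = 0, using the formulas above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # sample slope
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2)) / np.sqrt(sxx)  # SE of slope
    t = b1 / se                                          # test statistic
    p = 2 * stats.t.sf(abs(t), n - 2)                    # two-tailed P-value
    return b1, se, t, p
```

For real work, `scipy.stats.linregress` returns the same slope, standard error, and P-value directly.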

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the
researcher rejects the null hypothesis. Typically, this involves
comparing the P-value to the significance level, and rejecting the
null hypothesis when the P-value is less than the significance level.

Test Your Understanding of This Lesson

Problem

The local utility company surveys 101 randomly selected


customers. For each survey participant, the company collects the
following: annual electric bill (in dollars) and home size (in square
feet). Output from a regression analysis appears below.

Regression equation: Annual bill = 0.55 * Home size + 15

Predictor Coef SE Coef T P


Constant 15 3 5.0 0.00
Home size 0.55 0.24 2.29 0.01

Is there a significant linear relationship between annual bill and


home size? Use a 0.05 level of significance.

Solution

The solution to this problem takes four steps: (1) state the
hypotheses, (2) formulate an analysis plan, (3) analyze sample
data, and (4) interpret results. We work through those steps below:

 State the hypotheses. The first step is to state the null


hypothesis and an alternative hypothesis.

H0: The slope of the regression line is equal to zero.


Ha: The slope of the regression line is not equal to zero.

If the relationship between home size and electric bill is


significant, the slope will not equal zero.

 Formulate an analysis plan. For this analysis, the


significance level is 0.05. Using sample data, we will conduct
a linear regression t-test to determine whether the slope of
the regression line differs significantly from zero.
 Analyze sample data. To apply the linear regression t-test
to sample data, we require the standard error of the slope,
the slope of the regression line, the degrees of freedom, the
t-score test statistic, and the P-value of the test statistic.

We get the slope (b1) and the standard error (SE) from the
regression output.

b1 = 0.55 SE = 0.24

We compute the degrees of freedom and the t-score test


statistic, using the following equations.

DF = n - 2 = 101 - 2 = 99

t = b1/SE = 0.55/0.24 = 2.29

where DF is the degrees of freedom, n is the number of


observations in the sample, b1 is the slope of the regression
line, and SE is the standard error of the slope.

Based on the t-score test statistic and the degrees of


freedom, we determine the P-value. The P-value is the
probability that a t-score having 99 degrees of freedom is
more extreme than 2.29. Since this is a two-tailed test,
"more extreme" means greater than 2.29 or less than -2.29.
We use the t Distribution Calculator to find P(t > 2.29) =
0.0121 and P(t < -2.29) = 0.0121. Therefore, the P-value is
0.0121 + 0.0121 = 0.0242.

 Interpret results. Since the P-value (0.0242) is less than the


significance level (0.05), we reject the null hypothesis.

Note: If you use this approach on an exam, you may also want to
mention that this approach is only appropriate when the standard
requirements for simple linear regression are satisfied.
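The P-value computed in step 3 of the solution can be verified numerically (scipy assumed):

```python
from scipy import stats

t_score, df = 2.29, 99
# Two-tailed test: P(t > 2.29) plus P(t < -2.29)
p_value = 2 * stats.t.sf(t_score, df)
# p_value comes out close to the 0.0242 worked out above
```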


Coefficient of Determination

A statistic that is widely used to determine how well a regression
fits is the coefficient of determination (or multiple correlation
coefficient), R². R² represents the fraction of variability in y that can
be explained by the variability in x. In other words, R² tells us how
much of the variability in the y's can be explained by the fact that
they are related to x, i.e., how close the points are to the line. The
equation for R² is

R² = 1 - SSE / SSTotal

where SSE is the error (residual) sum of squares and SSTotal is the
total sum of squares of the data.

NOTE: In the simple linear regression case, R² is simply the square
of the correlation coefficient.
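This note is easy to confirm numerically on a small made-up data set (numpy and scipy assumed; the numbers are illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # roughly linear in x

res = stats.linregress(x, y)
fitted = res.intercept + res.slope * x
ss_error = np.sum((y - fitted) ** 2)      # residual (error) sum of squares
ss_total = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - ss_error / ss_total

# In simple linear regression, R² equals the squared correlation coefficient
assert abs(r_squared - res.rvalue ** 2) < 1e-9
```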

From simple regression to multiple regression

What happens if we have more than two independent variables? In


most cases, we can't draw graphs to illustrate the relationship
between them all. But we can still represent the relationship by an
equation. This is what multiple regression does. It's a
straightforward extension of simple regression. If there are n
independent variables, we call them x1, x2, x3 and so on up to xn.
Multiple regression then finds values of a, b1, b2, b3 and so on up to
bn which give the best fitting equation of the form

y = a + b1x1 + b2x2 + b3x3 + ... + bnxn



b1 is called the coefficient of x1, b2 is the coefficient of x2, and so


forth. The equation is exactly like the one for simple regression,
except that it is very laborious to work out the values of a, b1 etc by
hand. Minitab, however, does it with exactly the same command as
for simple regression.
What do the regression coefficients mean? The coefficient of each
independent variable tells us what relation that variable has with y,
the dependent variable, with all the other independent variables
held constant. So, if b1 is high and positive, that means that if x2, x3
and so on up to xn do not change, then increases in x1 will
correspond to large increases in y.
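A multiple regression of this form can be fitted with numpy's least-squares solver; the data here are synthetic and the variable names illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
# True relationship: y = 1.5 + 2*x1 - 0.5*x2 (x3 has no effect), plus noise
y = 1.5 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the constant a, then x1..x3
X = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2, b3 = coef   # best-fit y = a + b1*x1 + b2*x2 + b3*x3
```

Each recovered coefficient is the relation of its regressor to y with the others held constant, so b3 should come out near zero here.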

Goodness of fit in multiple regression

In multiple regression, as in simple regression, we can work out a


value for R². However, every time we add another independent
variable, we necessarily increase the value of R² (you can get an
idea of why this happens if you compare Fig 3 with Fig 1 in the
handout on "The idea of a regression equation"). Therefore, in
assessing the goodness of fit of a regression equation, we usually
work in terms of a slightly different statistic, called adjusted R² or
R²adj. This is calculated as

R²adj = 1 - (1 - R²)(N - 1)/(N - n - 1)

where N is the number of observations in the data set (usually the


number of people) and n the number of independent variables or
regressors. This allows for the extra regressors. Check that you
can see from the formula that R²adj will always be lower than R²
whenever there is at least one regressor. There is also another way of
assessing goodness of fit in multiple regression, using the F
statistic which we will meet in a moment.
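As a sketch, the adjusted statistic is one line of Python; plugging in the R² = 52%, N = 16, n = 3 figures from the worked example at the end of this chapter gives approximately 40%, in line with the Minitab output there:

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R²: penalises R² for the number of regressors.

    n_obs        -- N, the number of observations
    n_regressors -- n, the number of independent variables (regressors)
    """
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Worked example figures: R² = 52%, N = 16, n = 3
adj = adjusted_r_squared(0.52, 16, 3)   # ≈ 0.40
```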

The main questions multiple regression answers

Multiple regression enables us to answer five main questions about


a set of data, in which n independent variables (regressors), x1 to
xn, are being used to explain the variation in a single dependent
variable, y.

1. How well do the regressors, taken together, explain the


variation in the dependent variable? This is assessed by the
value of R²adj. As a very rough guide, in psychological
applications we would usually reckon an R²adj of above 75%
as very good; 50% to 75% as good; 25% to 50% as poor but
acceptable; and below 25% as very poor and perhaps
unacceptable. Alas, an R²adj value above 90% is very rare in
psychology, and should make you wonder whether there is
some artefact in your data.

2. Are the regressors, taken together, significantly associated


with the dependent variable? This is assessed by the
statistic F in the "Analysis of Variance" or anova part of the
regression output. F is like some other statistics (e.g. t, chi2)
in that its significance depends on its degrees of freedom,
which in turn depend on sample sizes and/or the nature of
the test used. Unlike t, though, F has two degrees of
freedom associated with it. In general they are referred to as
the numerator and denominator degrees of freedom
(because F is actually a ratio). In regression, the numerator
degrees of freedom are associated with the regression, and
the denominator degrees of freedom with the residual or
error; you can find them in the Regression and Error rows of
the anova table in the Minitab output. If you were finding the
significance of an F value by looking it up in a book of tables,
you would need the degrees of freedom to do it. Minitab
works out significances for you, and you will find them in the
anova table next to the F value; but you need to use the
degrees of freedom when reporting the results (see below).
Note that the higher the value of F, the more significant it will
be for given degrees of freedom.
3. What relationship does each regressor have with the
dependent variable when all other regressors are held
constant? This is answered by looking at the regression
coefficients. Minitab reports these twice, once in the
regression equation and again (to an extra decimal place) in
the table of regression coefficients and associated statistics.
Note that regression coefficients have units. So if the
dependent variable is score on a psychometric test of
depression, and one of the regressors is monthly income,
the coefficient for that regressor would have units of (scale
points) per (£ of income per month). That means that if we
changed the units of one of the variables, the regression
coefficient would change but the relationship it is describing,
and what it is saying about it, would not. So the size of a
regression coefficient doesn't tell us anything about the
strength of the relationship it describes until we have taken
the units into account. The fact that regression coefficients
have units also means that we can give a precise
interpretation to each coefficient. So, staying with depression
score and income, a coefficient of -0.0934 (as in the worked
example on the next sheet) would mean that, with all other
variables held constant, increasing someone's income by 1
per month is associated with a decrease of depression score
of 0.0934 points (we might want to make this more
meaningful by saying that an increase in income of 100 per
month would be associated with a decrease in depression
score of 100 * 0.0934 = 9.34 scale units). As in this example,
negative coefficients mean that when the regressor

increases, the dependent variable decreases. If the


regressor is a dichotomous variable (e.g. gender), the size
of the coefficient tells us the size of the difference between
the two classes of individual (again, with all other variables
held constant). So a gender coefficient of 3.3, with men
coded 0 and women coded 1, would mean that with all other
variables held constant, women's dependent variable scores
would average 3.3 units higher than men's.
4. Which regressor has most effect on the dependent variable?
It is not possible to give a fully satisfactory answer to this
question, for a number of reasons. The chief one is that we
are always looking at the effect of each variable in the
presence of all the others; since the dependent variable
need not be independent, it is hard to be sure which one is
contributing to a joint relationship (or even to be sure that
that means anything). However, the usual way of addressing
the question is to look at the standardised regression
coefficients or beta weights for each variable; these are
the regression coefficients we would get if we converted all
variables (independent and dependent) to z-scores before
doing the regression. Minitab, unfortunately, does not report
beta weights for the independent variable in its regression
output, though it is possible to calculate them; SPSS, which
you will learn about later in the course, does give them
directly.
5. Are the relationships of each regressor with the dependent
variable statistically significant, with all other regressors
taken into account? This is answered by looking at the t
values in the table of regression coefficients. The degrees of
freedom for t are those for the residual in the anova table,
but Minitab works out significances for us, so we need to
know the degrees of freedom only when it comes to
reporting results. Note that if a regression coefficient is
negative, Minitab will report the corresponding t value as
negative, but if you were looking it up in tables, you would
use the absolute (unsigned) value.

Further questions to ask

Either the nature of the data, or the regression results, may suggest
further questions. For example, you may want to obtain means and
standard deviations or histograms of variables to check on their
distributions; or plot one variable against another, or obtain a matrix
of correlations, to check on first order relationships. Minitab does
some checking for you automatically, and reports if it finds "unusual
observations". If there are unusual observations, PLOT or
HISTOGRAM may tell you what the possible problems are. The
usual kinds of unusual observations are "outliers": points which lie
far from the main distributions or the main trends of one or more

variables. Serious outliers should be dealt with as follows:


1. temporarily remove the observations from the data set. In
Minitab, this can be done by using the LET command to set the
outlier value to "missing", indicated by an asterisk instead of a
numerical value. For example, if item 37 in the variable held in C1
looks like an outlier, we could type:
LET C1(37)='*'   (note the single quotes round the asterisk)

2. repeat the regression and see whether the same qualitative


results are obtained (the quantitative results will inevitably be
different).
3. if the same general results are obtained, we can conclude that
the outliers are not distorting the results. Report the results of the
original regression, adding a note that removal of outliers did not
greatly affect them.
4. if different general results are obtained, accurate interpretation
will require more data to be collected. Report the results of both
regressions, and note that the interpretation of the data is uncertain.
The outliers may be due to errors of observation, data coding, etc,
and in this case they should be corrected or discarded. However,
they may also represent a subpopulation for which the effects of
interest are different from those in the main population. If they are
not due to error, the group of data contributing to outliers will need
to be identified, and if possible a reasonably sized sample collected
from it so that it can be compared with the main population. This is
a scientific rather than a statistical problem.
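In Python, the same temporarily-remove-and-refit check might look like this (hypothetical data; the boolean mask plays the role of Minitab's LET ... = '*' missing value):

```python
import numpy as np

def refit_without(X, y, outlier_index):
    """Refit a least-squares regression with one observation removed
    (steps 1-2 above), returning both coefficient vectors for comparison."""
    n = len(y)
    mask = np.ones(n, dtype=bool)
    mask[outlier_index] = False   # analogue of setting the value to missing
    design = np.column_stack([np.ones(n), X])
    full, *_ = np.linalg.lstsq(design, y, rcond=None)
    reduced, *_ = np.linalg.lstsq(design[mask], y[mask], rcond=None)
    return full, reduced
```

If `full` and `reduced` tell the same qualitative story, the outlier is not distorting the results (step 3); if they differ, more data are needed (step 4).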

Reporting regression results

Research articles sometimes report the results of several different


regressions done on a single data set. In this case, it is best to
present the results in a table. Where a single regression is done,
however, that is unnecessary, and the results can be reported in
text. The wording should be something like the following (this is for
the depression vs age, income and gender example used in the
class):
The data were analysed by multiple regression, using as regressors
age, income and gender. The regression was a rather poor fit (R²adj
= 40%), but the overall relationship was significant (F3,12 = 4.32, p <
0.05). With other variables held constant, depression scores were
negatively related to age and income, decreasing by 0.16 for every
extra year of age, and by 0.09 for every extra pound per week
income. Women tended to have higher scores than men, by 3.3
units. Only the effect of income was significant (t12 = 3.18, p <
0.01).

Note the following:

 The above brief paragraph does not exhaust what you can
say about a set of regression results. There may be features
of the data you should look at "Unusual observations", for
example. Normally you will need to go on to discuss the
meaning of the trends you have described.
 Always report what happened before moving on to its
significance: R²adj values before F values, regression
coefficients before t values. Remember, descriptive statistics
are more important than significance tests.
 Although Minitab will give a negative t value if the
corresponding regression coefficient is negative, you should
drop the negative sign when reporting the results.
 Degrees of freedom for both F and t values must be given.
Usually they are written as subscripts. For F the numerator
degrees of freedom are given first. You can also put degrees
of freedom in parentheses, or report them explicitly, e.g.:
"F(3,12) = 4.32" or "F = 4.32, d. of f. = 3, 12".
 Significance levels can either be reported exactly (e.g. p =
0.032) or in terms of conventional levels (e.g. p < 0.05).
There are arguments in favour of either, so it doesn't much
matter which you do. But you should be consistent in any
one report.
 Beware of highly significant F or t values, whose significance
levels will be reported by statistics packages as, for
example, 0.0000. It is an act of statistical illiteracy to write p
= 0.0000; significance levels can never be exactly zero there
is always some probability that the observed data could arise
if the null hypothesis was true. What the package means is
that this probability is so low it can't be represented with the
number of columns available. We should write it as p <
0.00005.
 Beware of spurious precision, i.e. reporting coefficients etc
to huge numbers of significant figures when, on the basis
of the sample you have, you couldn't possibly expect them to
replicate to anything like that degree of precision if someone
repeated the study. F and t values are conventionally
reported to two decimal places, and R²adj values to the
nearest percentage point (sometimes to one additional
decimal place). For coefficients, you should be guided by the
sample size: with a sample size of 16, as in the example
used above, two significant figures is plenty, but even with
more realistic samples, in the range of 100 to 1000, three
significant figures is usually as far as you should go. This
means that you will usually have to round off the numbers
that Minitab will give you.

Worked example of an elementary multiple regression

MTB > set c1


DATA> 74 82 15 23 35 54 12 28 66 43 55 31 83 29 53 32
DATA> end
MTB > set c2
DATA> 120 55 350 210 185 110 730 150 61 175 121 225 45 325
171 103
DATA> end
MTB > set c3
DATA> 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 1
DATA> end
MTB > set c4
DATA> 33 28 47 55 32 63 59 68 27 32 42 51 47 33 51 20
DATA> end
MTB > name c1 'depress'
MTB > name c2 'income'
MTB > name c3 'm0f1'
MTB > name c4 'age'

MTB > regress c1 3 c2-c4

The regression equation is


depress = 68.3 - 0.0934 income + 3.31 m0f1 - 0.162 age

Predictor Coef Stdev t-ratio p


Constant 68.28 15.44 4.42 0.001
income -0.09336 0.02937 -3.18 0.008
m0f1 3.306 8.942 0.37 0.718
age -0.1617 0.3436 -0.47 0.646

s = 17.70   R-sq = 52.0%   R-sq(adj) = 39.9%


Analysis of Variance

SOURCE DF SS MS F p
Regression 3 4065.4 1355.1 4.32 0.028
Error 12 3760.0 313.3
Total 15 7825.4

SOURCE DF SEQ SS
income 1 3940.5
m0f1 1 55.5
age 1 69.4

Continue? y
Unusual Observations
Obs. income depress Fit Stdev.Fit Residual St.Resid
7 730 12.00 -6.10 15.57 18.10 2.15RX

R denotes an obs. with a large st. resid.


X denotes an obs. whose X value gives it large influence.
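The same fit can be reproduced outside Minitab. Here is a numpy version with the data transcribed from the SET commands above; the coefficients should match the printed output to rounding:

```python
import numpy as np

depress = np.array([74, 82, 15, 23, 35, 54, 12, 28,
                    66, 43, 55, 31, 83, 29, 53, 32], float)
income = np.array([120, 55, 350, 210, 185, 110, 730, 150,
                   61, 175, 121, 225, 45, 325, 171, 103], float)
m0f1 = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1], float)
age = np.array([33, 28, 47, 55, 32, 63, 59, 68,
                27, 32, 42, 51, 47, 33, 51, 20], float)

# Design matrix: constant, income, m0f1, age
X = np.column_stack([np.ones(len(depress)), income, m0f1, age])
coef, *_ = np.linalg.lstsq(X, depress, rcond=None)
const, b_income, b_m0f1, b_age = coef   # ≈ 68.28, -0.0934, 3.31, -0.162
```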

LINEAR REGRESSION

1. Introduction
o very often when 2 (or more) variables are observed,
relationship between them can be visualized
o predictions are always required in economics or
physical science from existing and historical data
o regression analysis is used to help formulate these
predictions and relationships
o linear regression is a special kind of regression
analysis in which 2 variables are studied and a
straight-line relationship is assumed
o linear regression is important because
1. there exist many relationships that are of this
form
2. it provides close approximations to complicated
relationships which would otherwise be difficult
to describe
o the 2 variables are divided into (i) independent
variable and (ii) dependent variable
o Dependent Variable is the variable that we want to
forecast
o Independent Variable is the variable that we use to
make the forecast
o e.g. Time vs. GNP (time is independent, GNP is
dependent)
o scatter diagrams are used to graphically present
the relationship between the 2 variables
o usually the independent variable is drawn on the
horizontal axis (X) and the dependent variable on
vertical axis (Y)
o the regression line is also called the regression line of
Y on X
2. Assumptions
o there is a linear relationship as determined (observed)
from the scatter diagram
o the dependent values (Y) are independent of each
other, i.e. if we obtain a large value of Y on the first
observation, the result of the second and subsequent
observations will not necessarily provide a large
value. In simple terms, there should not be auto-
correlation
o for each value of X the corresponding Y values are
normally distributed
o the standard deviations of the Y values for each value
of X are the same, i.e. homoscedasticity

3. Process
o observe and note what is happening in a systematic
way
o form some kind of theory about the observed facts
o draw a scatter diagram to visualize relationship
o generate the relationship by mathematical formula
o make use of the mathematical formula to predict
4. Method of Least Squares
o from a scatter diagram, there is virtually no limit as to
the number of lines that can be drawn to make a
linear relationship between the 2 variables
o the objective is to create a BEST FIT line to the data
concerned
o the criterion used is called the method of least squares
o i.e. the sum of squares of the vertical deviations from
the points to the line be a minimum (based on the fact
that the dependent variable is drawn on the vertical
axis)
o the linear relationship between the dependent
variable (Y) and the independent variable can be
written as Y = a + bX , where a and b are parameters
describing the vertical intercept and the slope of the
regression line respectively
5. Calculating a and b

o b = (nΣXY - ΣX·ΣY)/(nΣX² - (ΣX)²) and a = Ȳ - bX̄
o where X and Y are the raw values of the 2 variables
o and X̄ and Ȳ are the means of the 2 variables
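The computation of a and b from raw sums can be sketched in a few lines of Python (an addition for illustration, not part of the original course material). The data here are the Accounting (X) and Statistics (Y) marks from the worked example in this chapter:

```python
# Least-squares slope (b) and intercept (a) computed from raw sums.
X = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]
Y = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))   # Σ XY
sum_xx = sum(x * x for x in X)              # Σ X²

b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(round(a, 2), round(b, 3))  # 7.02 0.956
```

The printed values agree with the regression line fitted to this data set later in the chapter.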
6. Correlation
o when the value of one variable is related to the value
of another, they are said to be correlated
o there are 3 types of correlation: (i) perfectly
correlated; (ii) partially correlated; (iii) uncorrelated
o Coefficient of Correlation (r) measures such a
relationship

o the value of r ranges from -1 (perfectly correlated in


the negative direction) to +1 (perfectly correlated in
the positive direction)
o when r = 0, the 2 variables are not correlated
7. Coefficient of Determination
o this calculates the proportion of the variation in the
actual values which can be predicted by changes in
the values of the independent variable
o denoted by r², the square of the coefficient of
correlation

o r² ranges from 0 to 1 (r ranges from -1 to +1)


o expressed as a percentage, it represents the
proportion that can be predicted by the regression line
o the value 1 - r² is therefore the proportion contributed
by other factors
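The correlation and determination coefficients can be checked numerically. A minimal Python sketch (an illustration added here, using the Accounting/Statistics marks from the worked example in this chapter):

```python
# Coefficient of correlation r and coefficient of determination r².
import math

X = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]
Y = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]

n = len(X)
sxy = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)
sxx = n * sum(x * x for x in X) - sum(X) ** 2
syy = n * sum(y * y for y in Y) - sum(Y) ** 2

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2), round(r * r, 4))  # 0.92 0.8453
```

These match the values of r and the Coefficient of Determination reported for this example.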
8. Standard Error of Estimate (SEE)
o a measure of the variability of the regression line, i.e.
the dispersion around the regression line
o it tells how much variation there is in the dependent
variable between the raw value and the expected
value in the regression

o this SEE allows us to generate the confidence interval


on the regression line as we did in the estimation of
means
9. Confidence interval for the regression line (estimating
the expected value)
o estimating the mean value of Y for a given value of X
is a very important practical problem
o e.g. if a corporation's profit Y is linearly related to its
advertising expenditures X, the corporation may want
to estimate the mean profit for a given expenditure X
o this is given by the formula

o at n-2 degrees of freedom for the t-distribution


10. Confidence interval for individual prediction
o for technical reason, the above formula must be
amended and is given by

An Example

     X (Accounting)   Y (Statistics)   X²        Y²        XY
1 74.00 81.00 5476.00 6561.00 5994.00
2 93.00 86.00 8649.00 7396.00 7998.00
3 55.00 67.00 3025.00 4489.00 3685.00
4 41.00 35.00 1681.00 1225.00 1435.00
5 23.00 30.00 529.00 900.00 690.00
6 92.00 100.00 8464.00 10000.00 9200.00
7 64.00 55.00 4096.00 3025.00 3520.00
8 40.00 52.00 1600.00 2704.00 2080.00
9 71.00 76.00 5041.00 5776.00 5396.00
10 33.00 24.00 1089.00 576.00 792.00
11 30.00 48.00 900.00 2304.00 1440.00
12 71.00 87.00 5041.00 7569.00 6177.00
Sum 687.00 741.00 45591.00 52525.00 48407.00
Mean 57.25 61.75 3799.25 4377.08 4033.92

Figure 1: Scatter Diagram of Raw Data



Figure 2: Scatter Diagram and Regression Line

Interpretation/Conclusion

There is a linear relation between the results of Accounting and


Statistics as shown from the scatter diagram in Figure 1. A linear
regression analysis was done using the least-square method. The
resultant regression line is represented by Ŷ = 7.02 + 0.956X, in
which X represents the results of Accounting and Y that of
Statistics. Figure 2 shows the regression line. In this example, the
choice of dependent and independent variables is arbitrary. It can
be said that the results of Statistics are correlated to that of
Accounting or vice versa.

The Coefficient of Determination is 0.8453. This shows that the


two variables are correlated. Nearly 85% of the variation in Y is
explained by the regression line.

The Coefficient of Correlation (r) has a value of 0.92. This indicates


that the two variables are positively correlated (Y increases as X
increases).

Method of the Least Square

To a statistician, the line will have a good fit if it minimizes the


error between the estimated points on the line and the actual
observed points that were used to draw it.

One way we can measure the error of our estimating line is to sum
all the individual differences, or errors, between the estimated
points. Let Ŷ be the individual values of the estimated points and Y
be the raw data values.

Figure 1 shows an example.

Figure 1

Y Ŷ diff
8 6 2
1 5 -4
6 4 2
total error 0

Figure 2 shows another example.



Figure 2

Y Ŷ diff
8 2 6
1 5 -4
6 8 -2
total error 0

It also has a zero sum of error as shown from the above table.

A visual comparison between the two figures shows that the


regression line in Figure 1 fits the three data points better than the
line in Figure 2. However, the process of summing the individual
differences in the above 2 tables indicates that both lines describe
the data equally well. Therefore we can conclude that the process
of summing individual differences for calculating the error is not a
reliable way to judge the goodness of fit of an estimating line.

The problem with adding the individual errors is the canceling effect
of the positive and negative values. From this, we might deduce
that the proper criterion for judging the goodness of fit would be to
add the absolute values of each error. The following table shows a
comparison between the absolute values of Figure 1 and Figure 2.

Figure 1 Figure 2
Y Ŷ abs. diff Y Ŷ abs. diff
8 6 2 8 2 6
1 5 4 1 5 4
6 4 2 6 8 2
total error 8 total error 12

Since the absolute error for Figure 1 is smaller than that for Figure
2, we have confirmed our intuitive impression that the estimating
line in Figure 1 is the better fit.

Figure 3 and Figure 4 below show another scenarios.

Figure 3

Figure 4

The following table shows the calculations of absolute values of the


errors.

Figure 3 Figure 4
Y Ŷ abs. diff Y Ŷ abs. diff
4 4 0 4 5 1
7 3 4 7 4 3
2 2 0 2 3 1
total error 4 total error 5

We have added the absolute values of the errors and found that the
estimating line in Figure 3 is a better fit than the line in Figure 4.
Intuitively, however, it appears that the line in Figure 4 is the better
fit line, because it has been moved vertically to take the middle
point into consideration. Figure 3, on the other hand, seems to
ignore the middle point completely.

This is because the sum of the absolute values does not stress the
magnitude of the error.

In effect, we want to find a way to penalize large absolute errors, so that


we can avoid them. We can accomplish this if we square the individual
errors before we add them. Squaring each term accomplishes two goals:

1. It magnifies, or penalizes, the larger errors.

2. It cancels the effect of the positive and negative values (a
negative error squared is still positive).

Figure 3 Figure 4
Y Ŷ abs diff square diff Y Ŷ abs diff square diff
4 4 0 0 4 5 1 1
7 3 4 16 7 4 3 9
2 2 0 0 2 3 1 1
sum of squares 16 sum of squares 11

Applying the Least Square Criterion to the Estimating Lines (Fig 3 &
4)

Since we are looking for the estimating line that minimizes the sum
of the squares of the errors, we call this the Least Squares
Method.





4
STATISTICS - DISPERSION
Frequency distribution

In statistics, a frequency distribution is a list of the values that a


variable takes in a sample. It is usually a list, ordered by quantity,
showing the number of times each value appears. For example, if
100 people rate a five-point Likert scale assessing their agreement
with a statement on a scale on which 1 denotes strong agreement
and 5 strong disagreement, the frequency distribution of their
responses might look like:

Rank Degree of agreement Number


1 Strongly agree 25
2 Agree somewhat 35
3 Not sure 20
4 Disagree somewhat 15
5 Strongly disagree 30

This simple tabulation has two drawbacks. When a variable can


take continuous values instead of discrete values or when the
number of possible values is too large, the table construction is
cumbersome, if it is not impossible. A slightly different tabulation
scheme based on the range of values is used in such cases. For
example, if we consider the heights of the students in a class, the
frequency table might look like below.

Height range Number of students Cumulative Number


4.5-5.0 feet 25 25
5.0-5.5 feet 35 60
5.5-6 feet 20 80
6.0-6.5 feet 20 100

Applications

Managing and operating on frequency tabulated data is much


simpler than operation on raw data. There are simple algorithms to
calculate median, mean, standard deviation etc. from these tables.

Statistical hypothesis testing is founded on the assessment of


differences and similarities between frequency distributions. This

assessment involves measures of central tendency or averages,


such as the mean and median, and measures of variability or
statistical dispersion, such as the standard deviation or variance.

A frequency distribution is said to be skewed when its mean and


median are different. The kurtosis of a frequency distribution is the
concentration of scores at the mean, or how peaked the distribution
appears if depicted graphically—for example, in a histogram. If the
distribution is more peaked than the normal distribution it is said to
be leptokurtic; if less peaked it is said to be platykurtic.

Frequency distributions are also used in frequency analysis to


crack codes and refer to the relative frequency of letters in different
languages.

Class Interval

In statistics, the range of each class of data, used when arranging


large amounts of raw data into grouped data. To obtain an idea of
the distribution, the data are broken down into convenient classes
(commonly 6–16), which must be mutually exclusive and are
usually equal in width to enable histograms to be drawn. The class
boundaries should clearly define the range of each class. When
dealing with discrete data, suitable intervals would be, for example,
0–2, 3–5, 6–8, and so on. When dealing with continuous data,
suitable intervals might be 170 ≤ X < 180, 180 ≤ X < 190, 190 ≤ X <
200, and so on.
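Grouping raw data into class intervals is easy to mechanize. The sketch below (an illustration, with hypothetical height data) uses half-open intervals following the "a ≤ X < b" convention described above:

```python
# Tabulating continuous raw data into class intervals.
heights = [4.7, 4.9, 5.1, 5.2, 5.4, 5.0, 5.6, 5.9, 6.1, 6.3]
classes = [(4.5, 5.0), (5.0, 5.5), (5.5, 6.0), (6.0, 6.5)]

# Count observations falling in each half-open interval [lo, hi).
freq = {c: sum(1 for h in heights if c[0] <= h < c[1]) for c in classes}
for (lo, hi), count in freq.items():
    print(f"{lo}-{hi} feet: {count}")
```

Because the intervals are mutually exclusive and half-open, each observation is counted exactly once.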

Cross tabulation

A cross tabulation (often abbreviated as cross tab) displays the joint


distribution of two or more variables. They are usually presented as
a contingency table in a matrix format. Whereas a frequency
distribution provides the distribution of one variable, a contingency
table describes the distribution of two or more variables
simultaneously. Each cell shows the number of respondents that
gave a specific combination of responses, that is, each cell
contains a single cross tabulation.

The following is a fictitious example of a 3 × 2 contingency table.


The variable “Wikipedia usage” has three categories: heavy user,
light user, and non user. These categories are all inclusive so the
columns sum to 100%. The other variable "underpants" has two
categories: boxers, and briefs. These categories are not all
inclusive so the rows need not sum to 100%. Each cell gives the
percentage of subjects that share that combination of traits.

boxers briefs

heavy Wiki user 70% 5%

light Wiki user 25% 35%

non Wiki user 5% 60%

Cross tabs are frequently used because:

1. They are easy to understand. They appeal to people that do


not want to use more sophisticated measures.
2. They can be used with any level of data: nominal, ordinal,
interval, or ratio - cross tabs treat all data as if it is nominal
3. A table can provide greater insight than single statistics
4. It solves the problem of empty or sparse cells
5. they are simple to conduct
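A cross tabulation is simple to build from paired categorical responses by counting each combination. A sketch with made-up data (the category names mirror the fictitious Wikipedia-usage example above):

```python
# Each observation is a (usage, underpants) pair; counting the pairs
# gives the cells of the contingency table.
from collections import Counter

responses = [
    ("heavy", "boxers"), ("heavy", "boxers"), ("light", "briefs"),
    ("light", "boxers"), ("non", "briefs"), ("non", "briefs"),
]
crosstab = Counter(responses)

print(crosstab[("heavy", "boxers")])  # 2
print(crosstab[("non", "briefs")])    # 2
```

Dividing each cell count by the relevant column total converts the table to the percentage form shown above.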

Statistics related to cross tabulations

The following list is not comprehensive.

Chi-square - This tests the statistical significance of the cross


tabulations. Chi-squared should not be calculated for percentages.
The cross tabs must be converted back to absolute counts
(numbers) before calculating chi-squared. Chi-squared is also
problematic when any cell has a joint frequency of less than five.
For an in-depth discussion of this issue see Fienberg, S.E. (1980).
"The Analysis of Cross-classified Categorical Data." 2nd Edition.
M.I.T. Press, Cambridge, MA.

 Contingency Coefficient - This tests the strength of association


of the cross tabulations. It is a variant of the phi coefficient that
adjusts for statistical significance. Values range from 0 (no
association) to 1 (the theoretical maximum possible
association).
 Cramer’s V - This tests the strength of association of the cross
tabulations. It is a variant of the phi coefficient that adjusts for
the number of rows and columns. Values range from 0 (no
association) to 1 (the theoretical maximum possible
association).
 Lambda Coefficient - This tests the strength of association of
the cross tabulations when the variables are measured at the
nominal level. Values range from 0 (no association) to 1 (the
theoretical maximum possible association). Asymmetric lambda
measures the percentage improvement in predicting the
dependent variable. Symmetric lambda measures the

percentage improvement when prediction is done in both


directions.
 phi coefficient - If both variables instead are nominal and
dichotomous, phi coefficient is a measure of the degree of
association between two binary variables. This measure is
similar to the correlation coefficient in its interpretation. Two
binary variables are considered positively associated if most of
the data falls along the diagonal cells. In contrast, two binary
variables are considered negatively associated if most of the
data falls off the diagonal.
 Kendall tau:
o Tau b - This tests the strength of association of the cross
tabulations when both variables are measured at the ordinal
level. It makes adjustments for ties and is most suitable for
square tables. Values range from -1 (100% negative
association, or perfect inversion) to +1 (100% positive
association, or perfect agreement). A value of zero indicates the
absence of association.
o Tau c - This tests the strength of association of the cross
tabulations when both variables are measured at the ordinal
level. It makes adjustments for ties and is most suitable for
rectangular tables. Values range from -1 (100% negative
association, or perfect inversion) to +1 (100% positive
association, or perfect agreement). A value of zero indicates the
absence of association.
 Gamma - This tests the strength of association of the cross
tabulations when both variables are measured at the ordinal
level. It makes no adjustment for either table size or ties. Values
range from -1 (100% negative association, or perfect inversion)
to +1 (100% positive association, or perfect agreement). A value
of zero indicates the absence of association.
 Uncertainty coefficient, entropy coefficient or Theil's U

Purpose Of A Histogram

A histogram is used to graphically


summarize and display the distribution of a process data set.

Sample Bar Chart Depiction



How To Construct A Histogram


A histogram can be constructed by segmenting the range of the
data into equal sized bins (also called segments, groups or
classes). For example, if your data ranges from 1.1 to 1.8, you
could have equal bins of 0.1 consisting of 1.1 to 1.2, 1.2 to 1.3, 1.3 to
1.4, and so on.

The vertical axis of the histogram is labeled Frequency (the number


of counts for each bin), and the horizontal axis of the histogram is
labeled with the range of your response variable.

You then determine the number of data points that reside within
each bin and construct the histogram. The bins size can be defined
by the user, by some common rule, or by software methods (such
as Minitab).

What Questions The Histogram Answers

What is the most common system response?


What distribution (center, variation and shape) does the data
have?
Does the data look symmetric or is it skewed to the left or right?
Does the data contain outliers?

Geometric mean

The geometric mean, in mathematics, is a type of mean or average,


which indicates the central tendency or typical value of a set of
numbers. It is similar to the arithmetic mean, which is what most
people think of with the word "average," except that instead of
adding the set of numbers and then dividing the sum by the count
of numbers in the set, n, the numbers are multiplied and then the
nth root of the resulting product is taken.

For instance, the geometric mean of two numbers, say 2 and 8, is


just the square root (i.e., the second root) of their product, 16,
which is 4. As another example, the geometric mean of 1, ½, and ¼
is simply the cube root (i.e., the third root) of their product, 0.125,
which is ½.

The geometric mean can be understood in terms of geometry. The


geometric mean of two numbers, a and b, is simply the side length
of the square whose area is equal to that of a rectangle with side
lengths a and b. That is, what is n such that n² = a × b? Similarly,
the geometric mean of three numbers, a, b, and c, is the side length
of a cube whose volume is the same as that of a rectangular prism
with side lengths equal to the three given numbers. This geometric
interpretation of the mean is very likely what gave it its name.

The geometric mean only applies to positive numbers.[1] It is also


often used for a set of numbers whose values are meant to be
multiplied together or are exponential in nature, such as data on the
growth of the human population or interest rates of a financial
investment. The geometric mean is also one of the three classic
Pythagorean means, together with the aforementioned arithmetic
mean and the harmonic mean.

Calculation

The geometric mean of a data set [a1, a2, ..., an] is given by

    (a1 · a2 · ... · an)^(1/n)

The geometric mean of a data set is smaller than or equal to the


data set's arithmetic mean (the two means are equal if and only if
all members of the data set are equal). This allows the definition of
the arithmetic-geometric mean, a mixture of the two which always
lies in between.

The geometric mean is also the arithmetic-harmonic mean in the


sense that if two sequences (an) and (hn) are defined by

    a(n+1) = (an + hn)/2, with a1 = (x + y)/2,

and

    h(n+1) = 2/(1/an + 1/hn), with h1 = 2/(1/x + 1/y),

then an and hn will converge to the geometric mean of x and y.

Relationship with arithmetic mean of logarithms

By using logarithmic identities to transform the formula, we can


express the multiplications as a sum and the power as a
multiplication:

    geometric mean = exp((ln x1 + ln x2 + ... + ln xn)/n)

This is sometimes called the log-average. It is simply computing the


arithmetic mean of the logarithm transformed values of xi (i.e. the
arithmetic mean on the log scale) and then using the

exponentiation to return the computation to the original scale. I.e., it


is the generalised f-mean with f(x) = ln x.

Therefore the geometric mean is related to the log-normal


distribution. The log-normal distribution is a distribution which is
normal for the logarithm transformed values. We see that the
geometric mean is the exponentiated value of the arithmetic mean
of the log transformed values.
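Both routes to the geometric mean, the nth root of the product and the exponentiated log-average, can be sketched in Python (an added illustration using the 1, ½, ¼ example from above):

```python
# Geometric mean two ways: nth root of the product, and the
# exponentiated arithmetic mean of the logs (the "log-average").
import math

data = [1, 0.5, 0.25]

gm_direct = math.prod(data) ** (1 / len(data))
gm_log = math.exp(sum(math.log(x) for x in data) / len(data))

print(gm_direct, gm_log)  # both ~0.5
```

The two computations agree up to floating-point rounding; the log form is preferred in practice because the product of many values can overflow or underflow.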

Harmonic mean

The harmonic mean of the n numbers x1, ..., xn (where xi > 0) is
the number defined by

    H = n/(1/x1 + 1/x2 + ... + 1/xn)

The harmonic mean of a list of numbers may be computed in


Mathematica using HarmonicMean[list].

The special cases of n = 2 and n = 3 are therefore given by

    H2 = 2 x1 x2/(x1 + x2), H3 = 3 x1 x2 x3/(x1 x2 + x1 x3 + x2 x3),

and so on.

The harmonic means of the integers from 1 to n for n = 1, 2, ... are 1,


4/3, 18/11, 48/25, 300/137, 120/49, 980/363, ... (Sloane's A102928
and A001008).

For n = 2, the harmonic mean is related to the arithmetic mean A
and geometric mean G by

    H = G²/A

(Havil 2003, p. 120).
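The harmonic mean and the standard n = 2 identity G² = A·H relating the three Pythagorean means can be verified numerically. A sketch (an added illustration, with the example pair 4 and 16):

```python
# Harmonic mean, with a numerical check of G² = A·H for two numbers.
import math

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

a, b = 4.0, 16.0
A = (a + b) / 2            # arithmetic mean: 10.0
G = math.sqrt(a * b)       # geometric mean:  8.0
H = harmonic_mean([a, b])  # harmonic mean:   6.4

print(A, G, H)
print(abs(G * G - A * H) < 1e-9)  # True
```

Note also H ≤ G ≤ A, with equality only when the two numbers are equal.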

The harmonic mean is the special case M-1 of the power mean and
is one of the Pythagorean means. In older literature, it is sometimes
called the subcontrary mean.

The volume-to-surface area ratio for a cylindrical container with


height h and radius r, and the mean curvature of a general surface,
are related to the harmonic mean.

Hoehn and Niven (1985) show that

for any positive constant c.

Measures of Variability - Variance and Standard Deviation

The measures of central tendency discussed in the last section are


useful because data tend to cluster around central values.
However, as the individual values in a distribution of data differ
from each other, central values provide us only an incomplete
picture of the features of the distribution.

To obtain a more complete picture of the nature of a distribution,


the variability (or dispersion or spread) of the data needs to be
considered.

The measures of variability for data that we look at are: the range,
the mean deviation and the standard deviation.

Range for a Set of Data

Example 4.4.1

When Arnie started attending Fanshawe College, he was keen to


arrive at school on time so he kept a record of his travel times to
get to school each day for the first ten days. The number of minutes
taken each day was:

55 68 83 59 68 75 62 78 97 83 .

This data can be rearranged in ascending order:

55 59 62 68 68 75 78 83 83 97

The range for a set of data is defined as:

range = maximum - minimum

Arnie’s range of travel times = 97 - 55 = 42 minutes.

Interquartile Ranges

We defined the interquartile range for a set of data earlier. The


quartiles Q1, Q2 and Q3 divide a body of data into four equal parts.
Q1 is called the lower quartile and contains the lower 25% of the
data.
Q2 is the median
Q3 is called the upper quartile and contains the upper 25% of the
data.

The interquartile range is Q3 - Q1 and is called the IQR. The IQR


contains the middle 50% of the data.

A list of values must be written in order from least to greatest before


the quartile values can be determined. If the quartile division
comes between two values, the quartile value is the average of the
two.

Arnie's travel times in ascending order are:

55 59 62 68 68 75 78 83 83 97

To find the value of Q1, first find its position. There are 10 data
items, so the position for Q1 is 10/4 = 2.5. Since this is between
items 2 and 3 we take the average of items 2 and 3.

Therefore, Q1 = 60.5.

Similarly the median is the average of 68 and 75. The median is


71.5.

The position of Q3 from the upper end of the data is again 2.5. The
average of 83 and 83 is 83.

In summary we have: Q1 = 60.5, Q2 = 71.5 and Q3 = 83.
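The quartile calculation above can be sketched in Python using the same positional rule (average the two values straddling the n/4 position). Note that statistical software often uses slightly different interpolation rules, so other tools may give slightly different quartiles:

```python
# Quartiles of Arnie's commuting times by the positional rule above.
times = sorted([55, 68, 83, 59, 68, 75, 62, 78, 97, 83])

def avg_of(items, i, j):        # i, j are 1-based positions
    return (items[i - 1] + items[j - 1]) / 2

q1 = avg_of(times, 2, 3)        # position 10/4 = 2.5 -> items 2 and 3
q2 = avg_of(times, 5, 6)        # the median
q3 = avg_of(times, 8, 9)        # position 2.5 from the upper end
print(q1, q2, q3, q3 - q1)      # 60.5 71.5 83.0 22.5
```

The interquartile range for this data is therefore 83 - 60.5 = 22.5 minutes.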

Arnie's average or mean travel time is 72.8 minutes.

The term deviation refers to the difference between the value of an


individual observation in a data set and the mean of the data set.
Since the mean is a central value, some of the deviations will be
positive and some will be negative. The sum of all of the deviations
from the mean is always zero. Therefore, the absolute value of the
deviations from the mean are used to calculate the mean deviation
for a set of data.

The mean deviation for a set of data is the average of the absolute
values of all the deviations.

Example 4.4.2

Compute the mean deviation for the traveling times for Arnie in
Example 4.1.1

The mean deviation for Arnie's traveling times is computed in the


table below:

Commuting Time, x (minutes)   Deviation from Mean   Absolute Deviation from the Mean
55 -17.8 17.8
59 -13.8 13.8
62 -10.8 10.8
68 -4.8 4.8
68 -4.8 4.8
75 2.2 2.2
78 5.2 5.2
83 10.2 10.2
83 10.2 10.2
97 24.2 24.2
Sum of deviations = Sum of Absolute
0 Deviations = 104
mean = 72.8 minutes

Mean Deviation = Sum of Absolute Deviations/n = 104/10 = 10.4

On the average, each travel time varies 10.4 minutes from the
mean travel time of 72.8 minutes.
The mean deviation tells us that, on average, Arnie’s commuting
time is 72.8 minutes and that, on average, each commuting time
deviates from the mean by 10.4 minutes.

If classes start at 8:00 a.m., can you suggest a time at which Arnie
should leave each day?
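The table's mean deviation computation can be reproduced directly. A short Python sketch (an added illustration of the calculation above):

```python
# Mean deviation for Arnie's commuting times.
times = [55, 59, 62, 68, 68, 75, 78, 83, 83, 97]

mean = sum(times) / len(times)
mean_dev = sum(abs(t - mean) for t in times) / len(times)

print(round(mean, 1), round(mean_dev, 1))  # 72.8 10.4
```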
The Variance and Standard Deviation
The variance is the sum of the squared deviations from the mean
divided by n, where n is the number of items in the data.

The standard deviation is the square root of the variance. It is the


most widely used measure of variability for a set of data.

The formulas for variance and for standard deviation are given by:

    s² = Σ(x - x̄)²/n and s = √(Σ(x - x̄)²/n)

Example 4.4.3
Compute the variance and standard deviation for Arnie's traveling
times as given in Example 4.4.1.

Solution:
Commuting Times - Arnie from Home to Fanshawe

Commuting Time (minutes)   Deviation from Mean   Deviation Squared

55 -17.8 316.84
59 -13.8 190.44
62 -10.8 116.64
68 -4.8 23.04
68 -4.8 23.04
75 2.2 4.84
78 5.2 27.04
83 10.2 104.04
83 10.2 104.04
97 24.2 585.64
mean = 72.8 Sum = 1495.6

Sample Variance: s² = 1495.6/10 = 149.56
Sample Standard Deviation: s = √149.56 ≈ 12.23
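The variance and standard deviation calculation can be checked with a few lines of Python (an added sketch; it divides by n, as the formula in this section does, though many texts divide by n - 1 for a sample):

```python
# Sample variance and standard deviation of Arnie's commuting times.
import math

times = [55, 59, 62, 68, 68, 75, 78, 83, 83, 97]

mean = sum(times) / len(times)
variance = sum((t - mean) ** 2 for t in times) / len(times)
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))  # 149.56 12.23
```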

Problem 4.4

The following data gives the number of home runs that Babe Ruth
hit in each of his 15 years with the New York Yankees baseball
team from 1920 to 1934:

54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

The following are the number of home runs that Roger Maris hit in
each of the ten years he played in the major leagues from 1957 on
[data are already arrayed]:

8 13 14 16 23 26 28 33 39 61

Analyze the data in terms of consistency of performance.

Calculate the mean and standard deviation for each player's data
and comment on the consistency of performance of each player.

Mean, Median, Mode, and Range

Mean, median, and mode are three kinds of "averages". There are
many "averages" in statistics, but these are, I think, the three most
common, and are certainly the three you are most likely to
encounter in your pre-statistics courses, if the topic comes up at all.

The "mean" is the "average" you're used to, where you add up all
the numbers and then divide by the number of numbers. The
"median" is the "middle" value in the list of numbers. To find the
median, your numbers have to be listed in numerical order, so you
may have to rewrite your list first. The "mode" is the value that

occurs most often. If no number is repeated, then there is no mode


for the list.

The "range" is just the difference between the largest and smallest
values.

 Find the mean, median, mode, and range for the following
list of values:

13, 18, 13, 14, 13, 16, 14, 21, 13

The mean is the usual average, so:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean isn't a value from the original list. This is
a common result. You should not assume that your mean
will be one of your original numbers.

The median is the middle value, so I'll have to rewrite the list
in order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be
the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than


any other, so 13 is the mode.

The largest value in the list is 21, and the smallest is 13, so
the range is 21 – 13 = 8.

mean: 15
median: 14
mode: 13
range: 8

Note: The formula for the place to find the median is "( [the number
of data points] + 1) ÷ 2", but you don't have to use this formula. You
can just count in from both ends of the list until you meet in the
middle, if you prefer. Either way will work.
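Python's standard library computes these same "averages" directly. A sketch (an added illustration using the first worked list above):

```python
# Mean, median, mode, and range via the statistics module.
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)           # 15
median = statistics.median(values)       # 14
mode = statistics.mode(values)           # 13
value_range = max(values) - min(values)  # 8

print(mean, median, mode, value_range)
```

Note that `statistics.mode` raises an error on some Python versions (before 3.8) when no value repeats; `statistics.multimode` returns all tied modes.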
mca-5 400

 Find the mean, median, mode, and range for the following
list of values:

1, 2, 4, 7

The mean is the usual average: (1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4


= 3.5

The median is the middle number. In this example, the


numbers are already listed in numerical order, so I don't
have to rewrite the list. But there is no "middle" number,
because there are an even number of numbers. In this case,
the median is the mean (the usual average) of the middle
two values: (2 + 4) ÷ 2 = 6 ÷ 2 = 3

The mode is the number that is repeated most often, but all
the numbers appear only once. Then there is no mode.

The largest value is 7, the smallest is 1, and their difference


is 6, so the range is 6.

mean: 3.5
median: 3
mode: none
range: 6

The list values were whole numbers, but the mean was a decimal
value. Getting a decimal value for the mean (or for the median, if
you have an even number of data points) is perfectly okay; don't
round your answers to try to match the format of the other numbers.

 Find the mean, median, mode, and range for the following
list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13

The mean is the usual average:

(8 + 9 + 10 + 10 + 10 + 11 + 11 + 11 + 12 + 13) ÷ 10
= 105 ÷ 10 = 10.5

The median is the middle value. In a list of ten values, that


will be the (10 + 1) ÷ 2 = 5.5th value; that is, I'll need to
average the fifth and sixth numbers to find the median:

(10 + 11) ÷ 2 = 21 ÷ 2 = 10.5



The mode is the number repeated most often. This list has
two values that are repeated three times.

The largest value is 13 and the smallest is 8, so the range is


13 – 8 = 5.

mean: 10.5
median: 10.5
modes: 10 and 11
range: 5

While unusual, it can happen that two of the averages (the mean
and the median, in this case) will have the same value.

Note: Depending on your text or your instructor, the above data set
may be viewed as having no mode (rather than two modes), since
no single solitary number was repeated more often than any other.
I've seen books that go either way; there doesn't seem to be a
consensus on the "right" definition of "mode" in the above case. So
if you're not certain how you should answer the "mode" part of the
above example, ask your instructor before the next test.

About the only hard part of finding the mean, median, and mode is
keeping straight which "average" is which. Just remember the
following:

mean: regular meaning of "average"


median: middle value
mode: most often

(In the above, I've used the term "average" rather casually. The
technical definition of "average" is the arithmetic mean: adding up
the values and then dividing by the number of values. Since you're
probably more familiar with the concept of "average" than with
"measure of central tendency", I used the more comfortable term.)

 A student has gotten the following grades on his tests: 87,
95, 76, and 88. He wants an 85 or better overall. What is the
minimum grade he must get on the last test in order to
achieve that average?

The unknown score is "x". Then the desired average is:

(87 + 95 + 76 + 88 + x) ÷ 5 = 85

Multiplying through by 5 and simplifying, I get:



87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79

He needs to get at least a 79 on the last test.
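The same algebra can be wrapped in a small helper. A sketch (the function name is mine):

```python
def minimum_score(scores, target, total_tests):
    """Lowest score on the remaining test that achieves the target mean.

    Solves (sum(scores) + x) / total_tests = target for x,
    i.e. x = target * total_tests - sum(scores).
    """
    return target * total_tests - sum(scores)

print(minimum_score([87, 95, 76, 88], 85, 5))   # 79
```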

Have you understood?

Q1. The number of cars sold by each of the 10 salespeople in an
automobile dealership during a particular month, arranged in
ascending order, is: 2, 4, 7, 10, 10, 10, 12, 12, 14, 15. Determine the
a) range, b) interquartile range, and c) middle 80 percent range for
these data.

Q2. The weights of a sample of outgoing packages in a mailroom,
weighed to the nearest ounce, are found to be:
21, 18, 30, 12, 14, 17, 28, 10, 16, 25 oz. Determine the a) range
and b) interquartile range for these weights.

Q3. The number of accidents which occurred during a given month
in the 13 manufacturing departments of an industrial plant was:
2, 0, 0, 3, 3, 12, 1, 0, 8, 1, 0, 5, 1. Determine the a) range and
b) interquartile range for the number of accidents.





5
STATISTICS - PROPORTION
Estimation

Introduction

A quality assurance manager may be interested in estimating the
proportion defective of the finished product before shipment to the
customer. A manager of the credit department needs to estimate the
average collection period for collecting dues from the customers.
How confident are they in their estimates? An estimator is a sample
statistic used to estimate a population parameter. For example, the
sample mean is the estimator of the population mean, and the sample
proportion is the estimator of the population proportion. An estimate is
a specific observed value of a statistic.

Learning objectives

After reading this unit, you will be able to:

Define and compute point estimation

Define and compute interval estimation

Determine sample size based on confidence interval

Criteria for selecting an estimator

Unbiasedness: An estimator is unbiased if its expected value
equals the parameter for all sample sizes.

Relative efficiency: An unbiased estimator is relatively efficient
when its standard error is smaller than that of another unbiased
estimator of the same parameter.

Consistency: An estimator is consistent if the probability that its
value is very near the parameter's value approaches 1 as the
sample size increases.

Sufficiency: An estimator is sufficient if it makes so much use of the
information in the sample that no other estimator could extract from
the sample additional information about the population parameter
being estimated.

Types of estimation:

There are two types of estimation. They are

a) Point estimation

b) Interval estimation

A point estimate is a single-valued estimate. For example,
estimating the population mean to be 410, the observed value of
the sample mean.

An interval estimate is an estimate that is a range of values. For
example, estimating the population mean to be between 400 and 420.

Sample Proportions and Point Estimation

Sample Proportions

Let p be the proportion of successes in a sample from a population
whose true proportion of successes is π, and let μp be the mean of
p and σp its standard deviation.

Then

The Central Limit Theorem For Proportions

1. μp = π

2. σp = √(π(1 − π)/n)

3. For n large, p is approximately normal.

Example

Consider the next census. Suppose we are interested in the
proportion of Americans that are below the poverty level. Instead of
attempting to find all Americans, Congress has proposed to perform
statistical sampling. We can concentrate on 10,000 randomly
selected people from 1000 locations. We can determine the
proportion of people below the poverty level in each of these
regions. Suppose this proportion is .08. Then the mean for the
sampling distribution is

μp = 0.08

and the standard deviation is

σp = √(0.08 · 0.92/10,000) ≈ 0.0027

Point Estimations

A Point Estimate is a statistic that gives a plausible estimate for the
value in question.

Example

x is a point estimate for μ

s is a point estimate for σ

A point estimate is unbiased if its mean represents the value that it
is estimating.

Confidence Intervals for a Mean

Point Estimations

Usually, we do not know the population mean and standard
deviation. Our goal is to estimate these numbers. The standard
way to accomplish this is to use the sample mean and standard
deviation as a best guess for the true population mean and
standard deviation. We call this "best guess" a point estimate.


Confidence Intervals

We are not only interested in finding the point estimate for the
mean, but also in determining how accurate the point estimate is.
The Central Limit Theorem plays a key role here. We assume that
the sample standard deviation is close to the population standard
deviation (which will almost always be true for large samples).
Then the Central Limit Theorem tells us that the standard deviation
of the sampling distribution is

σx = s/√n

We will be interested in finding an interval around x such that there
is a large probability that the actual mean falls inside of this
interval. This interval is called a confidence interval and the large
probability is called the confidence level.

Example

Suppose that we check for clarity in 50 locations in Lake Tahoe and
discover that the average depth of clarity of the lake is 14 feet with
a standard deviation of 2 feet. What can we conclude about the
average clarity of the lake with a 95% confidence level?

Solution

We can use x to provide a point estimate for μ and s to provide a
point estimate for σ. How accurate is x as a point estimate? We
construct a 95% confidence interval for μ as follows. We draw the
picture and realize that we need to use the table to find the z-score
associated to the probability of .025 (there is .025 to the left and
.025 to the right).

We arrive at z = -1.96. Now we solve for x:

-1.96 = (x − 14)/(2/√50) = (x − 14)/0.28

Hence

x - 14 = -0.55

We say that 0.55 is the margin of error.

We have that a 95% confidence interval for the mean clarity is

(13.45,14.55)

In other words, there is a 95% chance that the mean clarity is
between 13.45 and 14.55.
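This recipe can be sketched in Python; `NormalDist().inv_cdf` (Python 3.8+) supplies the z-value instead of a printed table:

```python
from statistics import NormalDist
from math import sqrt

def mean_ci(xbar, s, n, confidence=0.95):
    """Large-sample confidence interval for a mean: xbar ± zc * s / sqrt(n)."""
    zc = NormalDist().inv_cdf((1 + confidence) / 2)   # about 1.96 for 95%
    e = zc * s / sqrt(n)                              # margin of error
    return xbar - e, xbar + e

lo, hi = mean_ci(14, 2, 50)
print(round(lo, 2), round(hi, 2))   # 13.45 14.55
```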

In general, if zc is the z value associated with c%, then a c%
confidence interval for the mean is

x ± zc s/√n

Confidence Interval for a Small Sample

When the population is normal the sampling distribution will also be
normal, but the use of s to replace σ is not that accurate. The
smaller the sample size, the worse the approximation will be.
Hence we can expect that some adjustment will be made based on
the sample size. The adjustment we make is that we do not use
the normal curve for this approximation. Instead, we use the
Student t distribution that is based on the sample size. We proceed
as before, but we change the table that we use. This distribution
looks like the normal distribution, but as the sample size decreases
it spreads out. For large n it nearly matches the normal curve. We
say that the distribution has n - 1 degrees of freedom.

Example

Suppose that we conduct a survey of 19 millionaires to find out
what percent of their income the average millionaire donates to
charity. We discover that the mean percent is 15 with a standard
deviation of 5 percent. Find a 95% confidence interval for the mean
percent.

Solution

We use the formula:

x ± tc s/√n    (Notice the t instead of the z)


We get

15 ± tc · 5/√19

Since n = 19, there are 18 degrees of freedom. Using the table in
the back of the book, we have that

tc = 2.10

Hence the margin of error is

2.10 · 5/√19 ≈ 2.4

We can conclude with 95% confidence that the millionaires donate
between 12.6% and 17.4% of their income to charity.
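A sketch of the same computation; since the standard library has no t-table, tc is passed in by hand from the table (here 2.10 for 18 degrees of freedom):

```python
from math import sqrt

def mean_ci_t(xbar, s, n, tc):
    """Small-sample CI: xbar ± tc * s / sqrt(n), tc from a t-table (n-1 df)."""
    e = tc * s / sqrt(n)
    return xbar - e, xbar + e

# 19 millionaires, mean 15%, s = 5%; tc = 2.10 for 18 df at 95%.
lo, hi = mean_ci_t(15, 5, 19, 2.10)
print(round(lo, 1), round(hi, 1))   # 12.6 17.4
```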

Confidence Intervals For Proportions and Choosing the Sample Size

A Large Sample Confidence Interval for a Population Proportion

Recall that a confidence interval for a population mean is given by

Confidence Interval for a Population Mean

x ± zc s/√n

We can make a similar construction for a confidence interval for a
population proportion. Instead of x, we can use p, and instead of s,
we use √(p(1 − p)/n); hence, we can write the confidence interval for a
large sample proportion as

Confidence Interval for a Population Proportion

p ± zc √(p(1 − p)/n)

Example

1000 randomly selected Americans were asked if they believed the
minimum wage should be raised. 600 said yes. Construct a 95%
confidence interval for the proportion of Americans who believe that
the minimum wage should be raised.

Solution:

We have

p = 600/1000 = .6, zc = 1.96, and n = 1000

We calculate:

.6 ± 1.96 √(.6 · .4/1000) = .6 ± .03

Hence we can conclude that between 57 and 63 percent of all
Americans agree with the proposal. In other words, with a margin of
error of .03, 60% agree.
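The same interval, scripted as a minimal sketch:

```python
from statistics import NormalDist
from math import sqrt

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample CI for a proportion: p ± zc * sqrt(p(1-p)/n)."""
    p = successes / n
    zc = NormalDist().inv_cdf((1 + confidence) / 2)
    e = zc * sqrt(p * (1 - p) / n)
    return p - e, p + e

lo, hi = proportion_ci(600, 1000)
print(round(lo, 2), round(hi, 2))   # 0.57 0.63
```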

Calculating n for Estimating a Mean

Example

Suppose that you were interested in the average number of units
that students take at a two year college to get an AA degree.
Suppose you wanted to find a 95% confidence interval with a
margin of error of .5 for μ, knowing σ = 10. How many people
should we ask?

Solution

Solving for n in

Margin of Error = E = zc σ/√n

we have

E √n = zc σ, so √n = zc σ/E

Squaring both sides, we get

n = (zc σ/E)²

For this example, n = (1.96 · 10/.5)² ≈ 1536.6, so we should ask at
least 1537 students.

Example

A Subaru dealer wants to find out the age of their customers (for
advertising purposes). They want the margin of error to be 3 years.
If they want a 90% confidence interval, how many people do
they need to survey?

Solution:

We have

E = 3, zc = 1.65

but there is no way of finding sigma exactly. They use the following
reasoning: most car customers are between 16 and 68 years old,
hence the range is

Range = 68 - 16 = 52

The range covers about four standard deviations, hence one
standard deviation is about

σ ≈ 52/4 = 13

We can now calculate n:

n = (zc σ/E)² = (1.65 · 13/3)² ≈ 51.1

Hence the dealer should survey at least 52 people.
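The sample-size formula is a one-liner to script; rounding up is the natural choice, since a fractional person cannot be surveyed:

```python
from math import ceil

def sample_size_for_mean(zc, sigma, margin):
    """n = (zc * sigma / E)^2, rounded up to the next whole respondent."""
    return ceil((zc * sigma / margin) ** 2)

print(sample_size_for_mean(1.65, 13, 3))    # 52  (Subaru example)
print(sample_size_for_mean(1.96, 10, 0.5))  # 1537 (AA-degree example)
```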

Finding n to Estimate a Proportion

Example

Suppose that you are in charge of seeing if dropping a computer will
damage it. You want to find the proportion of computers that break.
If you want a 90% confidence interval for this proportion, with a
margin of error of 4%, how many computers should you drop?

Solution

The formula states that

E = zc √(p(1 − p)/n)

Squaring both sides, we get that

E² = zc² p(1 − p)/n

Multiplying by n, we get

nE² = zc² [p(1 − p)]

This is the formula for finding n:

n = zc² p(1 − p)/E²

Since we do not know p, we use .5 (a conservative estimate):

n = (1.65)²(.5)(.5)/(.04)² ≈ 425.4

We round 425.4 up to 426 for greater accuracy.
We will need to drop at least 426 computers. This could get
expensive.
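A sketch of the same calculation; p = 0.5 is the conservative default from the text:

```python
from math import ceil

def sample_size_for_proportion(zc, margin, p=0.5):
    """n = zc^2 * p(1-p) / E^2; p = 0.5 maximizes p(1-p), the safe choice."""
    return ceil(zc ** 2 * p * (1 - p) / margin ** 2)

print(sample_size_for_proportion(1.65, 0.04))   # 426
```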

Estimating Differences

Difference Between Means

I surveyed 50 people from a poor area of town and 70 people from
an affluent area of town about their feelings towards minorities. I
counted the number of negative comments made. I was interested
in comparing their attitudes. The average number of negative
comments in the poor area was 14 and in the affluent area was 12.
The standard deviations were 5 and 4 respectively. Let's
determine a 95% confidence interval for the difference in mean
negative comments. First, we need some formulas.

Theorem

The distribution of the difference of means x1 - x2 has mean

μ1 - μ2

and standard deviation

√(σ1²/n1 + σ2²/n2)

For our investigation, we use s1 and s2 as point estimates for σ1
and σ2. We have

x1 = 14, x2 = 12, s1 = 5, s2 = 4, n1 = 50, n2 = 70

Now calculate

x1 - x2 = 14 - 12 = 2

s = √(5²/50 + 4²/70) ≈ 0.85

The margin of error is

E = zc s = (1.96)(0.85) = 1.7



The confidence interval is

2 ± 1.7

or

[0.3, 3.7]

We can conclude that the mean difference between the number of
negative comments that poor and wealthy people make is between
0.3 and 3.7.

Note: To calculate the degrees of freedom, we can take the
smaller of the two numbers n1 - 1 and n2 - 1. So in the prior
example, a better estimate would use 49 degrees of freedom. The
t-table gives a value of 2.014 for the t.95 value and the margin of
error is

E = zcs = (2.014)(0.85) = 1.7119

which still rounds to 1.7. This is an example that demonstrates that
using the t-table and z-table for large samples results in practically
the same results.
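The large-sample interval for a difference of means can be sketched directly:

```python
from math import sqrt

def diff_means_ci(x1, s1, n1, x2, s2, n2, zc=1.96):
    """Large-sample CI for mu1 - mu2: (x1 - x2) ± zc*sqrt(s1²/n1 + s2²/n2)."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # standard error of the difference
    d = x1 - x2
    e = zc * se
    return d - e, d + e

lo, hi = diff_means_ci(14, 5, 50, 12, 4, 70)
print(round(lo, 1), round(hi, 1))   # 0.3 3.7
```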

Small Samples With Pooled Standard Deviations (Optional)

When either sample size is small, we can still run the statistics
provided the distributions are approximately normal. If in addition
we know that the two standard deviations are approximately equal,
then we can pool the data together to produce a pooled standard
deviation. We have the following theorem.

Pooled Estimate of σ

sp = √(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2))

with n1 + n2 - 2 degrees of freedom

You've gotta love the beautiful formula!

Note
After finding the pooled estimate, we have that a confidence interval
is given by

(x1 − x2) ± tc sp √(1/n1 + 1/n2)

Example

What is the difference between commuting patterns for students
and professors? 11 students and 14 professors took part in a study
to find mean commuting distances. The mean number of miles
traveled by students was 5.6 and the standard deviation was 2.8.
The mean number of miles traveled by professors was 14.3 and the
standard deviation was 9.1. Construct a 95% confidence interval
for the difference between the means. What assumption have we
made?

Solution

We have

x1 = 5.6 x2 = 14.3 s1 = 2.8 s2 = 9.1

n1 = 11 n2 = 14

The pooled standard deviation is

sp = √((10 · 2.8² + 13 · 9.1²)/23) ≈ 7.09

The point estimate for the mean difference is

14.3 - 5.6 = 8.7

and

√(1/11 + 1/14) ≈ 0.403

Use the t-table to find tc for a 95% confidence interval with 23
degrees of freedom and find

tc = 2.07

8.7 ± (2.07)(7.09)(0.403) = 8.7 ± 5.9


The range of values is [2.8, 14.6]

The difference in average miles driven by students and professors
is between 2.8 and 14.6. We have assumed that the standard
deviations are approximately equal and the two distributions are
approximately normal.
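The pooled interval is easy to script as well; tc again comes from the t-table by hand (2.07 for 23 degrees of freedom):

```python
from math import sqrt

def pooled_ci(x1, s1, n1, x2, s2, n2, tc):
    """Small-sample CI with a pooled standard deviation.

    Assumes both populations are normal with equal standard deviations.
    """
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    e = tc * sp * sqrt(1 / n1 + 1 / n2)
    d = x2 - x1
    return d - e, d + e

# 11 students (mean 5.6, s 2.8) vs. 14 professors (mean 14.3, s 9.1).
lo, hi = pooled_ci(5.6, 2.8, 11, 14.3, 9.1, 14, 2.07)
print(round(lo, 1), round(hi, 1))   # 2.8 14.6
```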

Difference Between Proportions

So far, we have discussed the difference between two means (both
large and small samples). Our next task is to estimate the
difference between two proportions. We have the following
theorem.

And a confidence interval for the difference of proportions is

Confidence Interval for the Difference of Proportions

(p1 − p2) ± zc √(p1q1/n1 + p2q2/n2)

Note: in order for this to be valid, we need all four of the quantities

p1n1, p2n2, q1n1, q2n2

to be greater than 5.

Example

300 men and 400 women were asked how they felt about taxing
Internet sales. 75 of the men and 90 of the women agreed with
having a tax. Find a confidence interval for the difference in
proportions.

Solution

We have

p1 = 75/300 = .25 q1 = .75 n1 = 300

p2 = 90/400 = .225 q2 = .775 n2 = 400

We can calculate

p2 − p1 = 0.225 − 0.25 = −0.025, with margin of error
1.96 √(.25 · .75/300 + .225 · .775/400) ≈ 0.06

We can conclude that the difference in opinions (women minus men)
is between -8.5% and 3.5%.
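Computed without intermediate rounding of the margin, the endpoints come out slightly wider; a sketch (here taking men minus women, the order in the formula above):

```python
from statistics import NormalDist
from math import sqrt

def diff_proportions_ci(r1, n1, r2, n2, confidence=0.95):
    """CI for p1 - p2: (p1 - p2) ± zc * sqrt(p1*q1/n1 + p2*q2/n2)."""
    p1, p2 = r1 / n1, r2 / n2
    zc = NormalDist().inv_cdf((1 + confidence) / 2)
    e = zc * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - e, d + e

lo, hi = diff_proportions_ci(75, 300, 90, 400)
print(round(lo, 3), round(hi, 3))   # -0.039 0.089
```

Since 0 falls inside the interval either way, the data do not show a clear difference of opinion.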

Confidence interval for a mean when the population S.D. is known

Solve:

As the owner of Custom Travel, you want to estimate the mean time
that it takes a travel agent to make the initial arrangements for a
vacation package. You have asked your office manager to take a
random sample of 40 vacation requests and to observe how long it
takes to complete the initial arrangements. The office manager
reported a mean time of 23.4 min. You want to estimate the true
mean time using a 95% confidence level. Previous time studies
indicate that the S.D. of times is a relatively constant 9.8 min.

Solve:

A machine produces components which have a standard deviation
of 1.6 cm in length. A random sample of 64 parts is selected from
the output and this sample has a mean length of 90 cm. The
customer will reject the part if it is either less than 88 cm or more
than 92 cm. Does the 95% confidence interval for the true mean
length of all components produced ensure acceptance by the
customer?

Confidence interval for the population proportion for large samples

Solve:

In a health survey involving a random sample of 75 patients who
developed a particular illness, 70% of them are cured of this illness
by a new drug. Establish the 95% confidence interval for the
population proportion of all the patients who will be cured by the
new drug. This would help a pharmaceutical company assess the
market potential for this new drug.

Confidence interval for population means for small samples using
the t-distribution

Solve:

The average travel time, based on a random sample of 10 people
working in a company, to reach the office is 40 minutes with a
standard deviation of 10 minutes. Establish the 95% confidence
interval for the mean travel time of everyone in the company and
re-design the working hours.

Determining the sample size using confidence intervals

Solve:
A marketing manager of a fast food restaurant in a city wishes to
estimate the average yearly amount that families spend on fast
food restaurants. He wants the estimate to be within ± Rs 100 with
a confidence level of 99%. It is known from an earlier pilot study
that the standard deviation of the family expenditure on fast food
restaurants is Rs 500. How many families must be chosen for this
problem?

Sample size determination: population proportion

Solve:

A company manufacturing sports goods wants to estimate the
proportion of cricket players among high school students in India.
The company wants the estimate to be within ± 0.03 with a
confidence level of 99%. A pilot study done earlier reveals that out
of 80 high school students, 36 students play cricket. What should be
the sample size for this study?
Summary

This unit has given a conceptual framework of statistical estimation.
In particular, this unit has focused on the following:

 The definition and meaning of point estimation for the
population mean and population proportion.
 The role of the sample mean and sample proportion in
estimating the population mean and population proportion,
with their property of unbiasedness.
 The conceptual framework of interval estimation with its key
elements.
 The methodology for establishing the confidence interval for
the population mean and the population proportion based on
the sample mean and the sample proportion.
 Examples giving the 95% and 99% confidence interval for
the population mean and the population proportion for large
samples.
 Establishing confidence intervals for small samples using the t
distribution, after explaining the role of degrees of freedom in
computing the value of t.
 Determining the optimal sample size based on precision,
confidence level, and knowledge about the population
standard deviation.




6
STATISTICS - HYPOTHESIS
Hypothesis Testing

Whenever we have a decision to make about a population
characteristic, we make a hypothesis. Some examples are:

 >3

or

 5.

Suppose that we want to test the hypothesis that μ ≠ 5. Then we
can think of our opponent suggesting that μ = 5. We call the
opponent's hypothesis the null hypothesis and write:

H0: μ = 5

and our hypothesis the alternative hypothesis and write

H1: μ ≠ 5

For the null hypothesis we always use equality, since we are
comparing μ with a previously determined mean.

For the alternative hypothesis, we have the choices: <, >, or ≠.

Procedures in Hypothesis Testing

When we test a hypothesis we proceed as follows:

1. Formulate the null and alternative hypotheses.

2. Choose a level of significance.

3. Determine the sample size. (Same as for confidence intervals.)

4. Collect data.

5. Calculate the z (or t) score.

6. Utilize the table to determine if the z score falls within the
acceptance region.

7. Decide to

a. Reject the null hypothesis and therefore accept the
alternative hypothesis, or

b. Fail to reject the null hypothesis and therefore state
that there is not enough evidence to suggest the truth
of the alternative hypothesis.

Errors in Hypothesis Tests

We define a type I error as the event of rejecting the null hypothesis
when the null hypothesis was true. The probability of a type I error
(α) is called the significance level.

We define a type II error (with probability β) as the event of failing
to reject the null hypothesis when the null hypothesis was false.

Example

Suppose that you are a lawyer trying to establish that a
company has been unfair to minorities with regard to salary
increases. Suppose the mean salary increase per year is 8%.

You set the hypotheses to be

H0: μ = .08

H1: μ < .08

Q. What is a type I error?

A. We put sanctions on the company, when they were not being
discriminatory.

Q. What is a type II error?

A. We allow the company to go about its discriminatory ways.


Note: A larger α results in a smaller β, and a smaller α results in a
larger β.

Hypothesis Testing For a Population Mean

The Idea of Hypothesis Testing

Suppose we want to show that only children have an average
higher cholesterol level than the national average. It is known that
the mean cholesterol level for all Americans is 190. Construct the
relevant hypothesis test:

H0: μ = 190

H1: μ > 190

We test 100 only children and find that

x = 198

and

s = 15.

Do we have evidence to suggest that only children have an
average higher cholesterol level than the national average? We
have

z = (198 − 190)/(15/√100) = 5.33

z is called the test statistic.

Since z is so high, the probability that H0 is true is so small that we
decide to reject H0 and accept H1. Therefore, we can conclude that
only children have a higher cholesterol level on average than the
national average.

Rejection Regions

Suppose that α = .05. We can draw the appropriate picture and
find the z-scores that cut off .025 in each tail. We call the outside
regions the rejection regions.

We call the blue areas the rejection region since, if the value of z
falls in these regions, we can say that the null hypothesis is very
unlikely, so we can reject the null hypothesis.

Example

50 smokers were questioned about the number of hours they sleep
each day. We want to test the hypothesis that smokers need
less sleep than the general public, which needs an average of 7.7
hours of sleep. We follow the steps below.

A. Compute a rejection region for a significance level of .05.

B. If the sample mean is 7.5 and the standard deviation is .5,
what can you conclude?

Solution

First, we write down the null and alternative hypotheses

H0: μ = 7.7 H1: μ < 7.7

This is a left-tailed test. The z-score that corresponds to .05 is
-1.645. The critical region is the area that lies to the left of -1.645. If
the z-value is less than -1.645, we will reject the null hypothesis
and accept the alternative hypothesis. If it is greater than -1.645, we
will fail to reject the null hypothesis and say that the test was not
statistically significant.

We have

z = (7.5 − 7.7)/(0.5/√50) = −2.83

Since -2.83 is to the left of -1.645, it is in the critical region. Hence
we reject the null hypothesis and accept the alternative hypothesis.
We can conclude that smokers need less sleep.
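A sketch of this one-sample, left-tailed z test, with the critical value computed rather than read from the table:

```python
from statistics import NormalDist
from math import sqrt

def z_test_mean(xbar, mu0, s, n):
    """One-sample z statistic: (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

z = z_test_mean(7.5, 7.7, 0.5, 50)
crit = NormalDist().inv_cdf(0.05)        # left-tail critical value, about -1.645
print(round(z, 2), z < crit)             # -2.83 True  (reject H0)
```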

p-values

There is another way to interpret the test statistic. In hypothesis
testing, we make a yes or no decision without discussing borderline
cases. For example, with α = .06, a two-tailed test will indicate
rejection of H0 for a test statistic of z = 2 or for z = 6, but z = 6 is
much stronger evidence than z = 2. To show this difference we
write the p-value, which is the lowest significance level such that we
will still reject H0. For a two-tailed test, we use twice the table
value to find p, and for a one-tailed test, we use the table value.

Example:

Suppose that we want to test the hypothesis, with a significance
level of .05, that the climate has changed since industrialization.
Suppose that the mean temperature throughout history is 50
degrees. During the last 40 years, the mean temperature has been
51 degrees with a standard deviation of 2 degrees. What can we
conclude?

We have

H0: μ = 50

H1: μ ≠ 50

We compute the z score:

z = (51 − 50)/(2/√40) = 3.16

The table gives us .9992, so that

p = (1 - .9992)(2) = .0016

Since

.0016 < .05



we can conclude that there has been a change in temperature.
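The p-value computation, scripted with the normal CDF instead of the table:

```python
from statistics import NormalDist
from math import sqrt

def two_tailed_p(xbar, mu0, s, n):
    """Two-tailed p-value for a one-sample z test."""
    z = (xbar - mu0) / (s / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_tailed_p(51, 50, 2, 40)
print(round(p, 4))   # 0.0016
```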

Note that small p-values will result in a rejection of H0 and large
p-values will result in failing to reject H0.

Hypothesis Testing for a Proportion and for Small Samples

Small Sample Hypothesis Tests For a Normal population

When we have a small sample from a normal population, we use
the same method as for a large sample, except we use the t statistic
instead of the z-statistic. Hence, we need to find the degrees of
freedom (n - 1) and use the t-table in the back of the book.

Example

Is the temperature required to damage a computer on the average
less than 110 degrees? Because of the price of testing, twenty
computers were tested to see what minimum temperature will
damage the computer. The damaging temperature averaged 109
degrees with a standard deviation of 3 degrees. (Use α = .05.)

We test the hypothesis

H0: μ = 110

H1: μ < 110

We compute the t statistic:

t = (109 − 110)/(3/√20) = −1.49

This is a one-tailed test, so we can go to our t-table with 19
degrees of freedom to find that

tc = 1.73

Since

-1.49 > -1.73



We see that the test statistic does not fall in the critical region. We
fail to reject the null hypothesis and conclude that there is
insufficient evidence to suggest that the temperature required to
damage a computer is, on the average, less than 110 degrees.

Hypothesis Testing for a Population Proportion

We have seen how to conduct hypothesis tests for a mean. We
now turn to proportions. The process is completely analogous,
although we will need to use the standard deviation formula for a
proportion.

Example

Suppose that you interview 1000 exiting voters about who they
voted for governor. Of the 1000 voters, 550 reported that they
voted for the democratic candidate. Is there sufficient evidence to
suggest that the democratic candidate will win the election at the
.01 level?

H0: p =.5

H1: p>.5

Since it is a large sample, we can use the central limit theorem to say
that the distribution of proportions is approximately normal. We
compute the test statistic:

z = (.55 − .5)/√(.5 · .5/1000) = 3.16

Notice that in this formula, we have used the hypothesized
proportion rather than the sample proportion. This is because if the
null hypothesis is correct, then .5 is the true proportion and we are
not making any approximations. We compute the rejection region
using the z-table. We find that zc = 2.33.

The picture shows us that 3.16 is in the rejection region. Therefore
we reject H0, so we can conclude that the democratic candidate will
win with a p-value of .0008.
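A sketch of the proportion test, using the hypothesized p0 in the standard error as the text explains:

```python
from statistics import NormalDist
from math import sqrt

def z_test_proportion(successes, n, p0):
    """z statistic for a proportion test; p0 is used in the standard error."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z = z_test_proportion(550, 1000, 0.5)
p_value = 1 - NormalDist().cdf(z)        # one-tailed p-value
print(round(z, 2), round(p_value, 4))    # 3.16 0.0008
```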

Example

1500 randomly selected pine trees were tested for traces of the
Bark Beetle infestation. It was found that 153 of the trees showed
such traces. Test the hypothesis that more than 10% of the Tahoe
trees have been infested. (Use a 5% level of significance.)

Solution

The hypothesis is

H0: p = .1

H1: p > .1

We have that

p = 153/1500 = 0.102

Next we compute the z-score:

z = (0.102 − 0.1)/√(0.1 · 0.9/1500) ≈ 0.26

Since we are using a 95% level of significance with a one-tailed
test, we have zc = 1.645. The rejection region is shown in the
picture. We see that 0.26 does not lie in the rejection region, hence
we fail to reject the null hypothesis. We say that there is insufficient
evidence to make a conclusion about the percentage of infested
pines being greater than 10%.

Exercises

A. If 40% of the nation is registered republican, does the
Tahoe environment reflect the national proportion? Test the
hypothesis that Tahoe residents differ from the rest of the
nation in their affiliation, if of 200 locals surveyed, 75 are
registered republican.

B. If 10% of California residents are vegetarians, test the
hypothesis that people who gamble are less likely to be
vegetarians. Of the 120 people polled, 10 claimed to be
vegetarian.

Difference Between Means

Hypothesis Testing of the Difference Between Two Means

Do employees perform better at work with music playing? The
music was turned on during the working hours of a business with
45 employees. Their productivity level averaged 5.2 with a
standard deviation of 2.4. On a different day the music was turned
off and there were 40 workers. The workers' productivity level
averaged 4.8 with a standard deviation of 1.2. What can we
conclude at the .05 level?

Solution

We first develop the hypotheses

H0: μ1 - μ2 = 0

H1: μ1 - μ2 > 0

Next we need to find the standard deviation. Recall from before,
we had that the mean of the difference is

μx = μ1 - μ2

and the standard deviation is

σx = √(σ1²/n1 + σ2²/n2)

We can substitute the sample means and sample standard
deviations for a point estimate of the population means and
standard deviations. We have

s = √(2.4²/45 + 1.2²/40) ≈ 0.405

Now we can calculate the z-score. We have

z = 0.4/0.405 = 0.988

Since this is a one tailed test, the critical value is 1.645 and 0.988
does not lie in the critical region. We fail to reject the null
hypothesis and conclude that there is insufficient evidence to
conclude that workers perform better at work when the music is on.
Using the P-Value technique, we see that the P-value associated
with 0.988 is

P = 1 - 0.8389 = 0.1611

which is larger than 0.05. Yet another way of seeing that we fail to
reject the null hypothesis.

Note: It would have been slightly more accurate had we used the
t-table instead of the z-table. To calculate the degrees of freedom,
we can take the smaller of the two numbers n1 - 1 and n2 - 1. So in
this example, a better estimate would use 39 degrees of freedom.
The t-table gives a value of 1.690 for the t.95 value. Notice that
0.988 is still smaller than 1.690 and the result is the same. This is
an example that demonstrates that using the t-table and z-table for
large samples results in practically the same results.
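The two-sample z statistic used here can be sketched directly:

```python
from math import sqrt

def z_test_two_means(x1, s1, n1, x2, s2, n2):
    """Two-sample z statistic for H0: mu1 - mu2 = 0."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1 - x2) / se

z = z_test_two_means(5.2, 2.4, 45, 4.8, 1.2, 40)
print(round(z, 3))   # 0.988
```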

Hypothesis Testing For a Difference Between Means for Small
Samples Using Pooled Standard Deviations (Optional)

Recall that for small samples we need to make the following
assumptions:

1. Random unbiased sample.

2. Both population distributions are normal.



3. The two standard deviations are equal.

If we know σ, then the sampling standard deviation is:

σ √(1/n1 + 1/n2)

If we do not know σ, then we use the pooled standard deviation.

Putting this together with hypothesis testing, we can find the
t-statistic:

t = (x1 − x2)/(sp √(1/n1 + 1/n2))

and use n1 + n2 - 2 degrees of freedom.

Example

Nine dogs and ten cats were tested to determine if there is a
difference in the average number of days that the animal can
survive without food. The dogs averaged 11 days with a standard
deviation of 2 days while the cats averaged 12 days with a standard
deviation of 3 days. What can be concluded? (Use α = .05.)

Solution

We write:

H0: μdog - μcat = 0

H1: μdog - μcat ≠ 0

We have:

n1 = 9, n2 = 10

x1 = 11, x2 = 12

s1 = 2, s2 = 3

so that

sp = √((8 · 2² + 9 · 3²)/17) ≈ 2.58

and

t = (11 − 12)/(2.58 √(1/9 + 1/10)) ≈ −0.84
The t-critical value corresponding to α = .05 with 10 + 9 - 2 = 17
degrees of freedom is 2.11, which is greater than .84. Hence we fail
to reject the null hypothesis and conclude that there is not sufficient
evidence to suggest that there is a difference between the mean
starvation time for cats and dogs.

Hypothesis Testing for a Difference Between Proportions

Inferences on the Difference Between Population Proportions

If two samples are counted independently of each other, we use the
test statistic:

z = (p1 − p2)/√(pq/n1 + pq/n2)

where

p = (r1 + r2)/(n1 + n2)

and

q = 1 - p

Example

Is the severity of the drug problem in high school the same for boys
and girls? 85 boys and 70 girls were questioned and 34 of the boys
and 14 of the girls admitted to having tried some sort of drug. What
can be concluded at the .05 level?

Solution

The hypotheses are

H0: p1 - p2 = 0

H1: p1 - p2 ≠ 0

We have

p1 = 34/85 = 0.4 p2 = 14/70 = 0.2

p = 48/155 = 0.31 q = 0.69


Now compute the z-score:

z = (0.4 − 0.2)/√((0.31)(0.69)(1/85 + 1/70)) ≈ 2.68

Since we are using a significance level of .05 and it is a two tailed


test, the critical value is 1.96. Clearly 2.68 is in the critical region,
hence we can reject the null hypothesis and accept the alternative
hypothesis and conclude that gender does make a difference for
drug use. Notice that the P-value,

P = 1 - .9963 = 0.0037,

is less than .05, which is yet another way to see that we reject the
null hypothesis.
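As a sketch (the function name is ours), the two-proportion z computation in Python:

```python
import math

def two_proportion_z(r1, n1, r2, n2):
    p1, p2 = r1 / n1, r2 / n2
    p = (r1 + r2) / (n1 + n2)  # pooled proportion under H0: p1 = p2
    q = 1 - p
    se = math.sqrt(p * q * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Drug-use example: 34 of 85 boys, 14 of 70 girls
z = two_proportion_z(34, 85, 14, 70)
print(round(z, 2))  # 2.68 -> beyond the 1.96 cutoff, reject H0
```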

Paired Differences

Paired Data: Hypothesis Tests

Example

Is success determined by genetics?

The best such survey is one that investigates identical twins who
have been reared in two different environments, one that is
nurturing and one that is non-nurturing. We could measure the
difference in high school GPAs between each pair. This is better
than just pooling each group individually. Our hypotheses are

H0: μd = 0

H1: μd > 0

where μd is the mean of the differences between the matched
pairs.

We use the test statistic

t = xd / ( sd / sqrt(n) )

where xd is the mean difference and sd is the standard deviation of
the differences.

For a small sample we use n - 1 degrees of freedom, where n is the


number of pairs.

Paired Differences: Confidence Intervals

To construct a confidence interval for the difference of the means


we use:

xd ± t · sd / sqrt(n)

Example

Suppose that ten identical twins were reared apart and the mean
difference between the high school GPA of the twin brought up in
wealth and the twin brought up in poverty was 0.07. If the standard
deviation of the differences was 0.5, find a 95% confidence interval
for the difference.

Solution

We compute

0.07 ± 2.262 · 0.5 / sqrt(10) = 0.07 ± 0.36

or

[-0.29, 0.43]

We are 95% confident that the mean


difference in GPA is between -0.29 and 0.43.
Notice that 0 falls in this interval, hence we
would fail to reject the null hypothesis at the
0.05 level.
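The interval computation can be sketched in a few lines of Python; the critical value 2.262 is assumed to come from a t-table (95% two-tailed, 9 degrees of freedom), and the function name is ours.

```python
import math

def paired_ci(mean_d, s_d, n, t_crit):
    # x̄d ± t · sd / sqrt(n); t_crit from a t-table on n - 1 df
    margin = t_crit * s_d / math.sqrt(n)
    return mean_d - margin, mean_d + margin

# Twins example: mean difference 0.07, sd 0.5, n = 10 pairs,
# t(.025, 9 df) = 2.262 for 95% confidence
lo, hi = paired_ci(0.07, 0.5, 10, 2.262)
print(round(lo, 2), round(hi, 2))  # -0.29 0.43; 0 lies inside the interval
```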

Chi-Square Test

Chi-square is a statistical test commonly used to compare observed


data with data we would expect to obtain according to a specific
hypothesis. For example, if, according to Mendel's laws, you
expected 10 of 20 offspring from a cross to be male and the actual
observed number was 8 males, then you might want to know about
the "goodness of fit" between the observed and expected. Were the
deviations (differences between observed and expected) the result
of chance, or were they due to other factors? How much deviation
can occur before you, the investigator, must conclude that
something other than chance is at work, causing the observed to
differ from the expected? The chi-square test is always testing what
scientists call the null hypothesis, which states that there is no
significant difference between the expected and observed result.

The formula for calculating chi-square (χ²) is:

χ² = Σ (o - e)² / e

That is, chi-square is the sum of the squared difference between


observed (o) and the expected (e) data (or the deviation, d), divided
by the expected data in all possible categories.

For example, suppose that a cross between two pea plants yields a
population of 880 plants, 639 with green seeds and 241 with yellow
seeds. You are asked to propose the genotypes of the parents.
Your hypothesis is that the allele for green is dominant to the allele
for yellow and that the parent plants were both heterozygous for
this trait. If your hypothesis is true, then the predicted ratio of
offspring from this cross would be 3:1 (based on Mendel's laws) as
predicted from the results of the Punnett square (Figure B.1).

[Figure B.1. Punnett square: predicted offspring from a cross between
green- and yellow-seeded plants. Green (G) is dominant (3/4 green;
1/4 yellow).]

To calculate χ², first determine the number expected in each


category. If the ratio is 3:1 and the total number of observed
individuals is 880, then the expected numerical values should be
660 green and 220 yellow.

Chi-square requires that you use numerical values, not


percentages or ratios.

Then calculate χ² using this formula, as shown in Table B.1. Note
that we get a value of 2.668 for χ². But what does this number
mean? Here's how to interpret the χ² value:

1. Determine degrees of freedom (df). Degrees of freedom can be


calculated as the number of categories in the problem minus 1. In
our example, there are two categories (green and yellow);
therefore, there is 1 degree of freedom.

2. Determine a relative standard to serve as the basis for accepting


or rejecting the hypothesis. The relative standard commonly used in
biological research is p > 0.05. The p value is the probability that
the deviation of the observed from that expected is due to chance
alone (no other forces acting). In this case, using p > 0.05, you
would expect any deviation to be due to chance alone 5% of the
time or less.

3. Refer to a chi-square distribution table (Table B.2). Using the


appropriate degrees of freedom, locate the value closest to your
calculated chi-square in the table. Determine the closest p
(probability) value associated with your chi-square and degrees of
freedom. In this case (χ² = 2.668), the p value is about 0.10, which
means that there is a 10% probability that any deviation from
expected results is due to chance only. Based on our standard p >
0.05, this is within the range of acceptable deviation. In terms of
your hypothesis for this example, the observed chi-square is not
significantly different from expected. The observed numbers are
consistent with those expected under Mendel's law.
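A minimal sketch of this computation in Python (the function name is ours):

```python
def chi_square(observed, expected):
    # χ² = sum of (o - e)² / e over all categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 880 offspring, expected 3:1 green:yellow -> 660 and 220
x2 = chi_square([639, 241], [660, 220])
print(round(x2, 3))  # 2.673 (the text rounds term-by-term to get 2.668)
```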

Step-by-Step Procedure for Testing Your Hypothesis and


Calculating Chi-Square

1. State the hypothesis being tested and the predicted results.


Gather the data by conducting the proper experiment (or, if working
genetics problems, use the data provided in the problem).

2. Determine the expected numbers for each observational class.


Remember to use numbers, not percentages.

Chi-square should not be calculated if the expected value in


any category is less than 5.

3. Calculate χ² using the formula. Complete all calculations to three


significant digits. Round off your answer to two significant digits.

4. Use the chi-square distribution table to determine significance of


the value.

a. Determine degrees of freedom and locate the value in the
appropriate column.
b. Locate the value closest to your calculated χ² on that
degrees of freedom (df) row.
c. Move up the column to determine the p value.

5. State your conclusion in terms of your hypothesis.

a. If the p value for the calculated χ² is p > 0.05, accept your
hypothesis. The deviation is small enough that chance alone
accounts for it. A p value of 0.6, for example, means that
there is a 60% probability that any deviation from expected is
due to chance only. This is within the range of acceptable
deviation.
b. If the p value for the calculated χ² is p < 0.05, reject your
hypothesis, and conclude that some factor other than
chance is operating for the deviation to be so great. For
example, a p value of 0.01 means that there is only a 1%
chance that this deviation is due to chance alone. Therefore,
other factors must be involved.

The chi-square test will be used to test for the "goodness of fit"
between observed and expected data from several laboratory
investigations in this lab manual.

Table B.1
Calculating Chi-Square

                      Green    Yellow
Observed (o)          639      241
Expected (e)          660      220
Deviation (o - e)     -21       21
Deviation² (d²)       441      441
d²/e                  0.668    2.000

χ² = Σ d²/e = 2.668

Table B.2
Chi-Square Distribution

Degrees of
Freedom                        Probability (p)
(df)   0.95  0.90  0.80  0.70  0.50  0.30  0.20  0.10  0.05  0.01  0.001
1     0.004  0.02  0.06  0.15  0.46  1.07  1.64  2.71  3.84  6.64  10.83
2      0.10  0.21  0.45  0.71  1.39  2.41  3.22  4.60  5.99  9.21  13.82
3      0.35  0.58  1.01  1.42  2.37  3.66  4.64  6.25  7.82 11.34  16.27
4      0.71  1.06  1.65  2.20  3.36  4.88  5.99  7.78  9.49 13.28  18.47
5      1.14  1.61  2.34  3.00  4.35  6.06  7.29  9.24 11.07 15.09  20.52
6      1.63  2.20  3.07  3.83  5.35  7.23  8.56 10.64 12.59 16.81  22.46
7      2.17  2.83  3.82  4.67  6.35  8.38  9.80 12.02 14.07 18.48  24.32
8      2.73  3.49  4.59  5.53  7.34  9.52 11.03 13.36 15.51 20.09  26.12
9      3.32  4.17  5.38  6.39  8.34 10.66 12.24 14.68 16.92 21.67  27.88
10     3.94  4.86  6.18  7.27  9.34 11.78 13.44 15.99 18.31 23.21  29.59
       (values left of the 0.05 column: nonsignificant; 0.05 and
       beyond: significant)

Source: R.A. Fisher and F. Yates, Statistical Tables for Biological


Agricultural and Medical Research, 6th ed., Table IV, Oliver &
Boyd, Ltd., Edinburgh, by permission of the authors and publishers.

Chi-square test

Purpose:

The chi-square test (Snedecor and Cochran, 1989) is used to test if


a sample of data came from a population with a specific
distribution.

An attractive feature of the chi-square goodness-of-fit test is that it


can be applied to any univariate distribution for which you can
calculate the cumulative distribution function. The chi-square
goodness-of-fit test is applied to binned data (i.e., data put into
classes). This is actually not a restriction since for non-binned data
you can simply calculate a histogram or frequency table before
generating the chi-square test. However, the value of the chi-
square test statistic is dependent on how the data are binned.
Another disadvantage of the chi-square test is that it requires a
sufficient sample size in order for the chi-square approximation to
be valid.

The chi-square test is an alternative to the Anderson-Darling and


Kolmogorov-Smirnov goodness-of-fit tests. The chi-square
goodness-of-fit test can be applied to discrete distributions such as
the binomial and the Poisson. The Kolmogorov-Smirnov and
Anderson-Darling tests are restricted to continuous distributions.

The chi-square test is defined for the hypothesis:


H0: The data follow a specified distribution.
Ha: The data do not follow the specified distribution.
Test statistic:

For the chi-square goodness-of-fit computation, the data are
divided into k bins and the test statistic is defined as

χ² = Σᵢ (Oᵢ - Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency for bin i and Eᵢ is the expected
frequency for bin i. The expected frequency is calculated by

Eᵢ = N · ( F(Yu) - F(Yl) )

where F is the cumulative distribution function for the distribution
being tested, Yu is the upper limit for class i, Yl is the lower limit for
class i, and N is the sample size.
This test is sensitive to the choice of bins. There is no optimal
choice for the bin width (since the optimal bin width depends on the
distribution). Most reasonable choices should produce similar, but
not identical, results. Dataplot uses 0.3*s, where s is the sample
standard deviation, for the class width. The lower and upper bins
are at the sample mean plus and minus 6.0*s, respectively. For the
chi-square approximation to be valid, the expected frequency
should be at least 5. This test is not valid for small samples, and if
some of the counts are less than five, you may need to combine
some bins in the tails.
Significance level: α
Critical region:
The test statistic follows, approximately, a chi-square distribution
with (k - c) degrees of freedom where k is the number of non-empty
cells and c = the number of estimated parameters (including
location and scale parameters and shape parameters) for the
distribution + 1. For example, for a 3-parameter Weibull distribution,
c = 4.

Therefore, the hypothesis that the data are from a population with
the specified distribution is rejected if

χ² > χ²(1 - α, k - c)

where χ²(1 - α, k - c) is the chi-square percent point function with
k - c degrees of freedom and a significance level of α.
In the above formulas for the critical regions, the Handbook follows
the convention that χ²(1 - α) is the upper critical value from the chi-
square distribution and χ²(α) is the lower critical value from the chi-
square distribution. Note that this is the opposite of what is used in
some texts and software programs. In particular, Dataplot uses the
opposite convention.
The chi-square test can be used to answer the following types of
questions:

 Are the data from a normal distribution?


 Are the data from a log-normal distribution?

 Are the data from a Weibull distribution?


 Are the data from an exponential distribution?
 Are the data from a logistic distribution?

 Are the data from a binomial distribution?


Importance:
Many statistical tests and procedures are based on specific
distributional assumptions. The assumption of normality is
particularly common in classical statistical tests. Much reliability
modeling is based on the assumption that the distribution of the
data follows a Weibull distribution.

There are many non-parametric and robust techniques that are not
based on strong distributional assumptions. By non-parametric, we
mean a technique, such as the sign test, that is not based on a
specific distributional assumption. By robust, we mean a statistical
technique that performs well under a wide range of distributional
assumptions. However, techniques based on specific distributional
assumptions are in general more powerful than these non-
parametric and robust techniques. By power, we mean the ability to
detect a difference when that difference actually exists. Therefore, if
the distributional assumption can be confirmed, the parametric
techniques are generally preferred.

If you are using a technique that makes a normality (or some other
type of distributional) assumption, it is important to confirm that this
assumption is in fact justified. If it is, the more powerful parametric
techniques can be used. If the distributional assumption is not
justified, a non-parametric or robust technique may be required.

Example

The chi-square statistic for the above example is computed as


follows:
X² = (49 - 46.1)²/46.1 + (50 - 54.2)²/54.2 + (69 - 67.7)²/67.7 + .... +
(28 - 27.8)²/27.8
= 0.18 + 0.33 + 0.03 + .... + 0.01
= 1.51
The degrees of freedom are equal to (3-1)(3-1) = 2*2 = 4, so we are
interested in the probability P(χ² > 1.51) = 0.8244 on 4 degrees of
freedom. This indicates that there is no association between the
choice of most important factor and the grade of the student -- the
difference between observed and expected values under the null
hypothesis is negligible.

Examples of statistical tests used to analyze some basic


experiments

Click on the name of the test to see an example, or scroll down to


the examples given below the table. This Table is modified from
Motulsky, H., Intuitive Biostatistics. Oxford University Press, New
York, 1995, p. 298.

For each comparison, the table lists the test to use when your data
are normally distributed, when they are not normally distributed (or
are ranks or scores), and when they are Binomial (possess 2
possible values).

You are studying one set of data:
  Normal: find the mean, standard deviation
  Non-normal / ranks: find the median, interquartile range (Q3 - Q1)
  Binomial: calculate a proportion

Compare one set of data to a hypothetical value:
  Normal: run a one-sample t-test
  Non-normal / ranks: run a Wilcoxon Test
  Binomial: run a χ² (chi-square) test

Compare 2 sets of independently-collected data:
  Normal: run a 2-sample t-test
  Non-normal / ranks: run a Mann-Whitney Test
  Binomial: run a Fisher test, or a χ² (chi-square) test

Compare 2 sets of data from the same subjects under different
circumstances:
  Normal: run a t-test on the differences between the data values
  (a matched-pairs t-test)
  Non-normal / ranks: run a Wilcoxon Test
  Binomial: run a McNemar's test

Compare 3 or more sets of data:
  Normal: run a one-way ANOVA test
  Non-normal / ranks: run a Kruskal-Wallis test
  Binomial: run a chi-square test

Look for a relationship between 2 variables:
  Normal: calculate the Pearson Correlation coefficient
  Non-normal / ranks: calculate the Spearman Correlation coefficient
  Binomial: calculate Contingency Correlation coefficients

Look for a linear relationship between 2 variables:
  Normal: run a linear regression
  Non-normal / ranks: run a nonparametric linear regression
  Binomial: run a simple logistic regression

Look for a non-linear relationship between 2 variables:
  Normal: run a power, exponential, or quadratic regression
  Non-normal / ranks: run a nonparametric power, exponential, or
  quadratic regression

Look for linear relationships between 1 dependent variable and 2 or
more independent variables:
  Normal: run a multiple linear regression
  Binomial: run a multiple logistic regression

See your teacher for specific details for analyses required in your
particular class.

1. You read of a survey that claims that the average teenager


watches 25 hours of TV a week and you want to check whether or
not this is true in your school (too simple a project!).

Predicted value of the variable: the predicted 25 hours of
TV

Variable under study: actual hours of TV watched

Statistical test you would use: t-test

Use this test to compare the mean values (averages) of one set of
data to a predicted mean value.

2. You grow 20 radish plants in pH=10.0 water and 20 plants in


pH=3.0 water and measure the final mass of the leaves of the
plants (too simple an experiment!) to see if they grew better in one
fluid than in the other fluid.

Independent variable: pH of the fluid in which the plants were


grown

Dependent variable: plant biomass

Statistical test you would use: 2-sample t-test

Use this test to compare the mean values (averages) of two sets
of data.

A Mann-Whitney test is a 2-sample t-test that is run on data that


are given rank numbers, rather than quantitative values. For
example, You want to compare the overall letter-grade GPA of
students in one class with the overall letter-grade GPA of students
in another class. You rank the data from low to high according to
the letter grade (here, A = 1, B = 2, C = 3, D = 4, E =5 might be
your rankings; you could also have set A = 5, B = 4, ...).

3. You give a math test to a group of students. Afterwards you tell


half of the students a method of analyzing the problems, then re-test
all the students to see if use of the method led to improved test
scores.

Independent variable: test-taking method (your method vs. no


imparted method)

Dependent variable: (test scores after method - test scores before


method)

Statistical test you would use: matched-pairs t-test

Use this test to compare data from the same subjects under two
different conditions.

4. You grow radish plants given pesticide-free water every other


day, radish plants given a 5% pesticide solution every other day,
and radish plants given a 10% pesticide solution every other day,
then measure the biomass of the plants after 30 days to find
whether there was any difference in plant growth among the three
groups of plants.

Independent variable: pesticide dilution

Dependent variable: plant biomass

Statistical test you would use: ANOVA

Use this test to compare the mean values (averages) of more than
two sets of data where there is more than one independent
variable but only one dependent variable. If you find that your
data differ significantly, this says only that at least two of the data

sets differ from one another, not that all of your tested data sets
differ from one another.

If your ANOVA test indicates that there is a statistical


difference in your data, you should also run Bonferroni paired t-
tests to see which independent variables produce significantly
different results. This test essentially penalizes you more and more
as you add more and more independent variables, making it more
difficult to reject the null hypothesis than if you had tested fewer
independent variables.

One assumption in the ANOVA test is that your data are normally-
distributed (plot as a bell curve, approximately). If this is not true,
you must use the Kruskal-Wallis test below.
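To make the ANOVA computation concrete, here is a small sketch. The biomass numbers are invented for illustration only; the resulting F statistic would still be compared against an F-table, and Bonferroni paired t-tests run afterwards, as described above.

```python
def one_way_anova_f(*groups):
    # F = (between-group mean square) / (within-group mean square)
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented plant-biomass data (grams) for the three pesticide groups
control = [4.1, 3.9, 4.5, 4.2]
five_pct = [3.6, 3.4, 3.8, 3.5]
ten_pct = [2.9, 3.1, 2.8, 3.0]
f = one_way_anova_f(control, five_pct, ten_pct)
print(round(f, 1))  # a large F on (2, 9) df -> the group means differ
```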

5. You ask children, teens, and adults to rate their response to a


set of statements, where 1 = strongly agree with the statement, 2 =
agree with the statement, 3 = no opinion, 4 = disagree with the
statement, 5 = strongly disagree with the statement, and you want
to see if the answers are dependent on the age group of the tested
subjects.

Independent variables: age groups of subject

Dependent variable: responses of members of those age groups to


your statements

Statistical test you would use: Kruskal-Wallis Test. Use this test
to compare the mean values (averages) of more than two sets of
data where the data are chosen from some limited set of values or
if your data otherwise don't form a normal (bell-curve) distribution.
This example could also be done using a two-way chi-square
test.

An example of the Kruskal-Wallis Test for non-normal data is: You


compare scores of students on Math and English tests under
different circumstances: no music playing, Mozart playing, rock
music playing. When you score the tests, you find in at least one
case that the average score is a 95 and the data do not form a
bell-shaped curve because there are no scores above 100, many
scores in the 90s, a few in the 80s, and fewer still in the 70s, for
example.

Independent variables: type of background music

Dependent variable: score on the tests , with at least one set of


scores not normally-distributed

6. You think that student grades are dependent on the number of


hours a week students study. You collect letter grades from
students and the number of hours each student studies in a week.

Independent variables: hours studied

Dependent variable: letter grade in a specific class

Statistical test you would use: Wilcoxon Signed Rank Test. Use
this test to compare the mean values (averages) of two sets of
data, or the mean value of one data set to a hypothetical mean,
where the data are ranked from low to high (here, A = 1, B = 2, C =
3, D = 4, E =5 might be your rankings; you could also have set A =
5, B = 4, ...).

7. You ask subjects to rate their response to a set of statements


that are provided with a set of possible responses such as: strongly
agree with the statement, agree with the statement, no opinion,
disagree with the statement, strongly disagree with the statement.

Independent variable: each statement asked

Dependent variable: response to each statement

Statistical test you would use: χ² (chi-square) test (the 'chi' is
pronounced like the 'chi' in 'chiropractor') for within-age-group
variations.

For this test, typically, you assume that all choices are equally likely
and test to find whether this assumption was true. You would
assume that, for 50 subjects tested, 10 chose each of the five
options listed in the example above. In this case, your observed
values (O) would be the number of subjects who chose each
response, and your expected values (E) would be 10.

The chi-square statistic is the sum of: (Observed value - Expected
value)² / Expected value

Use this test when your data consist of a limited number of possible
values that your data can have. Example 2: you ask subjects which
toy they like best from a group of toys that are identical except that
they come in several different colors. Independent variable: toy
color; dependent variable: toy choice.

McNemar's test is used when you are comparing some aspect of


the subject with that subject's response (i.e., answer to the survey
compared to whether or not the student went to a particular middle
school). McNemar's test is basically the same as a chi-square test
in calculation and interpretation.

8. You look for a relationship between the size of a letter that a


subject can read at a distance of 5 meters and the score that the
subject achieves in a game of darts (having had them write down
their experience previously at playing darts).

Independent variable #1: vision-test result (letter size)

Independent variable #2: darts score

Statistical test you would use: Correlation (statistics: r² and r)

Use this statistic to identify whether changes in one independent


variable are matched by changes in a second independent variable.
Notice that you didn't change any conditions of the test, you only
made two separate sets of measurements

9. You load weights on four different lengths of the same type and
cross-sectional area of wood to see if the maximum weight a piece
of the wood can hold is directly dependent on the length of the
wood.

Independent variable: length of wood

Dependent variable: weight that causes the wood to break

Statistical test you would use: Linear regression (statistics: r²
and r)

Fit a line to data having only one independent variable and one
dependent variable.
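A simple linear regression can be computed by hand with the least-squares formulas. The sketch below uses invented breaking-load data for the wood example; the function name and the data are ours.

```python
def linear_fit(xs, ys):
    # Least-squares fit of y = a + b*x, plus r² (variance explained)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2

# Invented breaking loads (kg) for four lengths (cm) of the same wood
lengths = [20, 40, 60, 80]
loads = [80, 42, 28, 20]
a, b, r2 = linear_fit(lengths, loads)
print(round(a, 1), round(b, 2), round(r2, 2))  # 91.0 -0.97 0.89
```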

10. You load weights on four different lengths and four different
thicknesses of the same type of wood to see if the maximum weight
a piece of the wood can hold is directly dependent on the length
and thickness of the wood, and to find which is more important,
length or weight.

Independent variables: length of wood, weight of wood

Dependent variable: weight that causes the wood to break

Statistical test you would use: Multiple Linear regression
(statistics: r² and r)

Fit a line to data having two or more independent variables and one
dependent variable.

11. You load weights on strips of plastic trash bags to find how
much the plastic stretches from each weight. Research that you do
indicates that plastics stretch more and more as the weight placed
on them increases; therefore the data do not plot along a straight
line.

Independent variables: weight loaded on the plastic strip

Dependent variable: length of the plastic strip

Statistical test you would use: Power regression of the form y =
a·xᵇ, or Exponential regression of the form y = a·bˣ, or
Quadratic regression of the form y = a + bx + cx² (statistics: r²
and r)

Fit a curve to data having only one independent variable and one
dependent variable.

There are numerous polynomial regressions of this form, found on


the STAT:CALC menu of your graphing calculator.

Paired Sample t-test

A paired sample t-test is used to determine whether there is a


significant difference between the average values of the same
measurement made under two different conditions. Both
measurements are made on each unit in a sample, and the test is
based on the paired differences between these two values. The
usual null hypothesis is that the difference in the mean values is
zero. For example, the yield of two strains of barley is measured in
successive years in twenty different plots of agricultural land (the
units) to investigate whether one crop gives a significantly greater
yield than the other, on average.

The null hypothesis for the paired sample t-test is


H0: μd = µ1 - µ2 = 0
where μd is the mean value of the difference.
This null hypothesis is tested against one of the following
alternative hypotheses, depending on the question posed:
H1: μd ≠ 0
H1: μd > 0
H1: μd < 0

The paired sample t-test is a more powerful alternative to a two


sample procedure, such as the two sample t-test, but can only be
used when we have matched samples.

Student's t Distribution

According to the central limit theorem, the sampling distribution of a


statistic (like a sample mean) will follow a normal distribution, as
long as the sample size is sufficiently large. Therefore, when we
know the standard deviation of the population, we can compute a z-
score, and use the normal distribution to evaluate probabilities with
the sample mean.

But sample sizes are sometimes small, and often we do not know
the standard deviation of the population. When either of these
problems occur, statisticians rely on the distribution of the t
statistic (also known as the t score), whose values are given by:

t = [ x - μ ] / [ s / sqrt( n ) ]

where x is the sample mean, μ is the population mean, s is the


standard deviation of the sample, and n is the sample size. The
distribution of the t statistic is called the t distribution or the
Student t distribution.

Degrees of Freedom

There are actually many different t distributions. The particular form


of the t distribution is determined by its degrees of freedom. The
degrees of freedom refers to the number of independent
observations in a set of data.

When estimating a mean score or a proportion from a single


sample, the number of independent observations is equal to the
sample size minus one. Hence, the distribution of the t statistic from
samples of size 8 would be described by a t distribution having 8 - 1
or 7 degrees of freedom. Similarly, a t distribution having 15
degrees of freedom would be used with a sample of size 16.

For other applications, the degrees of freedom may be calculated


differently. We will describe those computations as they come up.

Properties of the t Distribution

The t distribution has the following properties:

 The mean of the distribution is equal to 0 .


 The variance is equal to v / ( v - 2 ), where v is the degrees
of freedom (see last section) and v > 2.
 The variance is always greater than 1, although it is close to
1 when there are many degrees of freedom. With infinite
degrees of freedom, the t distribution is the same as the
standard normal distribution.

When to Use the t Distribution

The t distribution can be used with any statistic having a bell-


shaped distribution (i.e., approximately normal). The central limit
theorem states that the sampling distribution of a statistic will be
normal or nearly normal, if any of the following conditions apply.

 The population distribution is normal.


 The sampling distribution is symmetric, unimodal, without
outliers, and the sample size is 15 or less.
 The sampling distribution is moderately skewed, unimodal,
without outliers, and the sample size is between 16 and 40.
 The sample size is greater than 40, without outliers.

The t distribution should not be used with small samples from


populations that are not approximately normal.

Probability and the Student t Distribution

When a sample of size n is drawn from a population having a


normal (or nearly normal) distribution, the sample mean can be
transformed into a t score, using the equation presented at the
beginning of this lesson. We repeat that equation below:

t = [ x - μ ] / [ s / sqrt( n ) ]

where x is the sample mean, μ is the population mean, s is the


standard deviation of the sample, n is the sample size, and degrees
of freedom are equal to n - 1.

The t score produced by this transformation can be associated with


a unique cumulative probability. This cumulative probability
represents the likelihood of finding a sample mean less than or
equal to x, given a random sample of size n.

The easiest way to find the probability associated with a particular t


score is to use the T Distribution Calculator, a free tool provided by
Stat Trek.

Notation and t Scores

Statisticians use tα to represent the t-score that has a cumulative


probability of (1 - α). For example, suppose we were interested in
the t-score having a cumulative probability of 0.95. In this example,
α would be equal to (1 - 0.95) or 0.05. We would refer to the t-score
as t0.05.

Of course, the value of t0.05 depends on the number of degrees of


freedom. For example, with 2 degrees of freedom, t0.05 is equal
to 2.92; but with 20 degrees of freedom, t0.05 is equal to 1.725.

Note: Because the t distribution is symmetric about a mean of zero,


the following is true.

tα = -t(1 - α)   and   t(1 - α) = -tα

Thus, if t0.05 = 2.92, then t0.95 = -2.92.

T Distribution Calculator

The T Distribution Calculator solves common statistics problems,


based on the t distribution. The calculator computes cumulative
probabilities, based on simple inputs. Clear instructions guide you
to an accurate solution, quickly and easily. If anything is unclear,
frequently-asked questions and sample problems provide
straightforward explanations. The calculator is free. It can be found
under the Stat Tables menu item, which appears in the header of
every Stat Trek web page.


Test Your Understanding of This Lesson

Problem 1

Acme Corporation manufactures light bulbs. The CEO claims that


an average Acme light bulb lasts 300 days. A researcher randomly
selects 15 bulbs for testing. The sampled bulbs last an average of
290 days, with a standard deviation of 50 days. If the CEO's claim
were true, what is the probability that 15 randomly selected bulbs
would have an average life of no more than 290 days?

Note: There are two ways to solve this problem, using the T
Distribution Calculator. Both approaches are presented below.
Solution A is the traditional approach. It requires you to compute
the t score, based on data presented in the problem description.
Then, you use the T Distribution Calculator to find the probability.
Solution B is easier. You simply enter the problem data into the T
Distribution Calculator. The calculator computes a t score "behind
the scenes", and displays the probability. Both approaches come
up with exactly the same answer.

Solution A

The first thing we need to do is compute the t score, based on the


following equation:

t = [ x - μ ] / [ s / sqrt( n ) ]
t = ( 290 - 300 ) / [ 50 / sqrt( 15) ] = -10 / 12.909945 = - 0.7745966

where x is the sample mean, μ is the population mean, s is the


standard deviation of the sample, and n is the sample size.

Now, we are ready to use the T Distribution Calculator. Since we


know the t score, we select "T score" from the Random Variable
dropdown box. Then, we enter the following data:

• The degrees of freedom are equal to 15 - 1 = 14.
• The t score is equal to -0.7745966.

The calculator displays the cumulative probability: 0.226. Hence, if


the true bulb life were 300 days, there is a 22.6% chance that the
average bulb life for 15 randomly selected bulbs would be less than
or equal to 290 days.

Solution B:

This time, we will work directly with the raw data from the problem.
We will not compute the t score; the T Distribution Calculator will do
that work for us. Since we will work with the raw data, we select
"Sample mean" from the Random Variable dropdown box. Then,
we enter the following data:

• The degrees of freedom are equal to 15 - 1 = 14.
• Assuming the CEO's claim is true, the population mean equals 300.
• The sample mean equals 290.
• The standard deviation of the sample is 50.

The calculator displays the cumulative probability: 0.226. Hence,


there is a 22.6% chance that the average sampled light bulb will
burn out within 290 days.

Problem 2

Suppose scores on an IQ test are normally distributed, with a mean


of 100. Suppose 20 people are randomly selected and tested. The
standard deviation in the sample group is 15. What is the
probability that the average test score in the sample group will be at
most 110?

Solution:

To solve this problem, we will work directly with the raw data from
the problem. We will not compute the t score; the T Distribution
Calculator will do that work for us. Since we will work with the raw
data, we select "Sample mean" from the Random Variable
dropdown box. Then, we enter the following data:

• The degrees of freedom are equal to 20 - 1 = 19.
• The population mean equals 100.
• The sample mean equals 110.
• The standard deviation of the sample is 15.

We enter these values into the T Distribution Calculator. The


calculator displays the cumulative probability: 0.996. Hence, there
is a 99.6% chance that the sample average will be no greater than
110.

t-Test for the Significance of the Difference between the Means


of Two Independent Samples

This is probably the most widely used statistical test of all time,
and certainly the most widely known. It is simple, straightforward,
easy to use, and adaptable to a broad range of situations.
No statistical toolbox should ever be without it.

Its utility is occasioned by the fact that scientific research very


often examines the phenomena of nature two variables at a time,
with an eye toward answering the basic question: Are these two
variables related? If we alter the level of one, will we thereby alter
the level of the other? Or alternatively: If we examine two different
levels of one variable, will we find them to be associated with
different levels of the other?

Here are three examples to give you an idea of how these


abstractions might find expression in concrete reality. On the left of
each row of cells is a specific research question, and on the right is
a brief account of a strategy that might be used to answer it. The
first two examples illustrate a very frequently employed form of
experimental design that involves randomly sorting the members of
a subject pool into two separate groups, treating the two groups
differently with respect to a certain independent variable, and then
measuring both groups on a certain dependent variable with the
aim of determining whether the differential treatment produces
differential effects. (Variables: Independent and Dependent.) A
quasi-experimental variation on this theme, illustrated by the third
example, involves randomly selecting two groups of subjects that
already differ with respect to one variable, and then measuring both

groups on another variable to determine whether the different levels


of the first are associated with different levels of the second.

Question: Does the presence of a certain kind of mycorrhizal fungus
enhance the growth of a certain kind of plant?

Strategy: Begin with a "subject pool" of seeds of the type of plant in
question. Randomly sort them into two groups, A and B. Plant and grow
them under conditions that are identical in every respect except one:
namely, that the seeds of group A (the experimental group) are grown
in a soil that contains the fungus, while those of group B (the
control group) are grown in a soil that does not contain the fungus.
After some specified period of time, harvest the plants of both groups
and take the relevant measure of their respective degrees of growth.
If the presence of the fungus does enhance growth, the average measure
should prove greater for group A than for group B.

Question: Do two types of music, type-I and type-II, have different
effects upon the ability of college students to perform a series of
mental tasks requiring concentration?

Strategy: Begin with a subject pool of college students, relatively
homogeneous with respect to age, record of academic achievement, and
other variables potentially relevant to the performance of such a
task. Randomly sort the subjects into two groups, A and B. Have the
members of each group perform the series of mental tasks under
conditions that are identical in every respect except one: namely,
that group A has music of type-I playing in the background, while
group B has music of type-II. (Note that the distinction between
experimental and control group does not apply in this example.)
Conclude by measuring how well the subjects perform on the series of
tasks under their respective conditions. Any difference between the
effects of the two types of music should show up as a difference
between the mean levels of performance for group A and group B.

Question: Do two strains of mice, A and B, differ with respect to
their ability to learn to avoid an aversive stimulus?

Strategy: With this type of situation you are in effect starting out
with two subject pools, one for strain A and one for strain B. Draw a
random sample of size Na from pool A and another of size Nb from pool
B. Run the members of each group through a standard
aversive-conditioning procedure, measuring for each one how well and
quickly the avoidance behavior is acquired. Any difference between the
avoidance-learning abilities of the two strains should manifest itself
as a difference between their respective group means.

In each of these cases, the two samples are independent of


each other in the obvious sense that they are separate samples
containing different sets of individual subjects. The individual
measures in group A are in no way linked with or related to any of
the individual measures in group B, and vice versa. The version of
a t-test examined in this chapter will assess the significance of the
difference between the means of two such samples, providing:
(i) that the two samples are randomly drawn from normally
distributed populations; and (ii) that the measures of which the two
samples are composed are equal-interval.

To illustrate the procedures for this version of a t-test, imagine we


were actually to conduct the experiment described in the second of
the above examples. We begin with a fairly homogeneous subject
pool of 30 college students, randomly sorting them into two groups,
A and B, of sizes Na=15 and Nb=15. (It is not essential for this
procedure that the two samples be of the same size.) We then have
the members of each group, one at a time, perform a series of 40
mental tasks while one or the other of the music types is playing in
the background. For the members of group A it is music of type-I,
while for those of group B it is music of type-II. The following table
shows how many of the 40 components of the series each subject
was able to complete. Also shown are the means and sums of
squared deviates for the two groups.
Group A (music of type-I):  26 21 22 26 19 22 26 25 24 21 23 23 18 29 22
Group B (music of type-II): 18 23 21 20 20 29 20 16 20 26 21 25 17 18 19

Na = 15          Nb = 15
Ma = 23.13       Mb = 20.87
SSa = 119.73     SSb = 175.73

Ma - Mb = 2.26
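As a quick check on the summary statistics above, the means and sums of squared deviates can be recomputed from the raw scores (a sketch we add; the helper name mean_and_ss is our own):

```python
def mean_and_ss(xs):
    """Return the sample mean and the sum of squared deviations, SS = sum((x - mean)^2)."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs)

group_a = [26, 21, 22, 26, 19, 22, 26, 25, 24, 21, 23, 23, 18, 29, 22]  # type-I music
group_b = [18, 23, 21, 20, 20, 29, 20, 16, 20, 26, 21, 25, 17, 18, 19]  # type-II music

ma, ssa = mean_and_ss(group_a)
mb, ssb = mean_and_ss(group_b)
print(round(ma, 2), round(ssa, 2))  # 23.13 119.73
print(round(mb, 2), round(ssb, 2))  # 20.87 175.73
```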

Null Hypothesis

Recall from Chapter 7 that whenever you perform a statistical


test, what you are testing, fundamentally, is the null hypothesis. In

general, the null hypothesis is the logical antithesis of whatever


hypothesis it is that the investigator is seeking to examine. For the
present example, the research hypothesis is that the two types of
music have different effects, so the null hypothesis is that they do
not have different effects. Its immediate implication is that any
difference we find between the means of the two samples should not
significantly differ from zero.

If the investigator specifies the direction of the difference in


advance as either

task performance will be better with type-I music than with type-II

    (which would be supported by finding the mean of sample A to be
    significantly greater than the mean of sample B: Ma > Mb)

or

task performance will be better with type-II music than with type-I

    (which would be supported by finding the mean of sample B to be
    significantly greater than the mean of sample A: Mb > Ma)

then the research hypothesis is directional and permits a one-tail


test of significance. A non-directional research hypothesis would
require a two-tail test, as it is the equivalent of saying "I'm expecting
a difference in one direction or the other, but I can't guess which."
For the sake of discussion, let us suppose we had started out with
the directional hypothesis that task performance will be better with
type-I music than with type-II music. Clearly our observed result,
Ma—Mb=2.26, is in the hypothesized direction. All that remains is to
determine how confident we can be that it comes from anything
more than mere chance coincidence.

Logic and Procedure

(1) The mean of a sample randomly drawn from a normally distributed
    source population belongs to a sampling distribution of sample
    means that is also normal in form. The overall mean of this
    sampling distribution will be identical with the mean of the
    source population:

        μM = μsource

(2) For two samples, each randomly drawn from a normally distributed
    source population, the difference between the means of the two
    samples, Ma - Mb, belongs to a sampling distribution that is
    normal in form, with an overall mean equal to the difference
    between the means of the two source populations:

        μM-M = μsource A - μsource B

    On the null hypothesis, μsource A and μsource B are identical,
    hence

        μM-M = 0

(3) For the present example, the null hypothesis holds that the two
    types of music do not have differential effects on task
    performance. This is tantamount to saying that the measures of
    task performance in groups A and B are all drawn indifferently
    from the same source population of such measures. In the items
    below, the phrase "source population" is a shorthand way of
    saying "the population of measures that the null hypothesis
    assumes to have been the common source of the measures in both
    groups."

(4) If we knew the variance of the source population, we would then
    be able to calculate the standard deviation (aka "standard
    error") of the sampling distribution of sample-mean differences
    as

        σM-M = sqrt[ σ²source/Na + σ²source/Nb ]

    This, in turn, would allow us to test the null hypothesis for any
    particular Ma - Mb difference by calculating the appropriate
    z-ratio

        z = (Ma - Mb) / σM-M

    and referring the result to the unit normal distribution.

(5) In most practical research situations, however, the variance of
    the source population, hence also the value of σM-M, can be
    arrived at only through estimation. In these cases the test of
    the null hypothesis is performed not with z but with t:

        t = (Ma - Mb) / est.σM-M

    The resulting value belongs to the particular sampling
    distribution of t that is defined by df = (Na - 1) + (Nb - 1).

(6) To help you keep track of where the particular numerical values
    are coming from beyond this point, here again are the summary
    statistics for our hypothetical experiment on the effects of two
    types of music:

        Group A (music of type-I):  Na = 15, Ma = 23.13, SSa = 119.73
        Group B (music of type-II): Nb = 15, Mb = 20.87, SSb = 175.73
        Ma - Mb = 2.26

(7) As indicated in Chapter 9, the variance of the source population
    can be estimated as

        s²p = (SSa + SSb) / [(Na - 1) + (Nb - 1)]

    which for the present example comes out as

        s²p = (119.73 + 175.73) / (14 + 14) = 10.55

(8) This, in turn, allows us to estimate the standard deviation of
    the sampling distribution of sample-mean differences as

        est.σM-M = sqrt[ s²p/Na + s²p/Nb ]
                 = sqrt[ 10.55/15 + 10.55/15 ] = ±1.19

(9) And with this estimated value of est.σM-M in hand, we are then
    able to calculate the appropriate t-ratio as

        t = (Ma - Mb) / est.σM-M
          = (23.13 - 20.87) / 1.19 = +1.9

    with df = (15 - 1) + (15 - 1) = 28

In the calculation of a two-sample t-ratio, note that the sign of t
depends on the direction of the difference between Ma and Mb:
Ma > Mb will produce a positive value of t, while Ma < Mb will
produce a negative value of t.

Inference

Figure 11.1 shows the sampling distribution of t for df=28. Also


shown is the portion of the table of critical values of t (Appendix C)
that pertains to df=28. The designation "tobs" refers to our observed
value of t=+1.9. We started out with the directional research
hypothesis that task performance would be better for group A than
for group B, and as our observed result, Ma - Mb = 2.26, proved
consistent with that hypothesis, the relevant critical values of t are
those that pertain to a directional (one-tail) test of significance: 1.70
for the .05 level of significance, 2.05 for the .025 level, 2.47 for the
.01 level, and so on.

Figure 11.1. Sampling Distribution of t for df=28

If our observed value of t had ended up smaller than 1.70, the


result of the experiment would be non-significant vis-à-vis the
conventional criterion that the mere-chance probability of a result
must be equal to or less than .05. If it had come out at
precisely 1.70, we would conclude that the result is significant at
the .05 level. As it happens, the observed t meets and somewhat
exceeds the 1.70 critical value, so we conclude that our result is
significant somewhat beyond the .05 level. If the observed t had
been equal to or greater than 2.05, we would have been able to
regard the result as significant at or beyond the .025 level; and
so on.

The same logic would have applied to the left tail of the
distribution if our initial research hypothesis had been in the

opposite direction, stipulating that task performance would be better


with music of type-II than with music of type-I. In this case we would
have expected Ma to be smaller than Mb, which would have
entailed a negative sign for the resulting value of t.

If, on the other hand, we had begun with no directional hypothesis


at all, we would in effect have been expecting

either Ma > Mb or Ma < Mb

and that disjunctive expectation ("either the one or the other")


would have required a non-directional, two-tailed test. Note that for
a non-directional test our observed value of t=+1.9 (actually, for a
two-tailed test it would have to be regarded as t=±1.9) would not
be significant at the minimal .05 level. (The distinction between
directional and non-directional tests of significance is introduced in
Chapter 7.)

In this particular case, however, we did begin with a directional


hypothesis, and the obtained result as assessed by a directional
test is significant beyond the .05 level. The practical, bottom-line
meaning of this conclusion is that the likelihood of our experimental
result having come about through mere random variability—mere
chance coincidence, "sampling error," the luck of the scientific
draw—is somewhat less than 5%; hence, we can have about 95%
confidence that the observed result reflects something more than
mere random variability. For the present example, this "something
more" would presumably be a genuine difference between the
effects of the two types of music on the performance of this
particular type of task.
Step-by-Step Computational Procedure: t-Test for the Significance of
the Difference between the Means of Two Independent Samples

Note that this test makes the following assumptions and can be
meaningfully applied only insofar as these assumptions are met:

• that the two samples are independently and randomly drawn from the
  source population(s);
• that the scale of measurement for both samples has the properties
  of an equal-interval scale;
• that the source population(s) can be reasonably supposed to have a
  normal distribution.

Step 1. For the two samples, A and B, of sizes Na and Nb
respectively, calculate

    Ma and SSa, the mean and sum of squared deviates of sample A
    Mb and SSb, the mean and sum of squared deviates of sample B

Step 2. Estimate the variance of the source population as

    s²p = (SSa + SSb) / [(Na - 1) + (Nb - 1)]

Recall that "source population" in this context means "the population
of measures that the null hypothesis assumes to have been the common
source of the measures in both groups."

Step 3. Estimate the standard deviation of the sampling distribution
of sample-mean differences (the "standard error" of Ma - Mb) as

    est.σM-M = sqrt[ s²p/Na + s²p/Nb ]

Step 4. Calculate t as

    t = (Ma - Mb) / est.σM-M

Step 5. Refer the calculated value of t to the table of critical values


of t (Appendix C), with df=(Na—1)+(Nb—1). Keep in mind that a
one-tailed directional test can be applied only if a specific
directional hypothesis has been stipulated in advance; otherwise it
must be a non-directional two-tailed test.
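Steps 1 through 4 translate directly into code. The sketch below (our own function name, not the text's) reproduces the worked music-experiment example:

```python
import math

def two_sample_t(ma, ssa, na, mb, ssb, nb):
    """Steps 2-4: pooled variance estimate, standard error, and the t-ratio."""
    s2p = (ssa + ssb) / ((na - 1) + (nb - 1))  # Step 2: pooled variance
    se = math.sqrt(s2p / na + s2p / nb)        # Step 3: standard error
    return (ma - mb) / se                      # Step 4: t-ratio

# Summary statistics from the music experiment
t = two_sample_t(23.13, 119.73, 15, 20.87, 175.73, 15)
df = (15 - 1) + (15 - 1)
print(round(t, 1), df)  # 1.9 28
```

Referred to the t table with df = 28, this value exceeds the one-tail critical value of 1.70 for the .05 level, as in the text.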

Note that this chapter includes a subchapter on the Mann-Whitney
Test, which is a non-parametric alternative to the
independent-samples t-test.
Have you understood?
1. The average travel time taken based on a random sample of
10 people working in a company to reach the office is 40
minutes with a standard deviation of 10 minutes. Establish
the 95% confidence interval for the mean travel time of
everyone in the company and redesign the working hours.
2. A marketing manager of a fast-food restaurant in a city wishes
   to estimate the average yearly amount that families spend on
   fast-food restaurants. He wants the estimate to be within
   ± Rs 100 with a confidence level of 99%. It is known from an
   earlier pilot study that the standard deviation of family
   expenditure on fast-food restaurants is Rs 500. How many
   families must be chosen for this problem?
3. A company manufacturing sports goods wants to estimate the
   proportion of cricket players among high school students in
   India. The company wants the estimate to be within ± 0.03 with
   a confidence level of 99%. A pilot study done earlier reveals
   that out of 80 high school students, 36 students play cricket.
   What should be the sample size for this study?
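For exercises 2 and 3, the standard sample-size formulas n = (zσ/E)² and n = z²p(1-p)/E² can be sketched as follows, assuming the usual two-sided 99% normal critical value z ≈ 2.576 (the function names are our own):

```python
import math

def n_for_mean(z, sigma, error):
    """Sample size for estimating a mean: n = (z * sigma / E)^2, rounded up."""
    return math.ceil((z * sigma / error) ** 2)

def n_for_proportion(z, p, error):
    """Sample size for estimating a proportion: n = z^2 * p(1-p) / E^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / error ** 2)

Z99 = 2.576  # assumed two-sided 99% normal critical value

print(n_for_mean(Z99, 500, 100))             # exercise 2: 166 families
print(n_for_proportion(Z99, 36 / 80, 0.03))  # exercise 3: 1825 students
```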


5
INTRODUCTION TO STEADY-STATE
QUEUEING THEORY

This handout introduces basic concepts in steady-state queueing


theory for two very basic systems, later labeled M/M/1 and M/M/k
(k ≥ 2). Both assume Poisson arrivals and exponential service. While
the more general Little equations are shown here, for derivations and
greater depth consult texts such as chapter 6 of Banks, Carson,
Nelson, and Nicol, Discrete-Event System Simulation, 4th edition
(Prentice-Hall, 2005, ISBN 0-13-144679-7).
recommend if you desire further study in this area. You must keep
in mind that the formulas presented here are strictly for the steady-
state or long-term performance of queueing systems. Additionally
only the simplest statistical distribution assumptions (Poisson
arrivals, exponential service) are covered in this brief handout.
Nonetheless, if a closed form queueing solution exists to a
particular situation, then use it instead of putting together a
simulation. You are engineers and will be paid to use your brain to
find cost effective timely solutions. Try doing so in this class as
well.

The vast majority of this handout comes from a wonderful book


Quantitative Methods for Business, 9th Ed., by Anderson, Sweeney,
& Williams, ISBN#-324-18413-1 (newer editions are now out). This
is a very understandable non-calculus operations research book
that I highly recommend to everyone – get a copy and you will
finally understand things that have been confusing to you during
your tough ISE classes. As you bust your posterior (assuming that
you desire to obtain that hard to get passing mark in our class), you
should check your skills on solving homework problems in one of at
least the two following ways:

• Try the homework problems stored on our web page. The problems are
  in a handy Word file. The solutions are also in another Word file
  on the web page.
• Use the Excel template on the web page to check your solutions,
  especially on finding minimal total cost solutions, prior to our
  killer exams.

Hint: Do many problems of all types before our 2 hour exams. This
pertains to queueing, and anything covered by the syllabus. You
control your destiny in this course. No whining is anticipated after
the exams – suck it up. Imagine being a civil engineer that does
not do a full job on a bridge design and people die – partial
solutions do not make my day. So learn this stuff and learn it well.
Then consider taking ISE 704 later as you will have a major head
start on the grad students just learning simulation for the 1st time in
that class. Perhaps if you are really good, you can work on your
MBA at Otterbein later in your career.

Now into the wide world of queueing…

Recall the last time that you had to wait at a supermarket checkout
counter, for a teller at your local bank, or to be served at a
fast-food restaurant. In these and many other waiting line
situations, the time spent waiting is undesirable. Adding more
checkout clerks, bank tellers, or servers is not always the most
economical strategy for improving service, so businesses need to
determine ways to keep waiting times within tolerable limits.

Models have been developed to help managers understand and


make better decisions concerning the operation of waiting lines. In
quantitative methods terminology, a waiting line is also known as a
queue, and the body of knowledge dealing with waiting lines is
known as queuing theory. In the early 1900s, A. K. Erlang, a
Danish telephone engineer, began a study of the congestion and
waiting times occurring in the completion of telephone calls. Since
then, queuing theory has grown far more sophisticated with
applications in a wide variety of waiting line situations.

Waiting line models consist of mathematical formulas and


relationships that can be used to determine the operating
characteristics (performance measures) for a waiting line.
Operating characteristics of interest include the following:

1. The probability that no units are in the system
2. The average number of units in the waiting line


3. The average number of units in the system (the number of
units in the waiting line plus the number of units being
served)
4. The average time a unit spends in the waiting line
5. The average time a unit spends in the system (the waiting
time plus the service time)
6. The probability that an arriving unit has to wait for service

Managers who have such information are better able to make


decisions that balance desirable service levels against the cost of
providing the service.

STRUCTURE OF A WAITING LINE SYSTEM

To illustrate the basic features of a waiting line model, we consider


the waiting line at the Burger Dome fast-food restaurant. Burger
Dome sells hamburgers, cheeseburgers, french fries, soft drinks,
and milk shakes, as well as a limited number of specialty items
and dessert selections. Although Burger Dome would like to serve
each customer immediately, at times more customers arrive than
can be handled by the Burger Dome food service staff. Thus,
customers wait in line to place and receive their orders.

Burger Dome is concerned that the methods currently used to


serve customers are resulting in excessive waiting times.
Management wants to conduct a waiting line study to help
determine the best approach to reduce waiting times and improve
service.

Single-Channel Waiting Line

In the current Burger Dome operation, a server takes a customer's


order, determines the total cost of the order, takes the money from
the customer, and then fills the order. Once the first customer's
order is filled, the server takes the order of the next customer
waiting for service. This operation is an example of a
single-channel waiting line. Each customer entering the Burger
Dome restaurant must pass through the one channel-one
order-taking and order-filling station-to place an order, pay the bill,
and receive the food. When more customers arrive than can be
served immediately, they form a waiting line and wait for the
order-taking and order-filling station to become available. A
diagram of the Burger Dome single-channel waiting line is shown
in Figure 14.1.

Distribution of Arrivals

Defining the arrival process for a waiting line involves determining


the probability distribution for the number of arrivals in a given
period of time. For many waiting line situations, the arrivals occur
randomly and independently of other arrivals, and we cannot
predict when an arrival will occur. In such cases, quantitative
analysts have found that the Poisson probability distribution
provides a good description of the arrival pattern.

FIGURE 14.1  THE BURGER DOME SINGLE-CHANNEL WAITING LINE

[Figure: customers arrive and join a single waiting line; one server
(the order-taking and order-filling station) serves them; each
customer leaves the system after the order is filled.]

The Poisson probability function provides the probability of x
arrivals in a specific time period. The probability function is as
follows.

    P(x) = λ^x e^(-λ) / x!        (14.1)

where

    x = the number of arrivals in the time period
    λ = the mean number of arrivals per time period

Suppose that Burger Dome analyzed data on customer arrivals


and concluded that the mean arrival rate is 45 customers per hour.
For a one-minute period, the mean arrival rate would be λ = 45
customers/60 minutes = 0.75 customers per minute. Thus, we can use
the following Poisson probability function to compute the probability
of x customer arrivals during a one-minute period:

    P(x) = λ^x e^(-λ) / x! = (0.75)^x e^(-0.75) / x!        (14.2)

Thus, the probabilities of 0, 1, and 2 customer arrivals during a


one-minute period are

    P(0) = (0.75)^0 e^(-0.75) / 0! = e^(-0.75) = 0.4724
    P(1) = (0.75)^1 e^(-0.75) / 1! = 0.75 e^(-0.75) = 0.75(0.4724) = 0.3543
    P(2) = (0.75)^2 e^(-0.75) / 2! = (0.5625)(0.4724) / 2 = 0.1329

The probability of no customers in a one-minute period is 0.4724,


the probability of one customer in a one-minute period is 0.3543,
and the probability of two customers in a one-minute period is
0.1329. Table 14.1 shows the Poisson probabilities for customer
arrivals during a one-minute period.
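These probabilities can be reproduced with a few lines of code (a sketch we add; the function name is our own):

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of x arrivals: P(x) = lam^x * e^(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 45 / 60  # 45 customers per hour = 0.75 customers per minute
for x in range(3):
    print(x, round(poisson_pmf(x, lam), 4))
# prints: 0 0.4724, 1 0.3543, 2 0.1329
```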

The waiting line models that will be presented in Sections 14.2 and
14.3 use the Poisson probability distribution to describe the
customer arrivals at Burger Dome. In practice, you should record the
actual number of arrivals per time period for several days or weeks,
and compare the frequency distribution of the observed number of
arrivals to the Poisson probability distribution to determine whether
the Poisson probability distribution provides a reasonable
approximation of the arrival distribution.

TABLE 14.1  POISSON PROBABILITIES FOR THE NUMBER OF CUSTOMER
ARRIVALS AT A BURGER DOME RESTAURANT DURING A ONE-MINUTE PERIOD
(λ = 0.75)

    Number of Arrivals    Probability
    0                     0.4724
    1                     0.3543
    2                     0.1329
    3                     0.0332
    4                     0.0062
    5 or more             0.0010

Distribution of Service Times


The service time is the time a customer spends at the service
facility once the service has started. At Burger Dome, the service
time starts when a customer begins to place the order with the food
server and continues until the customer receives the order. Service
times are rarely constant. At Burger Dome, the number of items
ordered and the mix of items ordered vary considerably from one
customer to the next. Small orders can be handled in a matter of
seconds, but large orders may require more than two minutes.

Quantitative analysts have found that if the probability distribution


for the service time can be assumed to follow an exponential
probability distribution, formulas are available for providing useful
information about the operation of the waiting line. Using an
exponential probability distribution, the probability that the service
time will be less than or equal to a time of length t is

    P(service time <= t) = 1 - e^(-μt)        (14.3)

where

    μ = the mean number of units that can be served per time period

(A property of the exponential probability distribution is that there
is a 0.6321 probability that the random variable takes on a value
less than its mean.)

Suppose that Burger Dome studied the order-taking and order-filling
process and found that the single food server can process an average
of 60 customer orders per hour. On a one-minute basis, the mean
service rate would be μ = 60 customers/60 minutes = 1 customer per
minute. For example, with μ = 1, we can use equation (14.3) to
compute probabilities such as the probability an order can be
processed in 1/2 minute or less, 1 minute or less, and 2 minutes or
less. These computations are

    P(service time <= 0.5 min.) = 1 - e^(-1(0.5)) = 1 - 0.6065 = 0.3935
    P(service time <= 1.0 min.) = 1 - e^(-1(1.0)) = 1 - 0.3679 = 0.6321
    P(service time <= 2.0 min.) = 1 - e^(-1(2.0)) = 1 - 0.1353 = 0.8647

Thus, we would conclude that there is a 0.3935 probability that an


order can be processed in 1/2 minute or less, a 0.6321 probability
that it can be processed in 1 minute or less, and a 0.8647
probability that it can be processed in 2 minutes or less.
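Equation (14.3) can be checked the same way (again, the function name is our own):

```python
import math

def p_service_within(t, mu):
    """Exponential service: P(service time <= t) = 1 - e^(-mu * t)."""
    return 1 - math.exp(-mu * t)

mu = 60 / 60  # 60 orders per hour = 1 order per minute
for t in (0.5, 1.0, 2.0):
    print(t, round(p_service_within(t, mu), 4))
# prints: 0.5 0.3935, 1.0 0.6321, 2.0 0.8647
```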

In several waiting line models presented in this chapter, we


assume that the probability distribution for the service time follows

an exponential probability distribution. In practice, you should


collect data on actual service times to determine whether the
exponential probability distribution is a reasonable approximation
of the service times for your application.

Queue Discipline
In describing a waiting line system, we must define the manner in
which the waiting units are arranged for service. For the Burger
Dome waiting line, and in general for most customer oriented
waiting lines, the units waiting for service are arranged on a
first-come, first-served basis; this approach is referred to as an
FCFS queue discipline. However, some situations call for different
queue disciplines. For example, when people wait for an elevator,
the last one on the elevator is often the first one to complete
service (i.e., the first to leave the elevator). Other types of queue
disciplines assign priorities to the waiting units and then serve the
unit with the highest priority first. In this chapter we consider only
waiting lines based on a first-come, first-served queue discipline.

Steady-State Operation
When the Burger Dome restaurant opens in the morning, no
customers are in the restaurant. Gradually, activity builds up to a
normal or steady state. The beginning or start-up period is
referred to as the transient period. The transient period ends
when the system reaches the normal or steady-state operation.
The models described in this handout cover the steady-state
operating characteristics of a waiting line.

STEADY-STATE SINGLE-CHANNEL WAITING LINE


MODEL WITH POISSON ARRIVALS AND EXPONENTIAL
SERVICE TIMES
In this section we present formulas that can be used to determine
the steady-state operating characteristics for a single-channel
waiting line. The formulas are applicable if the arrivals follow a
Poisson probability distribution and the service times follow an
exponential probability distribution. As these assumptions apply to
the Burger Dome waiting line problem introduced in Section 14.1,
we show how formulas can be used to determine Burger Dome's
operating characteristics and thus provide management with helpful
decision-making information.

The mathematical methodology used to derive the formulas for the


operating characteristics of waiting lines is rather complex.
However, our purpose in this chapter is not to provide the
theoretical development of waiting line models, but rather to show
how the formulas that have been developed can provide
information about operating characteristics of the waiting line.

Operating Characteristics
The following formulas can be used to compute the steady-state
operating characteristics for a single-channel waiting line with
Poisson arrivals and exponential service times, where

 =
the mean number of arrivals per time
period (the mean arrival rate)
 = the mean number of services per time
period (the mean service rate)

Equations (14.4) through (14.10) do not provide formulas for optimal conditions. Rather, these equations provide information about the steady-state operating characteristics of a waiting line.

1. The probability that no units are in the system:

   P0 = 1 - λ/μ    (14.4)

2. The average number of units in the waiting line:

   Lq = λ² / (μ(μ - λ))    (14.5)

3. The average number of units in the system:

   L = Lq + λ/μ    (14.6)

4. The average time a unit spends in the waiting line:

   Wq = Lq / λ    (14.7)

5. The average time a unit spends in the system:

   W = Wq + 1/μ    (14.8)

6. The probability that an arriving unit has to wait for service:

   Pw = λ/μ    (14.9)

7. The probability of n units in the system:

   Pn = (λ/μ)^n P0    (14.10)


The values of the mean arrival rate λ and the mean service rate μ are clearly important components in determining the operating characteristics. Equation (14.9) shows that the ratio of the mean arrival rate to the mean service rate, λ/μ, provides the probability that an arriving unit has to wait because the service facility is in use. Hence, λ/μ often is referred to as the utilization factor for the service facility.

The operating characteristics presented in equations (14.4) through (14.10) are applicable only when the mean service rate μ is greater than the mean arrival rate λ --in other words, when λ/μ < 1. If this condition does not exist, the waiting line will continue to grow without limit because the service facility does not have sufficient capacity to handle the arriving units. Thus, in using equations (14.4) through (14.10), we must have μ > λ.

Operating Characteristics for the Burger Dome Problem


Recall that for the Burger Dome problem we had a mean arrival rate of λ = 0.75 customers per minute and a mean service rate of μ = 1 customer per minute. Thus, with μ > λ, equations (14.4) through (14.10) can be used to provide operating characteristics for the Burger Dome single-channel waiting line:

P0 = 1 - λ/μ = 1 - 0.75/1 = 0.25
Lq = λ² / (μ(μ - λ)) = 0.75² / (1(1 - 0.75)) = 2.25 customers
L = Lq + λ/μ = 2.25 + 0.75/1 = 3 customers
Wq = Lq / λ = 2.25 / 0.75 = 3 minutes
W = Wq + 1/μ = 3 + 1/1 = 4 minutes
Pw = λ/μ = 0.75/1 = 0.75
Equation (14.10) can be used to determine the probability of any
number of customers in the system. Applying it provides the
probability information in Table 14.2.
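Equations (14.4) through (14.10) are straightforward to translate into code. The sketch below (function and variable names are our own, not from the text) reproduces the Burger Dome figures with λ = 0.75 and μ = 1:

```python
def mm1_metrics(lam, mu):
    """Steady-state operating characteristics of a single-channel
    (M/M/1) waiting line, equations (14.4)-(14.9). Requires mu > lam."""
    if mu <= lam:
        raise ValueError("requires mu > lam for a steady state")
    p0 = 1 - lam / mu                # (14.4) P(no units in system)
    lq = lam**2 / (mu * (mu - lam))  # (14.5) avg number waiting in line
    l = lq + lam / mu                # (14.6) avg number in system
    wq = lq / lam                    # (14.7) avg time waiting in line
    w = wq + 1 / mu                  # (14.8) avg time in system
    pw = lam / mu                    # (14.9) P(arriving unit must wait)
    return {"P0": p0, "Lq": lq, "L": l, "Wq": wq, "W": w, "Pw": pw}

def mm1_pn(n, lam, mu):
    """Equation (14.10): probability of exactly n units in the system."""
    return (lam / mu) ** n * (1 - lam / mu)

print(mm1_metrics(0.75, 1.0))  # Burger Dome: Lq = 2.25, W = 4, ...
```

Calling `mm1_metrics(0.75, 1.25)` likewise reproduces the improved operating characteristics shown later in Table 14.3, and `mm1_pn` reproduces the probabilities in Table 14.2.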

Use of Waiting Line Models


The results of the single-channel waiting line for Burger Dome show
several important things about the operation of the waiting line. In
particular, customers wait an average of three minutes before

beginning to place an order, which appears somewhat long for a


business based on fast service.

TABLE 14.2  STEADY-STATE PROBABILITY OF n CUSTOMERS IN THE SYSTEM FOR THE BURGER DOME WAITING LINE PROBLEM

Number of Customers    Probability
0                      0.2500
1                      0.1875
2                      0.1406
3                      0.1055
4                      0.0791
5                      0.0593
6                      0.0445
7 or more              0.1335

In addition, the facts that the average number of customers waiting


in line is 2.25 and that 75% of the arriving customers have to wait
for service are indicators that something should be done to improve
the waiting line operation. Table 14.2 shows a 0.1335 probability
that seven or more customers are in the Burger Dome system at
one time. This condition indicates a fairly high probability that
Burger Dome will experience some long waiting lines if it continues
to use the single-channel operation.

If the operating characteristics are unsatisfactory in terms of


meeting company standards for service, Burger Dome's
management should consider alternative designs or plans for
improving the waiting line operation.

Improving the Waiting Line Operation


Waiting line models often indicate where improvements in operating
characteristics are desirable. However, the decision of how to
modify the waiting line configuration to improve the operating
characteristics must be based on the insights and creativity of the
analyst.

After reviewing the operating characteristics provided by the waiting


line model, Burger Dome's management concluded that
improvements designed to reduce waiting times are desirable. To
make improvements in the waiting line operation, analysts often
focus on ways to improve the service rate. Generally, service rate
improvements are obtained by making either or both the following
changes:

1. Increase the mean service rate  by making a


creative design change or by using new technology.
2. Add service channels so that more customers can be
served simultaneously.

Assume that in considering alternative 1, Burger Dome's


management decides to employ an order filler who will assist the
order taker at the cash register. The customer begins the service
process by placing the order with the order taker. As the order is
placed, the order taker announces the order over an intercom
system, and the order filler begins filling the order. When the order
is completed, the order taker handles the money, while the order
filler continues to fill the order. With this design, Burger Dome's
management estimates the mean service rate can be increased
from the current service rate of 60 customers per hour to 75
customers per hour. Thus, the mean service rate for the revised
system is,  = 75 customers/60 minutes = 1.25 customers per
minute. For  = 0.75 customers per minute and  = 1.25
customers per minute, equations (14.4) through (14.10) can be
used to provide the new operating characteristics for the Burger
Dome waiting line. These operating characteristics are summarized
in Table 14.3.

TABLE 14.3  OPERATING CHARACTERISTICS FOR THE BURGER DOME SYSTEM WITH THE MEAN SERVICE RATE INCREASED TO μ = 1.25 CUSTOMERS PER MINUTE

Probability of no customers in the system 0.400


Average number of customers in the waiting line 0.900
Average number of customers in the system 1.500
Average time in the waiting line 1.200 minutes
Average time in the system 2.000 minutes
Probability that an arriving customer has to wait 0.600
Probability that seven or more customers are in the system 0.028

The information in Table 14.3 indicates that all operating


characteristics have improved because of the increased service
rate. In particular, the average time a customer spends in the
waiting line has been reduced from 3 to 1.2 minutes and the
average time a customer spends in the system has been reduced
from 4 to 2 minutes. Are any other alternatives available that Burger

Dome can use to increase the service rate? If so, and if the mean
service rate μ can be identified for each alternative, equations
(14.4) through (14.10) can be used to determine the revised
operating characteristics and any improvements in the waiting line
system. The added cost of any proposed change can be compared
to the corresponding service improvements to help the manager
determine whether the proposed service improvements are
worthwhile.

As mentioned previously, another option often available is to


provide one or more additional service channels so that more than
one customer may be served at the same time. The extension of
the single-channel waiting line model to the multiple-channel
waiting line model is the topic of the next section.

Notes:
1. The assumption that arrivals follow a Poisson probability
distribution is equivalent to the assumption that the time
between arrivals has an exponential probability distribution.
For example, if the arrivals for a waiting line follow a Poisson
probability distribution with a mean of 20 arrivals per hour,
the time between arrivals will follow an exponential
probability distribution, with a mean time between arrivals of
1/20 or 0.05 hour.

2. Many individuals believe that whenever the mean service rate μ is greater than the mean arrival rate λ, the system should be able to handle or serve all arrivals. However, as the Burger Dome example shows, the variability of arrival times and service times may result in long waiting times even when the mean service rate exceeds the mean arrival rate. A contribution of waiting line models is that they can point out undesirable waiting line operating characteristics even when the μ > λ condition appears satisfactory.
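The equivalence described in note 1 can be checked numerically: under Poisson arrivals at rate λ, "no arrival within an interval of length t" is the same event as "the next interarrival time exceeds t". A small sketch (λ = 20 arrivals per hour, as in the note; function names are ours):

```python
import math

def poisson_pmf(x, lam, t):
    """P(exactly x arrivals in an interval of length t), Poisson rate lam."""
    return (lam * t) ** x * math.exp(-lam * t) / math.factorial(x)

def exp_interarrival_cdf(t, lam):
    """P(time between arrivals <= t) when arrivals are Poisson at rate lam."""
    return 1 - math.exp(-lam * t)

lam = 20.0  # arrivals per hour (mean interarrival time = 1/20 = 0.05 hour)
t = 0.05    # hours
print(poisson_pmf(0, lam, t))            # e^(-1), about 0.3679
print(1 - exp_interarrival_cdf(t, lam))  # the same probability
```

Both expressions evaluate to e^(-λt), which is why the two assumptions are interchangeable.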

STEADY-STATE MULTIPLE-CHANNEL WAITING LINE


MODEL WITH POISSON ARRIVALS AND EXPONENTIAL
SERVICE TIMES

You may be familiar with multiple-channel systems that also have multiple waiting lines. The waiting line model in this section has multiple channels, but only a single waiting line. Operating characteristics for a multiple-channel system are better when a single waiting line is used.

A multiple-channel waiting line consists of two or more service channels that are assumed to be identical in terms of service capability. In the multiple-channel system, arriving units wait in a single waiting line and then move to the first available channel to be served. The single-channel Burger Dome operation can be expanded to a two-channel system by opening a second service channel. Figure 14.3 shows a diagram of the Burger Dome two-channel waiting line.

In this section we present formulas that can be


used to determine the steady-state operating
characteristics for a multiple-channel waiting
line. These formulas are applicable if the
following conditions exist.

1. The arrivals follow a Poisson probability distribution.


2. The service time for each channel follows an exponential
probability distribution.
3. The mean service rate  is the same for each channel.
4. The arrivals wait in a single waiting line and then move to
the first open channel for service.

[Figure 14.3: Two-channel (k = 2) queue, assuming M/M/2 conditions. Customer arrivals join a single waiting line, go to the next open channel (Channel 1, Server A, or Channel 2, Server B), and leave after the order is filled.]

M/M/k Steady-State Operating Characteristics

The following formulas can be used to compute the steady-state


operating characteristics for multiple-channel waiting lines, where

λ = the mean arrival rate for the system
μ = the mean service rate for each channel
k = the number of channels

1. The probability that no units are in the system:

   P0 = 1 / [ Σ(n=0 to k-1) (λ/μ)^n / n!  +  ((λ/μ)^k / k!) (kμ / (kμ - λ)) ]    (14.11)

2. The average number of units in the waiting line:

   Lq = [ (λ/μ)^k λμ / ((k - 1)! (kμ - λ)²) ] P0    (14.12)

3. The average number of units in the system:

   L = Lq + λ/μ    (14.13)

4. The average time a unit spends in the waiting line:

   Wq = Lq / λ    (14.14)

5. The average time a unit spends in the system:

   W = Wq + 1/μ    (14.15)

6. The probability that an arriving unit has to wait for service:

   Pw = (1/k!) (λ/μ)^k (kμ / (kμ - λ)) P0    (14.16)

7. The probability of n units in the system:

   Pn = ((λ/μ)^n / n!) P0    for n ≤ k    (14.17)
   Pn = ((λ/μ)^n / (k! k^(n-k))) P0    for n > k    (14.18)

Because μ is the mean service rate for each channel, kμ is the mean service rate for the multiple-channel system. As was true for the single-channel waiting line model, the formulas for the operating characteristics of multiple-channel waiting lines can be applied only in situations where the mean service rate for the system is greater than the mean arrival rate for the system; in other words, the formulas are applicable only if kμ is greater than λ.

Some expressions for the operating characteristics of multiple-channel waiting lines are more complex than their single-channel counterparts. However, equations (14.11) through (14.18) provide the same information as provided by the single-channel model. To help simplify the use of the multiple-channel equations, Table 14.4 contains values of P0 for selected values of λ/μ and k. The values provided in the table correspond to cases where kμ > λ, and hence the service rate is sufficient to process all arrivals.

Operating Characteristics for the Burger Dome Problem


To illustrate the multiple-channel waiting line model, we return to
the Burger Dome fastfood restaurant waiting line problem. Suppose
that management wants to evaluate the desirability of opening a
second order-processing station so that two customers can be
served simultaneously. Assume a single waiting line with the first
customer in line moving to the first available server. Let us evaluate
the operating characteristics for this two channel system.

We use equations (14.12) through (14.18) for the k = 2 channel system. For a mean arrival rate of λ = 0.75 customers per minute and mean service rate of μ = 1 customer per minute for each channel, we obtain the operating characteristics:

P0 = 0.4545  (from Table 14.4 with λ/μ = 0.75)
Lq = [ (0.75/1)² (0.75)(1) / ((2 - 1)! (2(1) - 0.75)²) ] (0.4545) = 0.1227 customer
L = Lq + λ/μ = 0.1227 + 0.75/1 = 0.8727 customer
Wq = Lq / λ = 0.1227 / 0.75 = 0.1636 minute
W = Wq + 1/μ = 0.1636 + 1/1 = 1.1636 minutes
Pw = (1/2!) (0.75/1)² (2(1) / (2(1) - 0.75)) (0.4545) = 0.2045

Using equations (14.17) and (14.18), we can compute the


probabilities of n customers in the system. The results from these
computations are summarized in Table 14.5.
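Equations (14.11) through (14.16) can also be sketched in code. The helper below (names are ours, not from the text) reproduces the two-channel Burger Dome values without the rounding introduced by reading P0 from Table 14.4:

```python
import math

def mmk_metrics(lam, mu, k):
    """Steady-state characteristics of an M/M/k waiting line,
    equations (14.11)-(14.16). Requires k*mu > lam."""
    if k * mu <= lam:
        raise ValueError("requires k*mu > lam for a steady state")
    r = lam / mu
    # (14.11): probability that no units are in the system
    p0 = 1 / (sum(r**n / math.factorial(n) for n in range(k))
              + r**k / math.factorial(k) * (k * mu / (k * mu - lam)))
    # (14.12): average number of units in the waiting line
    lq = r**k * lam * mu / (math.factorial(k - 1) * (k * mu - lam) ** 2) * p0
    l = lq + r                       # (14.13)
    wq = lq / lam                    # (14.14)
    w = wq + 1 / mu                  # (14.15)
    pw = r**k / math.factorial(k) * (k * mu / (k * mu - lam)) * p0  # (14.16)
    return {"P0": p0, "Lq": lq, "L": l, "Wq": wq, "W": w, "Pw": pw}

m = mmk_metrics(0.75, 1.0, 2)
print(round(m["P0"], 4), round(m["Lq"], 4), round(m["W"], 4))
# prints: 0.4545 0.1227 1.1636
```

With k = 1 the same function collapses to the single-channel results of Section 14.2, which is a useful sanity check.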

TABLE 14.4  VALUES OF P0 FOR MULTIPLE-CHANNEL WAITING LINES WITH POISSON ARRIVALS AND EXPONENTIAL SERVICE TIMES

               Number of Channels (k)
Ratio λ/μ    2        3        4        5
0.15 0.8605 0.8607 0.8607 0.8607
0.20 0.8182 0.8187 0.8187 0.8187
0.25 0.7778 0.7788 0.7788 0.7788
0.30 0.7391 0.7407 0.7408 0.7408
0.35 0.7021 0.7046 0.7047 0.7047
0.40 0.6667 0.6701 0.6703 0.6703
0.45 0.6327 0.6373 0.6376 0.6376
0.50 0.6000 0.6061 0.6065 0.6065
0.55 0.5686 0.5763 0.5769 0.5769
0.60 0.5385 0.5479 0.5487 0.5488
0.65 0.5094 0.5209 0.5219 0.5220
0.70 0.4815 0.4952 0.4965 0.4966
0.75 0.4545 0.4706 0.4722 0.4724
0.80 0.4286 0.4472 0.4491 0.4493
0.85 0.4035 0.4248 0.4271 0.4274
0.90 0.3793 0.4035 0.4062 0.4065
0.95 0.3559 0.3831 0.3863 0.3867
1.00 0.3333 0.3636 0.3673 0.3678
1.20 0.2500 0.2941 0.3002 0.3011
1.40 0.1765 0.2360 0.2449 0.2463
1.60 0.1111 0.1872 0.1993 0.2014
1.80 0.0526 0.1460 0.1616 0.1646
2.00 0.1111 0.1304 0.1343
2.20 0.0815 0.1046 0.1094
2.40 0.0562 0.0831 0.0889
2.60 0.0345 0.0651 0.0721
2.80 0.0160 0.0521 0.0581
3.00 0.0377 0.0466
3.20 0.0273 0.0372
3.40 0.0186 0.0293
3.60 0.0113 0.0228
3.80 0.0051 0.0174
4.00 0.0130
4.20 0.0093
4.40 0.0063
4.60 0.0038
4.80 0.0017

We can now compare the steady-state operating characteristics of


the two-channel system to the operating characteristics of the
original single-channel system discussed in Section 14.2.

1. The average time a customer spends in the system (waiting time plus service time) is reduced from W = 4 minutes to W = 1.1636 minutes.
2. The average number of customers in the waiting line is
reduced from Lq = 2.25 customers to Lq = 0.1227 customers.

TABLE 14.5  STEADY-STATE PROBABILITY OF n CUSTOMERS IN THE SYSTEM FOR THE BURGER DOME TWO-CHANNEL WAITING LINE
Number of Customers Probability
0 0.4545
1 0.3409
2 0.1278
3 0.0479
4 0.0180
5 or more 0.0109

3. The average time a customer spends in the waiting line is


reduced from Wq = 3 minutes to Wq = 0.1636 minutes.
4. The probability that a customer has to wait for service is
reduced from PW = 0.75 to PW = 0.2045.

Clearly the two-channel system will significantly improve the


operating characteristics of the waiting line. However, adding an
order filler at each service station would further increase the mean
service rate and improve the operating characteristics. The final
decision regarding the staffing policy at Burger Dome rests with the
Burger Dome management. The waiting line study simply provides
the operating characteristics that can be anticipated under three
configurations: a single-channel system with one employee, a
single-channel system with two employees, and a two-channel
system with an employee for each channel. After considering these
results, what action would you recommend? In this case, Burger
Dome adopted the following policy statement: For periods when
customer arrivals are expected to average 45 customers per hour,
Burger Dome will open two order-processing channels with one
employee assigned to each.

By changing the mean arrival rate λ to reflect arrival rates at


different times of the day, and then computing the operating
characteristics, Burger Dome's management can establish
guidelines and policies that tell the store managers when they
should schedule service operations with a single channel, two
channels, or perhaps even three or more channels.

NOTE: The multiple-channel waiting line model is based on a


single waiting line. You may have also encountered situations
where each of the k channels has its own waiting line. Quantitative
analysts have shown that the operating characteristics of
multiple-channel systems are better if a single waiting line is used.
People like them better also; no one who comes in after you can
be served ahead of you. Thus, when possible, banks, airline
reservation counters, food-service establishments, and other
businesses frequently use a single waiting line for a multiple-
channel system.

14.4 LITTLE'S GENERAL RELATIONSHIPS FOR STEADY-STATE WAITING LINE MODELS

In Sections 14.2 and 14.3 we presented formulas for computing the


operating characteristics for single-channel and multiple-channel
waiting lines with Poisson arrivals and exponential service times.
The operating characteristics of interest included

Lq = the average number of units in the waiting


line
L = the average number of units in the system
Wq = the average time a unit spends in the
waiting line
W = the average time a unit spends in the
system

John Little showed that several relationships exist among these


four characteristics and that these relationships apply to a variety of
different waiting line systems. Two of the relationships, referred to
as Little's flow equations, are

L = λW    (14.19)
Lq = λWq    (14.20)

Equation (14.19) shows that the average number of units in the system, L, can be found by multiplying the mean arrival rate, λ, by the average time a unit spends in the system, W.

Equation (14.20) shows that the same relationship holds between


the average number of units in the waiting line, Lq, and the average
time a unit spends in the waiting line, Wq.

Using equation (14.20) and solving for Wq, we obtain

Wq = Lq / λ    (14.21)

Equation (14.21) follows directly from Little's second flow equation.


We used it for the single-channel waiting line model in Section 14.2
and the multiple-channel waiting line model in Section 14.3 [see
equations (14.7) and (14.14)]. Once Lq is computed for either of
these models, equation (14.21) can then be used to compute Wq.

Another general expression that applies to waiting line models is that the average time in the system, W, is equal to the average time in the waiting line, Wq, plus the average service time. For a system with a mean service rate μ, the mean service time is 1/μ. Thus, we have the general relationship

W = Wq + 1/μ    (14.22)

Recall that we used equation (14.22) to provide the average time in the system for both the single- and multiple-channel waiting line models [see equations (14.8) and (14.15)].
The importance of Little's flow equations is that they apply to any waiting line model regardless of whether arrivals follow the Poisson probability distribution and regardless of whether service times follow the exponential probability distribution. For example, in a study of the grocery checkout counters at Murphy's Foodliner, an analyst concluded that arrivals follow the Poisson probability distribution with the mean arrival rate of 24 customers per hour, or λ = 24/60 = 0.40 customers per minute. However, the analyst found that service times follow a normal probability distribution rather than an exponential probability distribution. The mean service rate was found to be 30 customers per hour, or μ = 30/60 = 0.50 customers per minute. A time study of actual customer waiting times showed that, on average, a customer spends 4.5 minutes in the system (waiting time plus checkout time); that is, W = 4.5. Using the waiting line relationships discussed in this section, we can now compute other operating characteristics for this waiting line. First, using equation (14.22) and solving for Wq, we have

Wq = W - 1/μ = 4.5 - 1/0.50 = 2.5 minutes

With both W and Wq known, we can use Little's flow equations, (14.19) and (14.20), to compute

L = λW = 0.40(4.5) = 1.8 customers
Lq = λWq = 0.40(2.5) = 1 customer

Murphy's Foodliner can now review these operating characteristics to see whether action should be taken to improve the service and to reduce the waiting time and the length of the waiting line.
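Because Little's flow equations are distribution-free, these back-of-the-envelope calculations are easy to script. A sketch of the Murphy's Foodliner example (variable names are ours; λ = 0.40, μ = 0.50, and the observed W = 4.5 come from the text):

```python
# Little's flow equations (14.19)-(14.22); they hold regardless of the
# arrival and service-time distributions.
lam = 0.40   # mean arrival rate, customers per minute
mu = 0.50    # mean service rate, customers per minute
w = 4.5      # observed average time in the system, minutes

wq = w - 1 / mu   # (14.22) rearranged: average time in the waiting line
l = lam * w       # (14.19): average number of customers in the system
lq = lam * wq     # (14.20): average number of customers in the waiting line

print(f"Wq = {wq:.1f} minutes, L = {l:.1f} customers, Lq = {lq:.1f} customer(s)")
```

Note that only one timing measurement (W) plus the two rates are needed to recover all four operating characteristics.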

Note: In waiting line systems where the length of the waiting line is
limited (e.g., a small waiting area), some arriving units will be
blocked from joining the waiting line and will be lost. In this case,

the blocked or lost arrivals will make the mean number of units entering the system something less than the mean arrival rate. By defining λ as the mean number of units joining the system, rather than the mean arrival rate, the relationships discussed in this section can be used to determine W, L, Wq, and Lq.

14.5 ECONOMIC ANALYSIS OF WAITING LINES

Frequently, decisions involving the design of waiting lines will be


based on a subjective evaluation of the operating characteristics of
the waiting line. For example, a manager may decide that an
average waiting time of one minute or less and an average of two
customers or fewer in the system are reasonable goals. The waiting
line models presented in the preceding sections can be used to
determine the number of channels that will meet the manager's
waiting line performance goals.

On the other hand, a manager may want to identify the cost of


operating the waiting line system and then base the decision
regarding system design on a minimum hourly or daily operating
cost. Before an economic analysis of a waiting line can be
conducted, a total cost model, which includes the cost of waiting
and the cost of service, must be developed.

To develop a total cost model for a waiting line, we begin by


defining the notation to be used:

Waiting cost is based on the average number of units in the system. It includes the time spent waiting in line plus the time spent being served. Adding more channels always improves the operating characteristics of the waiting line and reduces the waiting cost. However, additional channels increase the service cost. An economic analysis of waiting lines attempts to find the number of channels that will minimize the total cost.

cw = the waiting cost per time period for each unit
L = the average number of units in the system
cs = the service cost per time period for each channel
k = the number of channels
TC = the total cost per time period

The total cost is the sum of the waiting cost and the service cost; that is,

TC = cwL + csk    (14.23)

To conduct an economic analysis of a waiting line, we must obtain reasonable estimates of
the waiting cost and the service cost. Of these two costs, the
waiting cost is usually the more difficult to evaluate. In the Burger
Dome restaurant problem, the waiting cost would be the cost per
minute for a customer waiting for service. This cost is not a direct
cost to Burger Dome. However, if Burger Dome ignores this cost
and allows long waiting lines, customers ultimately will take their

business elsewhere. Thus, Burger Dome will experience lost sales


and, in effect, incur a cost.

The service cost is generally easier to determine. This cost is the


relevant cost associated with operating each service channel. In the
Burger Dome problem, this cost would include the server's wages,
benefits, and any other direct costs associated with operating the
service channel. At Burger Dome, this cost is estimated to be $7
per hour.

To demonstrate the use of equation (14.23), we assume that Burger Dome is willing to assign a cost of $10 per hour for customer waiting time. We use the average number of units in the system, L, as computed in Sections 14.2 and 14.3 to obtain the total hourly cost for the single-channel and two-channel systems:

Single-channel system (L = 3 customers):
TC = cwL + csk = 10(3) + 7(1) = $37.00 per hour

Two-channel system (L = 0.8727 customer):
TC = cwL + csk = 10(0.8727) + 7(2) = $22.73 per hour

Thus, based on the cost data provided by Burger Dome, the two-channel system provides the most economical operation.

Harper note: The value assigned to the waiting time of customers was very high in this problem. In most cases, the value assigned is less than the cost of a server or employee. Thus I would recommend something like 10-50% of the employee cost be assigned to the customer waiting time. Keep in mind that this is an intangible cost and is used solely to help balance good customer service with real operational staffing costs such as the cost of employees.

Figure 14.4 shows the general shape of the cost curves in the economic analysis of waiting lines. The service cost increases as the number of channels is increased. However, with more channels, the service is better. As a result, waiting time and cost decrease as the number of channels is increased. The number of channels that will provide a good approximation of the minimum total cost design can be found by evaluating the total cost for several design alternatives.

[FIGURE 14.4: The general shape of waiting cost, service cost, and total cost curves in waiting line models. Total cost per hour is plotted against the number of channels (k); service cost rises with k, waiting cost falls with k, and their sum, the total cost curve, is U-shaped.]
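The total-cost comparison of equation (14.23) extends naturally to any number of candidate designs: compute L for each one and pick the design with the lowest TC. A sketch (cost figures and L values from the Burger Dome example; the dictionary layout is our own):

```python
def total_cost(cw, cs, l, k):
    """Equation (14.23): TC = cw*L + cs*k."""
    return cw * l + cs * k

cw, cs = 10.0, 7.0  # waiting cost $/hr per customer, service cost $/hr per channel
designs = {1: 3.0, 2: 0.8727}  # number of channels k -> average number in system L

for k, l in designs.items():
    print(f"k = {k}: TC = ${total_cost(cw, cs, l, k):.2f} per hour")

best = min(designs, key=lambda k: total_cost(cw, cs, designs[k], k))
print("most economical design:", best, "channel(s)")
```

This prints $37.00 per hour for the single-channel design and $22.73 per hour for the two-channel design, confirming the comparison above.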

14.6 OTHER WAITING LINE MODELS

D. G. Kendall suggested a notation that is helpful in classifying the


wide variety of different waiting line models that have been
developed. The three-symbol Kendall notation is as follows:

A/B/k
where
A denotes the probability distribution for the arrivals
B denotes the probability distribution for the service time
k denotes the number of channels

Depending on the letter appearing in the A or B position, a variety of waiting line


systems can be described. The letters that are commonly used are as follows:

M designates a Poisson probability distribution for the arrivals


or an exponential probability distribution for service time.
The M stands for Markovian (a memory-less distribution).

D designates that the arrivals or the service time is deterministic


or constant

G designates that the arrivals or the service time has a general


probability distribution with a known mean and variance

Using the Kendall notation, the single-channel waiting line model


with Poisson arrivals and exponential service times is classified as
an M/M/1 model. The two-channel waiting line model with Poisson
arrivals and exponential service times presented in Section 14.3
would be classified as an M/M/2 model.

NOTES AND COMMENTS

In some cases, the Kendall notation is extended to five symbols. The fourth symbol indicates the largest number of units that can be in the system, and the fifth symbol indicates the size of the population. The fourth symbol is used in situations where the waiting line can hold a finite or maximum number of units, and the fifth symbol is necessary when the population of arriving units or customers is finite. When the fourth and fifth symbols of the Kendall notation are omitted, the waiting line system is assumed to have infinite capacity, and the population is assumed to be infinite.
Single-Channel Queueing Theory Example
On average, 6 customers reach a telephone booth every hour to make calls. Determine the probability that exactly 4 customers will arrive in a 30-minute period, assuming that arrivals follow a Poisson distribution.
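With λ = 6 arrivals per hour, the mean for a 30-minute period is λt = 6(0.5) = 3, so the required probability is P(x = 4) = 3⁴ e⁻³ / 4!. A sketch of the computation (function name is ours):

```python
import math

def poisson_prob(x, mean):
    """P(exactly x arrivals) for a Poisson distribution with the given mean."""
    return mean ** x * math.exp(-mean) / math.factorial(x)

mean = 6 * 0.5  # 6 arrivals/hour over a 30-minute (0.5 hour) period
p4 = poisson_prob(4, mean)
print(f"P(exactly 4 arrivals in 30 minutes) = {p4:.4f}")
```

This evaluates to approximately 0.1680.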

Queueing Theory

To solve the queueing system we rely on the underlying Markov


chain theory. However we can abstract away from this and use the
traffic intensity to derive measures for the queue. The traffic
intensity, t, is the service time divided by the inter-arrival time. From
this we calculate the measures used in the applet as follows:

Let pi represent the probability that there are i customers in the system. The probability that the system is idle, p0 (i.e. the Markov chain is in state 0), is given by

p0 = 1 - t.

The Utilisation, U, of the system is 1 - p0, i.e. the proportion of the time that it is not idle.

U = 1 - p0 = t.

The probability that the queue is non-empty, B, is the probability of not being in state 0 or state 1 of the Markov chain, i.e.

1 - p0 - p1 = 1 - (1-t) - (1-t)t = t - t + t² = t².

The expectation of the number of customers in the service centre,


N, is the sum over all states of the number of customers multiplied
by the probability of being in that state.

This works out to be t/(1-t).

The expectation of the number of customers in the queue is calculated similarly, but one multiplies by one less than the number of customers.

This works out to be t²/(1-t).
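Collecting these traffic-intensity formulas (valid for t < 1) gives a compact sketch; the function and key names below are ours:

```python
def queue_measures(t):
    """Measures of a single-server queue derived from the traffic
    intensity t (service time / inter-arrival time), for 0 <= t < 1."""
    if not 0 <= t < 1:
        raise ValueError("steady state requires 0 <= t < 1")
    return {
        "idle": 1 - t,                    # p0: probability the system is idle
        "utilisation": t,                 # U = 1 - p0
        "queue_nonempty": t**2,           # B: P(not in state 0 or state 1)
        "mean_in_centre": t / (1 - t),    # N: expected customers in the centre
        "mean_in_queue": t**2 / (1 - t),  # expected customers waiting in queue
    }

m = queue_measures(0.75)
print(m["mean_in_centre"], m["mean_in_queue"])  # prints: 3.0 2.25
```

With t = 0.75 this matches the Burger Dome M/M/1 figures from earlier in the chapter (L = 3, Lq = 2.25), since there t = λ/μ.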

Markov chains

A Markov chain is a system which has a set S of states and


changes randomly between these states in a sequence of discrete
steps. The length of time spent in each state is the 'sojourn time' in

that state, T. This is an exponentially distributed random variable


and in state i has parameter qi. If the system is in state i and makes
a transition then it has a fixed probability, pij, of being in state j.

We can construct a Markov chain from a queueing system as


follows; assign each possible configuration of the queue a state,
define the probability of moving from one state to another by the
probability of a customer arriving or departing. Thus state 0
corresponds to there being no customers in the system, state 1 to
there being one customer and so on. If we are in state i, then the
probability of moving to state i-1 is the probability of a customer
departing the system, and the probability of moving to state i+1 is
the probability of a customer arriving in the system (apart from the
special case of state 0 when we cannot have a departure).

The fact that we can construct Markov chains from queueing


systems means we can use standard techniques from Markov
chain theory to find, for example, the probability of the queue
having a particular number of customers (by finding the probability
of the corresponding Markov chain being in the corresponding
state).
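As an illustrative sketch (my own, with assumed rates, not taken from the notes), we can simulate this birth-death Markov chain for an M/M/1 queue: in each state the sojourn time is exponential with the total exit rate, and the next state is i+1 on an arrival or i-1 on a departure. The long-run fraction of time spent in state 0 should then match p0 = 1 - t:

```python
import random

def simulate_mm1(lam, mu, steps=200_000, seed=1):
    """Simulate the Markov chain of an M/M/1 queue; return time-weighted
    occupancy of each state."""
    random.seed(seed)
    state = 0
    time_in_state = {}
    for _ in range(steps):
        rate = lam if state == 0 else lam + mu   # total exit rate q_i
        sojourn = random.expovariate(rate)       # exponential sojourn time
        time_in_state[state] = time_in_state.get(state, 0.0) + sojourn
        if state == 0 or random.random() < lam / rate:
            state += 1                           # a customer arrives
        else:
            state -= 1                           # a customer departs
    total = sum(time_in_state.values())
    return {s: tm / total for s, tm in time_in_state.items()}

dist = simulate_mm1(lam=1.0, mu=2.0)   # t = 0.5, so p0 should be near 0.5
```

With t = 0.5 the simulated occupancies should come out close to p0 = 0.5 and p1 = 0.25, matching the formulas used earlier.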

Queueing Theory Basics


We have seen that as a system gets congested, the service delay
in the system increases. A good understanding of the relationship
between congestion and delay is essential for designing effective
congestion control algorithms. Queuing Theory provides all the
tools needed for this analysis. This article will focus on
understanding the basics of this topic.

Communication Delays
Before we proceed further, let's understand the different
components of delay in a messaging system. The total delay
experienced by messages can be classified into the following
categories:

Processing Delay
 This is the delay between the time of receipt of a packet for
  transmission and the point of putting it into the transmission
  queue.
 On the receive end, it is the delay between the time of reception of
  a packet in the receive queue and the point of actual processing of
  the message.
 This delay depends on the CPU speed and CPU load in the system.

Queuing Delay
 This is the delay between the point of entry of a packet into the
  transmit queue and the actual point of transmission of the message.
 This delay depends on the load on the communication link.

Transmission Delay
 This is the delay between the transmission of the first bit of the
  packet and the transmission of the last bit.
 This delay depends on the speed of the communication link.

Propagation Delay
 This is the delay between the point of transmission of the last bit
  of the packet and the point of reception of the last bit of the
  packet at the other end.
 This delay depends on the physical characteristics of the
  communication link.

Retransmission Delay
 This is the delay that results when a packet is lost and has to be
  retransmitted.
 This delay depends on the error rate on the link and the protocol
  used for retransmissions.

In this article we will be dealing primarily with queueing delay.
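The components above simply add up to the total one-way delay of a message. The sketch below uses purely hypothetical numbers, just to make the decomposition concrete:

```python
# Hypothetical delay budget for one message, in milliseconds.
delays_ms = {
    "processing": 0.5,      # CPU time to queue the packet
    "queuing": 2.0,         # waiting in the transmit queue
    "transmission": 1.2,    # first bit to last bit onto the link
    "propagation": 5.0,     # last bit across the physical medium
    "retransmission": 0.0,  # zero when no packets are lost
}

total_delay_ms = sum(delays_ms.values())
```

In this made-up budget the total comes to 8.7 ms, dominated by propagation; a congested link would instead be dominated by the queuing term.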

Little's Theorem
We begin our analysis of queueing systems by understanding
Little's Theorem. Little's theorem states that:

The average number of customers (N) can be determined from the
following equation:

N = lambda × T

Here lambda is the average customer arrival rate and T is the average
time a customer spends in the system.

Proof of this theorem can be obtained from any standard textbook


on queueing theory. Here we will focus on an intuitive
understanding of the result. Consider the example of a restaurant
where the customer arrival rate (lambda) doubles but the
customers still spend the same amount of time in the restaurant (T).
This will double the number of customers in the restaurant (N). By
the same logic, if the customer arrival rate remains the same but the
time customers spend in the restaurant doubles, this will also double
the total number of customers in the restaurant.

Queueing System Classification


With Little's Theorem, we have developed some basic
understanding of a queueing system. To further our understanding
we will have to dig deeper into characteristics of a queueing system
that impact its performance. For example, queueing requirements
of a restaurant will depend upon factors like:

 How do customers arrive in the restaurant? Are customer


arrivals more during lunch and dinner time (a regular
restaurant)? Or is the customer traffic more uniformly
distributed (a cafe)?
 How much time do customers spend in the restaurant? Do
customers typically leave the restaurant in a fixed amount of
time? Does the customer service time vary with the type of
customer?
 How many tables does the restaurant have for servicing
customers?

The above three points correspond to the most important


characteristics of a queueing system. They are explained below:

Arrival Process
 The probability density distribution that determines the customer
  arrivals in the system.
 In a messaging system, this refers to the message arrival
  probability distribution.

Service Process
 The probability density distribution that determines the customer
  service times in the system.
 In a messaging system, this refers to the message transmission time
  distribution. Since message transmission time is directly
  proportional to the length of the message, this parameter indirectly
  refers to the message length distribution.

Number of Servers
 The number of servers available to service the customers.
 In a messaging system, this refers to the number of links between
  the source and destination nodes.

Based on the above characteristics, queueing systems can be


classified by the following convention:

A/S/n

Where A is the arrival process, S is the service process and n is the
number of servers. A and S can be any of the following:

M (Markov): exponential probability density
D (Deterministic): all customers have the same value
G (General): any arbitrary probability distribution

Examples of queueing systems that can be defined with this


convention are:

 M/M/1: This is the simplest queueing system to analyze.
Here the arrival and service times are negative exponentially
distributed (Poisson process). The system consists of only
one server. This queueing system can be applied to a wide
variety of problems, as any system with a very large number
of independent customers can be approximated as a
Poisson process. Using a Poisson process for service time,
however, is not applicable in many applications and is only a
crude approximation. Refer to M/M/1 Queueing System for
details.
 M/D/n: Here the arrival process is Poisson and the service
time distribution is deterministic. The system has n servers
(e.g. a ticket booking counter with n cashiers, where the
service time can be assumed to be the same for all customers).
 G/G/n: This is the most general queueing system, where the
arrival and service time processes are both arbitrary. The
system has n servers. No analytical solution is known for this
queueing system.
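As a small illustration (a helper of my own, not part of the convention itself), Kendall's A/S/n notation can be expanded mechanically from the table above:

```python
# Mapping from Kendall-notation letters to their meanings.
PROCESS = {
    "M": "Markov (exponential) distribution",
    "D": "deterministic (all customers the same)",
    "G": "general (arbitrary) distribution",
}

def describe(kendall):
    """Expand a Kendall A/S/n string such as 'M/M/1' into plain English."""
    a, s, n = kendall.split("/")
    return f"arrivals: {PROCESS[a]}; service: {PROCESS[s]}; {n} server(s)"

desc = describe("M/M/1")
```

For instance, describe("M/D/3") reads back as exponential arrivals, deterministic service, three servers.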

Poisson Arrivals
M/M/1 queueing systems assume a Poisson arrival process. This
assumption is a very good approximation for arrival process in real
systems that meet the following rules:

1. The number of customers in the system is very large.


2. Impact of a single customer on the performance of the
system is very small, i.e. a single customer consumes a very
small percentage of the system resources.
3. All customers are independent, i.e. their decision to use the
system is independent of other users.

Cars on a Highway

As you can see these assumptions are fairly general, so they apply
to a large variety of systems. Let's consider the example of cars
entering a highway and see if the above rules are met.

1. Total number of cars driving on the highway is very large.


2. A single car uses a very small percentage of the highway
resources.
3. Decision to enter the highway is independently made by
each car driver.

The above observations mean that assuming a Poisson arrival


process will be a good approximation of the car arrivals on the
highway. If any one of the three conditions is not met, we cannot
assume Poisson arrivals. For example, if a car rally is being
conducted on a highway, we cannot assume that each car driver is
independent of each other. In this case all cars had a common
reason to enter the highway (start of the race).

Telephony Arrivals

Lets take another example. Consider arrival of telephone calls to a


telephone exchange. Putting our rules to test we find:

1. Total number of customers that are served by a telephone


exchange is very large.
2. A single telephone call takes a very small fraction of the
systems resources.
3. Decision to make a telephone call is independently made by
each customer.

Again, if the rules are not all met, we cannot assume telephone
arrivals are Poisson. If the telephone exchange is a PABX catering
to a few subscribers, the total number of customers is small, so
we cannot assume that rules 1 and 2 apply. If rules 1 and 2 do apply
but telephone calls are being initiated due to some disaster, the
calls cannot be considered independent of each other. This violates
rule 3.

Poisson Arrival Process

Now that we have established scenarios where we can assume an
arrival process to be Poisson, let's look at the probability
distribution for a Poisson process. The following equation describes
the probability of seeing n arrivals in a period from 0 to t:

P(n arrivals in the interval [0, t]) = ((lambda t)^n / n!) e^(-lambda t)

Where:

 t is used to define the interval 0 to t
 n is the total number of arrivals in the interval 0 to t
 lambda is the total average arrival rate in arrivals/sec
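The Poisson probability P(n arrivals in [0, t]) = ((lambda t)^n / n!) e^(-lambda t) can be sketched in a few lines (a helper of mine, not from the notes):

```python
import math

def p_arrivals(n, lam, t):
    """Poisson probability of exactly n arrivals in the interval [0, t]."""
    return (lam * t) ** n * math.exp(-lam * t) / math.factorial(n)

# With 0.1 arrivals/sec over 10 s the mean number of arrivals is 1,
# so seeing 0 arrivals and seeing 1 arrival are equally likely (e^-1 each).
p0 = p_arrivals(0, 0.1, 10)
p1 = p_arrivals(1, 0.1, 10)
```

Summing the probabilities over all n recovers 1, as a distribution must.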

Negative Exponential Arrivals

We have seen the Poisson probability distribution. This equation


gives information about how the probability is distributed over a
time interval. Unfortunately it does not give an intuitive feel of this
distribution. To get a good grasp of the equation we will analyze a
special case of the distribution, the probability of no arrivals taking
place over a given interval.

It's easy to see that by substituting n with 0, we get the following
equation:

P(no arrivals in the interval [0, t]) = e^(-lambda t)

This equation shows that the probability that no arrival takes place
during an interval from 0 to t is negative exponentially related to
the length of the interval. This is better illustrated with an
example.

Consider a highway with an average of 1 car arriving every 10 seconds
(a 0.1 cars/second arrival rate). The probability of not seeing a
single car on the highway decreases dramatically with the observation
period. If you observe the highway for a period of 1 second, there is
roughly a 90% chance that no car will be seen during that period. If
you monitor the highway for 20 seconds, there is only about a 13%
chance that you will not see a car on the highway. Put another way,
there is only about a 10% chance that two cars arrive less than one
second apart, and about an 87% chance that two cars arrive less than
20 seconds apart.

If the arrival rate were doubled (1 car arriving every 5 seconds), the
probability of not seeing a car in an interval would fall much more
steeply.
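The highway numbers above follow directly from the no-arrival probability e^(-lambda t), sketched here:

```python
import math

def p_no_arrival(lam, t):
    """Probability of zero Poisson arrivals in an interval of length t."""
    return math.exp(-lam * t)

p_1s = p_no_arrival(0.1, 1)          # ~0.905: an empty 1-second window
p_20s = p_no_arrival(0.1, 20)        # ~0.135: an empty 20-second window
p_1s_doubled = p_no_arrival(0.2, 1)  # doubling the rate steepens the fall
```

The ordering p_20s < p_1s_doubled < p_1s shows both effects at once: longer windows and higher rates each make an empty interval less likely.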

Poisson Service Times


In an M/M/1 queueing system we assume that service times for
customers are also negative exponentially distributed (i.e.
generated by a Poisson process). Unfortunately, this assumption is
not as general as the arrival time distribution. But it could still be a
reasonable assumption when no other data is available about
service times. Let's see a few examples:

Telephone Call Durations



Telephone call durations define the service time for utilization of


various resources in a telephone exchange. Let's see if telephone
call durations can be assumed to be negative exponentially
distributed.

1. Total number of customers that are served by a telephone


exchange is very large.
2. A single telephone call takes a very small fraction of the
systems resources.
3. Decision on how long to talk is independently made by each
customer.

From these rules it appears that negative exponential call hold
times are a good fit. Intuitively, the probability of a customer
making a very long call is very small, and there is a high
probability that a telephone call will be short. This matches the
observation that most telephony traffic consists of short-duration
calls. (The only problem with using the negative exponential
distribution is that it predicts a high probability of extremely
short calls.)

This result can be generalized in all cases where user sessions are
involved.

Transmission Delays

Let's see if we can assume negative exponential service times for
messages being transmitted on a link. Since the service time on a
link is directly proportional to the length of the message, the real
question is whether we can assume that message lengths in a protocol
are negative exponentially distributed.

As a first-order approximation you can assume so, but message
lengths aren't really independent of each other. Most communication
protocols exchange messages in a certain sequence, and the length
distribution is determined by the length of the messages in that
sequence. Thus we cannot assume that message lengths are
independent. For example, internet traffic message lengths are not
distributed in a negative exponential pattern. In fact, the length
distribution on the internet is bi-modal (i.e. it has two distinct
peaks). The first peak is around the length of a TCP ack message;
the second peak is around the average length of a data packet.

Single Server
With M/M/1 we have a single server for the queue. Suitability of
M/M/1 queueing is easy to identify from the server standpoint. For
example, a single transmit queue feeding a single link qualifies as a
single server and can be modeled as an M/M/1 queueing system. If

a single transmit queue is feeding two load-sharing links to the


same destination, M/M/1 is not applicable. M/M/2 should be used to
model such a queue.

M/M/1 Results
As we have seen earlier, M/M/1 can be applied to systems that
meet certain criteria. But if the system you are designing can be
modeled as an M/M/1 queueing system, you are in luck. The
equations describing an M/M/1 queueing system are fairly
straightforward and easy to use.

First we define p, the traffic intensity (sometimes called occupancy).
It is defined as the average arrival rate (lambda) divided by the
average service rate (mu):

p = lambda / mu

For a stable system the average service rate should always be higher
than the average arrival rate (otherwise the queue would rapidly race
towards infinity), so p should always be less than one. Also note
that we are talking about average rates here; the instantaneous
arrival rate may exceed the service rate. Over a longer time period,
the service rate should always exceed the arrival rate.

The mean number of customers in the system (N) can be found using
the following equation:

N = p / (1 - p)

You can see from the above equation that as p approaches 1, the
number of customers becomes very large. This can be easily justified
intuitively: p approaches 1 when the average arrival rate approaches
the average service rate. In this situation the server is almost
always busy, leading to a queue build-up (large N).

Lastly we obtain the total waiting time (including the service time):

T = 1 / (mu - lambda)

Again we see that as mean arrival rate (lambda) approaches mean


service rate (mu), the waiting time becomes very large.
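The M/M/1 results can be collected into a short sketch (my own helper, not from the notes), using p = lambda/mu, N = p/(1-p) and T = 1/(mu - lambda):

```python
def mm1(lam, mu):
    """M/M/1 occupancy, mean customers in system, and mean time in system."""
    assert lam < mu, "stable only when the arrival rate is below the service rate"
    p = lam / mu            # traffic intensity (occupancy)
    n = p / (1 - p)         # mean number of customers in the system
    t = 1 / (mu - lam)      # mean total time in the system (wait + service)
    return p, n, t

p, n, t = mm1(lam=8, mu=10)   # p = 0.8, N = 4 customers, T = 0.5 time units
```

Note that N = lam * t here (8 × 0.5 = 4), so the result is consistent with Little's theorem.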

Queuing theory
 Given the importance of response time (and throughput), we need a
means of computing values for these metrics.

 Our black box model:

 Let's assume our system is in steady-state (input rate = output
rate).

 The contents of our black box.

 I/O requests "depart" by being completed by the server.

Queuing theory
 Elements of a queuing system:
 Request & arrival rate
o This is a single "request for service".
o The rate at which requests are generated is the arrival rate.

 Server & service rate
o This is the part of the system that services requests.
o The rate at which requests are serviced is called the service
rate.

 Queue

o This is where requests wait between


the time they arrive and the time their
processing starts in the server.

Queuing theory
 Useful statistics
 Length_queue, Time_queue
o These are the average length of the queue and the average
time a request spends waiting in the queue.

 Length_server, Time_server
o These are the average number of tasks being serviced and the
average time each task spends in the server.
o Note that a server may be able to serve more than one request
at a time.

 Time_system, Length_system
o This is the average time a request (also called a task) spends
in the system.
o It is the sum of the time spent in the queue and the time spent
in the server.
o The length is just the average number of tasks anywhere in the
system.

Queuing theory
 Useful statistics
 Little's Law
o The mean number of tasks in the system = arrival rate × mean
response time.
o This is true only for systems in equilibrium.
 We must assume any system we study (for this class) is
in such a state.

 Server utilization
o This is just: Server utilization = arrival rate × Time_server.
o This must be between 0 and 1.
 If it is larger than 1, the queue will grow infinitely
long.
o This is also called traffic intensity.

Queuing theory
 Queue discipline
o This is the order in which requests are delivered to the
server.
o Common orders are FIFO, LIFO, and random.
o For FIFO, we can figure out how long a request waits in the
queue. The hardest parameter to figure out is the variability
of the service times, captured by C, the coefficient of
variance, whose derivation is in the book.
 (Don't worry about how to derive it - this isn't a
class on queuing theory.)
o We can just use the formula:

Time_queue = Time_server × Utilization × (1 + C)
             / (2 × (1 − Utilization))
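The FIFO wait computation can be sketched as below. This follows the standard M/G/1 result, in which C is the coefficient of variance of the service time (variance divided by the squared mean); for exponential service times C = 1 and the formula reduces to the familiar M/M/1 wait Time_server × U / (1 − U):

```python
def time_in_queue(time_server, utilisation, c):
    """M/G/1 mean queueing delay: Ts * U * (1 + C) / (2 * (1 - U))."""
    return time_server * utilisation * (1 + c) / (2 * (1 - utilisation))

# Exponential service (C = 1), 20 ms service time, 20% utilisation:
tq = time_in_queue(time_server=0.020, utilisation=0.2, c=1)   # 5 ms
```

With deterministic service times (C = 0) the wait halves, which is one reason why reducing service-time variability improves response time.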

Queuing theory
 Example: Given:
 Processor sends 10 disk I/Os per second (which are
exponentially distributed).
 Average disk service time is 20 ms.

 On average, how utilized is the disk?
o Utilization = arrival rate × Time_server = 10 × 0.020 s = 0.2,
i.e. the disk is busy 20% of the time.

 What is the average time spent in the queue?
o When the service distribution is exponential, we can use a
simplified formula for the average time spent waiting in
line:
Time_queue = Time_server × Utilization / (1 − Utilization)
= 20 ms × 0.2 / 0.8 = 5 ms.

 What is the average response time for a disk request (including
queuing time and disk service time)?
o Time_system = Time_queue + Time_server = 5 ms + 20 ms = 25 ms.
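Working the example above in code (10 disk I/Os per second, 20 ms average service time, exponential service distribution):

```python
arrival_rate = 10        # disk I/Os per second
time_server = 0.020      # 20 ms average disk service time, in seconds

utilisation = arrival_rate * time_server                     # 0.2 -> 20% busy
time_queue = time_server * utilisation / (1 - utilisation)   # 0.005 s = 5 ms
response_time = time_queue + time_server                     # 0.025 s = 25 ms
```

So even at only 20% utilisation, queueing already adds 25% on top of the raw service time.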

Queuing theory
 Basic assumptions made about problems:
 System is in equilibrium.
 Interarrival time (time between two successive requests
arriving) is exponentially distributed.
 Infinite number of requests.
 Server does not need to delay between servicing requests.
 No limit to the length of the queue and queue is FIFO.
 All requests must be completed at some point.

 This is called an M/G/1 queue


o M = exponential arrival distribution
o G = general service distribution (i.e. not necessarily
exponential)
o 1 = server can serve 1 request at a time

 It turns out this is a good model for computer science because
many arrival processes turn out to be exponential.

 Service times, however, may follow any of a number of
distributions.

Disk Performance Benchmarks


 We use these formulas to predict the performance of
storage subsystems.

 We also need to measure the performance of real


systems to:
 Collect the values of parameters needed for prediction.
 To determine if the queuing theory assumptions hold (e.g., to
determine if the queueing distribution model used is valid).

 Benchmarks:
 Transaction processing
o The purpose of these benchmarks is to
determine how many small (and
usually random) requests a system
can satisfy in a given period of time.

o This means the benchmark stresses


I/O rate (number of disk accesses per
second) rather than data rate (bytes of
data per second).

o Banks, airlines, and other large


customer service organizations are
most interested in these systems, as
they allow simultaneous updates to
little pieces of data from many
terminals.

Disk Performance Benchmarks


 TPC-A and TPC-B

o These are benchmarks designed by


the people who do transaction
processing.

o They measure a system's ability to do


random updates to small pieces of
data on disk.

o As the number of transactions is increased, so must the
number of requesters and the size of the account file be
increased.
 These restrictions are imposed to
ensure that the benchmark really
measures disk I/O.
 They prevent vendors from adding
more main memory as a database
cache, artificially inflating TPS rates.

 SPEC system-level file server (SFS)
o This benchmark was designed to
evaluate systems running Sun
Microsystems network file service,
NFS.

Disk Performance Benchmarks


 SPEC system-level file server (SFS)
o It was synthesized based on
measurements of NFS systems to
provide a reasonable mix of reads,
writes and file operations.

o Similar to TPC-B, SFS scales the size


of the file system according to the
reported throughput, i.e.,
 It requires that for every 100 NFS
operations per second, the size of the
disk must be increased by 1 GB.
 It also limits average response time to
50ms.

 Self-scaling I/O

This method of I/O benchmarking uses


a program that automatically scales

several parameters that govern


performance.

 Number of unique bytes touched.


 This parameter governs
the total size of the data
set.
 By making the value large,
the effects of a cache can
be counteracted.

Disk Performance Benchmarks


 Self-scaling I/O
 Percentage of reads.

 Average I/O request size.


 This is scalable since
some systems may work
better with large requests,
and some with small.

 Percentage of sequential requests.


 The percentage of
requests that sequentially
follow (address-wise) the
prior request.
 As with request size, some
systems are better at
sequential and some are
better at random requests.

 Number of processes.
 This is varied to control
concurrent requests, e.g.,
the number of tasks
simultaneously issuing I/O
requests.

Disk Performance Benchmarks


 Self-scaling I/O

o The benchmark first chooses a


nominal value for each of the five
parameters (based on the system's
performance).
 It then varies each parameter in turn
while holding the others at their
nominal value.
Performance can thus be graphed using
any of five axes to show the effects of
changing parameters on a system's
performance.





