KEMBAR78
ST202 Notes | PDF | Markov Chain | Stochastic Process
0% found this document useful (0 votes)
242 views52 pages

ST202 Notes

The document contains module notes on stochastic processes. It covers the basics of Markov chains, including definitions of Markov chains, transition matrices, and the Markov property. The notes also discuss recurrence and transience of states, branching processes for modeling diseases, invariant distributions of Markov chains, and using Markov chain Monte Carlo methods. Various theorems and examples are provided throughout the notes.

Uploaded by

Charlie
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
242 views52 pages

ST202 Notes

The document contains module notes on stochastic processes. It covers the basics of Markov chains, including definitions of Markov chains, transition matrices, and the Markov property. The notes also discuss recurrence and transience of states, branching processes for modeling diseases, invariant distributions of Markov chains, and using Markov chain Monte Carlo methods. Various theorems and examples are provided throughout the notes.

Uploaded by

Charlie
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 52

ST202 Stochastic Processes

Module Notes
Based on notes by Matt Ball, Ed Jackson, and Rosalia Linfield, adapted to the content covered
in 2022/23.

May 8, 2023

Contents
1 Basics of Markov Chains 1
1.1 Definition and basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Class structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Hitting times and absorption probabilities . . . . . . . . . . . . . . . . . . . . . 7
1.4 The strong Markov property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Recurrence and Transience 17

3 Branching Processes 24
3.1 Disease models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 The branching process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 The extinction probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Invariant Distributions 31
4.1 Invariant measures and distributions . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Convergence to equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Time reversal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 The Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

A Summary 44
A.1 Basics of Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A.2 Recurrence and transience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
A.3 Branching processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.4 Invariant distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1 BASICS OF MARKOV CHAINS 1

Introduction
A stochastic process is a set of random variables over an indexing set

{Xt : t ∈ T },

typically T = N or Z. Examples include a sequence of coin flips, the population at a given


time, or the value of a stock. We study discrete time-homogeneous Markov chains: these are
sequences of discrete random variables with the property that knowing the most recent value
of the process causes it to become independent of the past.

1 Basics of Markov Chains


1.1 Definition and basic properties
Let I denote the state space of discrete random variables, where i ∈ I is a state. APprobability
distribution on I will be denoted λ = (λi : i ∈ I) where λi ∈ [0, 1] for all i ∈ I and i∈I λi = 1,
so Z ∼ λ if P(Z = i) = λi for all i ∈ I.

Definition 1.1. A matrix P = (pij : i, j ∈ I) is a stochastic (or transition) matrix if

(i) 0 ⩽ pij ⩽ 1 for all entries pij , and


P
(ii) for all i ∈ I we have j∈I pij = 1.

Each row represents a distribution, since it contains non-negative entries which sum to 1.
A stochastic matrix contains the rules for a Markov chain – these are the possible states we
can move between and the probability of moving between these states. More formally:

Definition 1.2. Let λ be a distribution and P be a stochastic matrix. We say that (Xn )n⩾0 is
a Markov chain with initial distribution λ and stochastic matrix P if

(i) X0 ∼ λ, and

(ii) P(Xn+1 = in+1 | X0 = i0 , . . . , Xn = in ) = pin in+1 for all i0 , . . . , in+1 ∈ I.

We denote this by (Xn )n⩾0 ∼ Markov(λ, P ).

Theorem 1.1. A discrete-time stochastic process (Xn )n⩾0 is Markov(λ, P ) if and only if for
all i0 , . . . , iN ∈ I, we have

P(X0 = i0 , . . . , XN = iN ) = λi0 pi0 i1 · · · piN −1 iN . (1)

Proof. Suppose (Xn ) ∼ Markov(λ, P ). Then, by the multiplication rule,

P(X0 = i0 , . . . , XN = iN )
= P(X0 = i0 )P(X1 = i1 | X0 = i0 ) · · · P(XN = iN | X0 = i0 , . . . , XN −1 = iN −1 )
= λi0 pi0 i1 · · · piN −1 iN .
1 BASICS OF MARKOV CHAINS 2

Conversely, suppose (1) holds. By the law of total probability,


X
P(X0 = i0 , . . . , XN −1 = iN −1 ) = P(X0 = i0 , . . . , XN −1 = iN −1 , XN = iN )
iN ∈I
X
= λi0 pi0 i1 · · · piN −2 iN −1 piN −1 iN
iN ∈I
X
= λi0 pi0 i1 · · · piN −2 iN −1 piN −1 iN
iN ∈I

= λi0 pi0 i1 · · · piN −2 iN −1 .

By induction, this holds for all n ⩽ N , so

P(X0 = i0 , . . . , Xn = in ) = λi0 pi0 i1 · · · pin−1 in

and
P(X0 = i0 , . . . , Xn = in )
P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 ) =
P(X0 = i0 , . . . , Xn−1 = in−1 )
λi0 pi0 i1 · · · pin−1 in
= = pin−1 in .
λi0 pi0 i1 · · · pin−2 in−1

We now introduce an important distribution, called the unit mass distribution δi := (δij :
j ∈ I) defined by (
1 if i = j,
δij =
0 otherwise.

Theorem 1.2 (Markov property). Let (Xn )n⩾0 ∼ Markov(λ, P ). Then, conditional on
the event {Xm = i}, we have (Xm+n )n⩾0 ∼ Markov(δi , P ) and is independent of the random
variables X0 , . . . , Xm .
Proof. Let

A = {X0 = i0 , . . . , Xm = im }, B = {Xm+1 = im+1 , . . . , Xm+n = im+n }.

We need to show that there is conditional independence for A and B given that Xm = i. We
have
P(X0 = i0 , . . . , Xm = im , . . . , Xm+n = im+n ) · 1{im =i}
P(A ∩ B | Xm = i) =
P(Xm = i)
λi pi i · · · pim−1 im pim im+1 · · · pim+n−1 im+n · δi,im
= 0 01
P(Xm = i)
λi pi i · · · pim−1 im
= 0 01 δi,im pim im+1 · · · pim+n−1 im+n
P(Xm = i)
= P(A | Xm = i)P(B | Xm = i).

So provided that Xm = i, then Xm followed by the random variables in B is Markov with


Xm ∼ δi . Therefore (Xm+n )n⩾0 ∼ Markov(δi , P ).
1 BASICS OF MARKOV CHAINS 3

Example 1.1. Let (Xn )n⩾0 be a Markov chain with state space I = {1, . . . , 5} and the following
stochastic matrix. For clarity, we label the rows and columns according to the corresponding
states.
1 2 3 4 5
1 0 1 0 0 0 
2 1/2 0 0 1/2 0 
P = 3 1/5 0 4/5 0 0 .

4 0 1/3 1/3 0 1/3
 
5 0 0 0 0 1

Each entry of the matrix is the probability of random variable in the state given by its row
moving to the state given by its column. We can represent this Markov chain by a diagram:

1
1 2 2
1
3

1 1 1 4 5 1
2 3

1 3 1
5 3

4
5

Example 1.2 (Gene mutation model). Let (Xn )n⩾0 be a Markov chain representing a gene
mutating between two states I = {1, 2}, with probability α of mutating from state 1 to state 2
and probability β of mutating from state 2 to state 1. Then the stochastic matrix is given by

 1 2 
P = 1 1−α α .
2 β 1−β

The chain is represented by the following diagram:

α
1−α 1 2 1−β
β

We now discuss the probability that a Markov chain is in a given state after n steps. Define

Pi (Xn = j) := P(Xn = j | X0 = i).

This relies on knowing the values of the matrix P n . This is also a stochastic matrix. The
(n)
(i, j)th entry of P n will be denoted pij .
1 BASICS OF MARKOV CHAINS 4

Theorem 1.3 (n-step probabilities). Let (Xn )n⩾0 ∼ Markov(λ, P ). Then


(n)
(i) Pi (Xn = j) = pij ,

(ii) Pλ (Xn = j) := P(Xn = j | X0 ∼ λ) = (λP n )j , and

(iii) Pi (Xn = j) = P(Xm+n = j | Xm = i).

Proof. (i): We prove by induction on n. The base case n = 1 is easy. Suppose the condition
holds for some n. Now
X
Pi (Xn+1 = j) = P(Xn+1 = j, Xn = k | X0 = i)
k∈I
X
= P(Xn = k | X0 = i)P(Xn+1 = j | Xn = k, X0 = i)
k∈I

(n)
By the inductive hypothesis we have P(Xn = k | X0 = i) = pik and by the Markov property
we have P(Xn+1 = j | Xn = k, X0 = i) = pkj . Hence
X (n) (n+1)
Pi (Xn+1 = j) = pik pkj = pij
k∈I

as required.
(ii): We have
X X
Pλ (Xn = j) = P(Xn = j, X0 = k) = P(X0 = k)P(Xn = j | X0 = k).
k∈I k∈I

(n)
By definition P(X0 = k) = λk , and condition (i) tells us that P(Xn = j | X0 = k) = pkj . So
the sum is equal to
(n)
X
λk pkj = (λP n )j .
k∈I

(iii): This follows from the Markov property.


To avoid brute-force computing P n when n or P is large, we discuss two methods of com-
puting n-step probabilities more easily by hand.
Method 1: This method consists of solving a difference equation. We demonstrate this by the
following example.
(n)
Example 1.3. We find a formula for p11 in the stochastic matrix from Example 1.2
 
1−α α
P = .
β 1−β

Observe that P n+1 = P n P , so matrix multiplication gives


(n+1) (n) (n)
p11 = p11 (1 − α) + p12 β.
1 BASICS OF MARKOV CHAINS 5

Now P n is also a stochastic matrix, so


(n) (n)
p11 + p12 = 1.

Combining the two expressions gives


(n+1) (n)
p11 = (1 − α − β)p11 + β
(0)
and we know that p11 = 1, so we have a difference equation with an initial condition. Solving
this gives
(n) β α
p11 = + (1 − α − β)n .
α+β α+β
(The method to solve difference equations is not covered here – see the Week 1 handwritten
notes.)

Method 2: Assume P is diagonalisable. Then there exists a diagonal matrix


 
α1
D= ..
.
 

αd

where α1 , . . . , αd are the eigenvalues of P , such that P = U DU −1 for some invertible matrix U .
We then have  
α1n
P n = U Dn U −1 = U  ..  −1
U
.

αdn
(n)
so pij is of the form
c1 α1n + · · · + cd αdn
for some constants c1 , . . . , cd ∈ C, assuming the eigenvalues are distinct.
To work around possible complex terms, use polar form: if αk = a + ib then αk = reiθ =
(n)
r(cos θ + i sin θ) so αkn = rn (cos(nθ) + i sin(nθ)). Secondly, pij must be real so if any eigenvalues
are complex, we can write them in terms of sin and cos with real coefficients.
(n)
Then we can use boundary conditions, from the first few values of pij , to determine the
constants.
(n)
Example 1.4. We find p11 for the Markov chain with stochastic matrix
 
0 1 0
P =  0 1/2 1/2 .
1/2 0 1/2
(n)
The eigenvalues are 1, i/2, −i/2 and from this we deduce that p11 has the form
 n  n
(n) i i
p11 = c1 + c2 + c3 −
2 2
1 BASICS OF MARKOV CHAINS 6

for some constants c1 , c2 , c3 ∈ C, but we know that it is real, and


 n  n   
i 1 nπ  nπ 
± = cos ± i sin
2 2 2 2
(n)
so we can rewrite p11 in the form
 n 
(n) 1  nπ   nπ 
p11 =a+ b cos + c sin
2 2 2

where a, b, c ∈ R.
(n)
For our boundary conditions, we can just write down the first few values of p11 :
(0)
p11 = 1 = a + b,
(1) 1
p11 = 0 = a + c,
2
(2) 1
p11 = 0 = a − b
4
so a = 1/5, b = 4/5, c = −2/5 and
 n   nπ 
(n) 1 1 4  nπ  2
p11 = + cos − sin .
5 2 5 2 5 2

Remark. We proved in Quiz 1 that 1 is always an eigenvalue of a stochastic matrix.

1.2 Class structure


Definition 1.3. Let i, j ∈ I and P be a stochastic matrix. We say that i talks to j, denoted
(n)
by i → j, if there exists some n ⩾ 0 such that pij > 0.
We say that i communicates with j, denoted by i ↔ j, if there exist m, n ⩾ 0 such that
(n) (m)
pij > 0 and pji > 0.

Proposition 1.1. The relation ↔ induces an equivalence relation on I.


Proof. Exercise.
This equivalence relation partitions I into disjoint subsets called communicating classes.
Definition 1.4. A communicating class C ⊆ I is a maximal collection of states such that for
all i, j ∈ C we have i ↔ j and for every i ∈ C there is no k ∈ C c such that i ↔ k.

Definition 1.5.

(i) A Markov chain is irreducible if i ↔ j for all i, j ∈ I.

(ii) A class C is closed if for every i ∈ C, i only talks to states in C.

(iii) A state i is absorbing if {i} is a closed class.


1 BASICS OF MARKOV CHAINS 7

Example 1.5. Consider the following Markov chain, where the arrows denote nonzero proba-
bility.

10

1 2

5 8

4 3 7

6 9

It is not irreducible, and has the following communicating classes:

• {1, 2, 3} which is not closed;

• {4, 5, 6} and {7, 8, 9} which are closed;

• {10} which is absorbing.

We now introduce the notion of a simple random walk.

Definition 1.6. Let p ∈ [0, 1]. A simple random walk (SRW) on Z is a Markov chain with
state space I = Z, where for all i, j ∈ I

p
 if j = i + 1,
pij = q := 1 − p if j = i − 1,

0 otherwise.

If p = q = 1/2, we call the process a simple symmetric random walk.

Whenever p ∈ (0, 1) a SRW is irreducible with communicating class Z.

p p p p p p
··· −2 −1 0 1 2 ···
q q q q q q

1.3 Hitting times and absorption probabilities


Instead of finding the probability of reaching a state at a certain nth step of the Markov chain,
we may want to consider the probability that we ever reach the state at some point, and at
what step we can expect to reach the state.

Definition 1.7. Let (Xn )n⩾0 ∼ Markov(λ, P ). A random variable T : Ω → N ∪ {∞} is a


stopping time for (Xn )n⩾0 if for every n ⩾ 0 the event {T = n} only depends on X0 , . . . , Xn .
1 BASICS OF MARKOV CHAINS 8

In other words, you do not need any information about future events beyond Xn to know
that the event {T = n} has taken place.

Definition 1.8. Let (Xn )n⩾0 ∼ Markov(λ, P ) and A ⊆ I. The hitting time of A, denoted H A ,
is a stopping time given by

H A (ω) := inf{n ⩾ 0 : Xn (ω) ∈ A}

where we take inf ∅ = ∞.

Definition 1.9. Let (Xn )n⩾0 ∼ Markov(λ, P ) and A ⊆ I. The hitting probability of A from a
state i ∈ I is given by
hA A
i := Pi (H < ∞)

and the expected hitting time of A from i is given by


X
kiA := Ei [H A ] = nPi (H A = n)
n⩽∞

j
If A only contains one state, A = {j}, we denote hA A j
i and H by hi and H respectively.
If A is a closed class, we call hA
i an absorption probability.

Example 1.6. Let A = {2} and consider h21 for the Markov chain given by the stochastic
matrix  
0 1/2 1/2 0
1/2 1/2 0 0 
P = 1/3
.
1/3 0 1/3
0 0 0 1
To compute this, we split according to the next state obtained when moving to X1 . So
X
h21 = P1 (H 2 < ∞) = P1 (H 2 < ∞; X1 = j)
j∈I
X
= P1 (X1 = j)P1 (H 2 < ∞ | X1 = j)
j∈I
X
= p1j Pj (H 2 < ∞)
j∈I
1 1 1 1
= h22 + h23 = + h23 (2)
2 2 2 2
where, by splitting according to previous states,
1 1 1 1 1
h23 = h21 + h22 + h24 = h21 + . (3)
3 3 3 3 3
Solving the simultaneous equations (2) and (3) give
4 3
h21 = , h23 = .
5 5
1 BASICS OF MARKOV CHAINS 9

Theorem 1.4 (Hitting probabilities). The vector of hitting probabilities hA = (hAi : i ∈ I)


for A ⊆ I is the minimal non-negative solution of the system of linear equations

hA
i = X
1 if i ∈ A,
A A
hi = pij hj if i ∈
/ A.
j∈I

Proof. Since hitting times include n = 0 we clearly have hA i = 1 if i ∈ A. Otherwise, consider


the possible states for X1 :
X
hA
i = P i (H A
< ∞) = Pi (H A < ∞; X1 = j)
j∈I
X X
= pij Pj (H A < ∞) = pij hA
j .
j∈I j∈I

Non-negativity follows directly from hA i being a probability.


For minimality, suppose that x = (xi : i ∈ I) is another solution to the system of equations.
Then if i ∈ A we also have xi = 1 = hAi . For i ∈ / A we have
X X X
xi = pij xj = pij + pij xj
j∈I j∈A j ∈A
/
!
X X X X
= pij + pij pjk + pjk xk
j∈A j ∈A
/ k∈A k∈A
/
XX
= Pi (X1 ∈ A) + Pi (X1 ∈
/ A, X2 ∈ A) + pij pjk xk .
j ∈A
/ k∈A
/

By substituting sums for xk in the same way and repeating n times, we obtain

xi = Pi (X1 ∈ A) + Pi (X1 ∈
/ A, X2 ∈ A) + . . .
X X
+ Pi (X1 ∈
/ A, . . . , Xn−1 ∈
/ A, Xn ∈ A) + ··· pj1 j2 · · · pjn−1 jn xjn
j1 ∈A
/ jn ∈A
/

and this is at least Pi (H A ⩽ n). Finally, by taking the limit n → ∞ we obtain

xi ⩾ lim P(H A ⩽ n) = Pi (H A < ∞) = hA


i .
n→∞

Example 1.7 (Gamblers’ ruin). Suppose that a gambler enters a casino with wealth £i and
gambles, £1 at a time, with probability p that their stake is doubled and probability q = 1 − p
of losing it. What is the probability that the gambler will leave broke and how does this depend
on the initial wealth i?
We model this as a truncated SRW with state space I = N and p ∈ (0, 1).

p p
1 0 1 2 ···
q q q
1 BASICS OF MARKOV CHAINS 10

The communicating classes are

C1 = {1, 2, 3, . . . }, C2 = {0}

and the probabilities are given by




 1 if i = j = 0,

p if j = i + 1 ⩾ 2,
pij =


 q if j = i − 1 ⩾ 0,
0 otherwise.

We are interested in the hitting probability of A = {0} from i.


By Theorem 1.4, we have h00 = 1 and

h0i = ph0i+1 + qh0i−1

for i > 0. This defines a difference equation. Provided that p ̸= q (we will return to the case
p = q later) solving this gives general solution
 i
q
h0i =α+β
p

for constants α, β. We have an initial condition h00 = 1, so α + β = 1 and hence we can rewrite
the solution as  i !
q
h0i = 1 + β −1 .
p
The constant β depends on the value of p and q. Consider the following cases:
Case 1 : If q > p, we have q/p > 1, so (q/p)i − 1 > 0. To achieve a minimal and non-negative
solution we must have β = 0, otherwise we would have h0i > 1. Hence h0i = 1 for all i ⩾ 0.
Case 2 : If q < p, then q/p < 1, so (q/p)i − 1 < 0. A minimal solution requires a large value of
β. But if β > 1, then 1 − β < 0 which would give a negative solution because β(q/p)i → 0 as
i → ∞ and h0i = 1 − β + β(q/p)i . So we must have β = 1, hence h0i = (q/p)i for all i > 0.
Case 3 : If p = q = 1/2, then the auxiliary equation of the difference equation has repeated
roots so the general solution is instead

h0i = α + iβ

for constants α, β. h00 = 1 gives α = 1 and since h0i ∈ [0, 1] for all i we need β = 0, so h0i = 1
for all i ⩾ 0. Thus, even if a gambler goes to a fair casino with equal probability of winning
and losing their stake, they are certain to end up broke.

Example 1.8 (Birth-death chain). Consider the Markov chain with diagram:
1 BASICS OF MARKOV CHAINS 11

p1 p2
1 0 1 2 ···
q1 q2 q3

The state space I = N represents a population, pi the probability of a birth at population i


and qi = 1 − pi the probability of a death at population i. Given an initial population size i,
what is the probability that it will become extinct?
Our system of equations is h00 = 1 and

hi = pi hi+1 + qi hi−1

for i > 0. This difference equation does not have constant coefficients so we use an alternative
method.
Let ui = h0i−1 − h0i . Then

0 = qi h0i−1 − h0i + pi h0i+1 = qi (h0i−1 − h0i ) − pi (h0i − h0i+1 ) = qi ui − pi ui+1

so
qi qi qi−1
ui+1 = ui = ui−1 ,
pi pi pi−1
and repeating this i times gives
qi qi−1 · · · q1
ui+1 = γi u1 where γi = .
pi pi−1 · · · p1
Then
i
X i
X i
X
h00 − h0i = (h0j−1 − h0j ) = uj = u1 γj−1
j=1 j=1 j=1

and since h00 = 1 we have


i
X
h0i = 1 − u1 γj−1 .
j=1

We now have to consider what happens as i → ∞.


Case 1 : If ∞ 0 A
P
j=1 γj−1 diverges, we must have u1 = 0 so that hi ∈ [0, 1]. Hence hi = 1 for all
i ⩾ 0.
Case 2 : If ∞
P
j=1 γj−1 converges, it is positive since the terms
P∞ are products of probabilities, so
we must have u1 > 0. To ensure minimality we set u1 = 1/ j=1 γj−1 , so
Pi
j=1 γj−1
h0i = 1 − P∞ for all i > 0.
j=1 γj−1

We now make similar calculations for expected hitting times.

Example 1.9. Let A = {4} and consider ki4 for the Markov chain given by the same stochastic
1 BASICS OF MARKOV CHAINS 12

matrix as in Example 1.6,  


0 1/2 1/2 0
1/2 1/2 0 0 
P =
1/3 1/3 0 1/3 .

0 0 0 1
Using the same splitting argument
X X X
ki4 = E1 [H 4 ] = p1j E1 [H 4 | X1 = j] = p1j + p1j Ej [H 4 + 1]
j∈I j∈A j ∈A
/

since it takes one step to move into state j, so the hitting time from j is one greater than the
hitting time from i. Splitting further, we have that it is equal to
X X X X
p1j + p1j + p1j kj4 = 1 + p1j kj4 .
j∈A j ∈A
/ j ∈A
/ j ∈A
/

Our boundary condition k44 = 0 gives the following system of equations


1 1 4
k14 = 1 + k24 + k
2 2 3
1 1 4
k24 = 1 + k14 + k
2 2 3
1 1 4
k34 = 1 + k14 + k
3 3 2
k44 = 0

and solving this gives


k14 = 6, k24 = 5, k34 = 2.
Observe that 4 is an absorbing state. This influences the expected hitting times for other
states. For example, if we consider A = {2}, then k22 = 0 as expected, but for states 1, 2
and 4 there is a possibility to hit 4, and therefore stay in 4, without ever hitting 2. Thus
k12 = k32 = k42 = ∞.

Theorem 1.5 (Expected hitting times). The vector of expected hitting times k A = (kiA :
i ∈ I) for A ⊆ I is the minimal non-negative solution to the system of linear equations

kiA = 0 X if i ∈ A,
A A
ki = 1 + pij kj if i ∈
/ A.
j ∈A
/

Proof. Exercise.
Remark. We have ∞
X
Pi (H A ⩾ k) = Ei [H A ].
k=1

Proof. This can be deduced from the definition of expectation. Let X be a positive, discrete
random variable with values in N. Then (omitting the x = 0 case) we can write the expectation
1 BASICS OF MARKOV CHAINS 13

as a double sum and use Fubini’s theorem to interchange the sum:



X ∞ X
X x
E[X] = xP(X = x) = P(X = x)
x=1 x=1 k=1
X∞ X ∞ ∞
X
= P(X = x) = P(X ⩾ k).
k=1 x=k k=1

1.4 The strong Markov property


The Markov property states that the future is independent of the past given a fixed timestep.
We now look to extend this to the strong Markov property which asserts that the same result
holds if the time of conditioning is random and is a stopping time.
Theorem 1.6 (Strong Markov property). Let (Xn )n⩾0 ∼ Markov(λ, P ) and let T be a
stopping time for (Xn )n⩾0 . Then, conditional on {XT = i} and {T < ∞}, we have (XT +n )n⩾0 ∼
Markov(δi , P ) and is independent of X0 , . . . , XT .
Note that the Markov property is a special case of this.
Proof. Assume T < ∞ and let B be an event determined by X0 , . . . , XT . Then the event
B ∩ {T = m} for some m ∈ N only depends on X0 , . . . , Xm . Hence

P({XT = j0 , . . . , XT +n = jn } ∩ B ∩ {T = m} ∩ {XT = i})


= P(B ∩ {T = m} ∩ {XT = i})Pi (X0 = j0 , . . . , Xn = jn ) (4)

where we have used the Markov property in the secondS term. The events {T = m} for all
m ∈ N partition the event {T < ∞}, so {T < ∞} = ∞ m=0 {T = m}. So summing over all
values of m for (4) gives

P({XT = j0 , . . . , XT +n = jn } ∩ B ∩ {T < ∞} ∩ {XT = i})


X∞
= P(B ∩ {T = m} ∩ {XT = i})Pi (X0 = j0 , . . . , Xn = jn ).
m=0

Now dividing both sides by P({T < ∞} ∩ {XT = i}) gives

P({XT = j0 , . . . , XT +n = jn } ∩ B | {T < ∞} ∩ {XT = i})


= P(B | {T < ∞} ∩ {XT = i})Pi (X0 = j0 , . . . , Xn = jn )

and by Theorem 1.1, this has Markov(δi , P ) behaviour.


Example 1.10 (Embedded Markov chain). Let (Xn )n⩾0 ∼ Markov(λ, P ), J ⊆ I, and
define recursively the random variables
T0 = inf{n > 0 : Xn ∈ J}, Tm = inf{n > Tm−1 : Xn ∈ J}.
Assuming P(Tm < ∞) = 1 for all m, let Yn = XTn . We will show that (Yn )n⩾0 is a Markov
chain.
The sequence of times (Tm )m⩾0 is a sequence of stopping times for (Xn )n⩾0 . Moreover, the
1 BASICS OF MARKOV CHAINS 14

times Sm := Tm − Tm−1 are stopping times for the process (XTm−1 +n )n⩾0 , which we will prove
in Chapter 2.
So by the strong Markov property

P(Ym+1 = im+1 | Y0 = i0 , . . . , Ym = im ) = P(XTm+1 = im+1 | XT0 = i0 , . . . , XTm = im )


= Pim (XT0 = im+1 )

which we will denote by pim im+1 . Hence Yn = XTn is a Markov chain with stochastic matrix
P = (pij : i, j ∈ I).
To find these entries, we split by the following move. For all j ∈ / J we have pij = 0. If j ∈ J
then X X
pij = Pi (XT0 = j) = pij + pik Pk (XT0 = j) = pij + pik pkj .
k∈J
/ k∈J
/

For the next example we introduce the notion of probability generating functions. These
will also be used in Chapter 3.
Definition 1.10. Let X : Ω → R be a random variable. The probability generating function
(PGF) of X is defined by
GX (s) := E[sX ] for |s| < 1.
We have the following properties.
(i) Uniqueness: If X, Y are random variables and GX (s) = GY (s) for all s, then X and Y
have the same distribution.
(ii) Independence: If X, Y are independent random variables then GX+Y (s) = GX (s)GY (s).
(iii) We have
dk
lim GX (s) = E[X(X − 1) · · · (X − k + 1) | X < ∞].
s↗1 dsk

(iv) We have
1 dk
lim GX (s) = P (X = k).
s↘0 k! dsk

(v) If P(X ⩾ 0) = 1 then


lim GX (s) = P(X < ∞).
s↗1

Example: Let X ∼ Poisson(λ). The pmf is


(
µx e−µ
x!
if x ∈ N,
P(X = x) =
0 otherwise
so ∞ ∞
x −µ
X
xµ e −µ
X (µs)x
GX (s) = s =e = e−µ eµs = eµ(s−1) .
x=0
x! x=0
x!
We can verify property (iii) by checking that lims↗1 G′X (s) = E[X]:
lim G′X (s) = lim µeµ(s−1) = µ.
s↗1 s↗1
1 BASICS OF MARKOV CHAINS 15

Example 1.11 (Distribution of hitting times in gamblers’ ruin). Recall from the gam-
blers’ ruin example (Example 1.7) that the hitting times of state 0 are given by

1 if q ⩾ 12 ,
0
hi =  q i

p
if q < 12 .

We are interested in finding the probability of a specific hitting time P1 (H 0 = n), if we start
0
at 1. Let ϕ(s) := E1 [sH ] be the PGF of H 0 . The aim is to expand this as a power series in s.
By splitting according to the next states
0 0 +1 0
ϕ(s) = E1 [sH ] = pE2 [S H ] + qE1 [sH | X1 = 0]
0 0
because moving to state 2 increases the hitting time by 1. Now E2 [sH +1 ] = sE2 [sH ] by
factoring s. For the second term we are given X0 = 1 and X1 = 0 so the hitting time is
0
precisely 1, so E1 [sH | X1 = 0] = s. Hence
0 0
E1 [sH ] = psE2 [sH ] + qs. (5)

Conditioning on X0 = 2 we split H 0 as follows. Let H e 0 be the hitting time of 0 after the


chain hits 1. Then clearly
H0 = H1 + H e0

By the SMP, given {XH 1 = 1} and {H 1 < ∞}, then H e 0 and H 1 are independent, and by
construction are identically distributed. Now since H 0 is finite then H 1 must also be, since the
chain must hit 1 before it hits 0. So

E2 [sH ] = E2 [S H · 1{H 1 <∞} ]


0 0

0
= P2 (H 1 < ∞)E2 [sH | H 1 < ∞]
1 +H
e0
= P2 (H 1 < ∞)E2 [S H | H1 < ∞].

e 0 and H 1 this is equal to


Now by applying the SMP and independence of H

P2 (H 1 < ∞)E2 [sH | H 1 < ∞]E2 [sH | H 1 < ∞] = E2 [sH · 1{H 1 <∞} ]E2 [sH | H 1 < ∞]
1 e0 1 e0

e 0 , and symmetry, both terms equal E1 [sH 0 ] = ϕ(s) so


and by definition of H
0
E2 [sH ] = ϕ(s)2 .

Putting this back into (5) we obtain

ϕ(s) = psϕ(s)2 + qs.

Solving gives p
1± 1 − 4pqs2
ϕ(s) =
2ps
1 BASICS OF MARKOV CHAINS 16

so we need to choose a root. By property (iv) above and continuity of PGFs we require
ϕ(0) = lims↘0 ϕ(s) = P1 (H 0 = 0) ∈ [0, 1]. Taking the positive root gives ϕ(s) → ∞ as s ↘ 0.
By l’Hôpital’s rule we see that the negative root is the correct one:
1 1 8pqs s↘0
p −→ 0.
2p 2 1 − 4pqs2

Finally, we do a Taylor expansion of ϕ which yields


    
1 1 2 1 1 1 2 2
ϕ(s) = 1 − 1 + (−4pqs ) + − (−4pqs ) + . . .
2ps 2 2 2 2!
= qs + pq 2 s3 + . . .

and by definition of ϕ(s), we have



X
ϕ(s) = sx P1 (H 0 = x),
x=0

so by comparing coefficients we see that

P1 (H 0 = 1) = q, P1 (H 0 = 2) = 0, P1 (H 0 = 3) = pq 2 , P1 (H 0 = 4) = 0, ...
2 RECURRENCE AND TRANSIENCE 17

2 Recurrence and Transience


Definition 2.1. A state i ∈ I is recurrent if

P(Xn = i for infinitely many n) = 1.

It is transient if
P(Xn = i for infinitely many n) = 0.

Definition 2.2. The first passage time of state i is the stopping time

Ti (ω) := inf{n ⩾ 1 : Xn (ω) = i}

and inductively we define the rth passage time for r = 1, 2, 3, . . . by the stopping time
n o
(r) (r−1)
Ti (ω) := inf n > Ti : Xn (ω) = i ,

(0)
and Ti (ω) = 0.
The rth excursion time is defined by
(
(r) (r−1) (r−1)
(r) Ti − Ti if Ti < ∞,
Si :=
∞ otherwise.

from [J. R. Norris, Markov Chains]

In other words the rth passage time is the time of the rth visit to state i, not including 0.
The excursion time is the time between consecutive visits, which is not a stopping time for the
chain (Xn )n⩾0 , but it is for the following:
(r−1) (r)
Lemma 2.1. For r ⩾ 2, conditional on {Ti < ∞}, the time period Si is independent of
(r−1)
all {Xm : m ⩽ Ti } and
 
(r) (r−1)
P Si = n Ti < ∞ = Pi (Ti = n).
2 RECURRENCE AND TRANSIENCE 18

(r) (r−1)
Proof. The Ti are stopping times, so by the SMP and conditioning on {Ti < ∞} then
(XT (r−1) +n )n⩾0 is Markov(δi , P ) and independent of X0 , . . . , XT (r−1) . By definition
i i

n o
(r)
Si = inf n ⩾ 1 : XT (r−1) +n = i
i

(r)
so Si is the first passage time of the chain (XT (r−1) +n )n⩾0 to state i.
i

We now introduce the random variable Vi which counts all visits to a state i, defined by

1{Xn =i} .
X
Vi :=
n=0

Then ∞ ∞ ∞
Ei 1{Xn =i} =
(n)
X   X X
Ei [Vi ] = Pi (Xn = i) = pii .
n=0 n=0 n=0

Lemma 2.2. Let fi := Pi (Ti < ∞). For r = 0, 1, 2, . . . we have

Pi (Vi > r) = (fi )r .

This shows that Vi ∼ Geometric(1 − fi ).


Proof. Conditioning on {X0 = i} we see that
n o
(r)
{Vi > r} = Ti = ∞

because both show that there are at least r visits to state i after X0 .
We prove by induction. For r = 0 then

Pi (Vi > 1) = 1 = (fi )0 .

Now assuming true for some r we obtain


 
(r+1)
Pi (Vi > r + 1) = Pi Ti <∞
 
(r) (r+1)
= Pi Ti < ∞; Si <∞
   
(r) (r) (r)
= Pi Ti < ∞ Pi Si = n Ti < ∞ .

By Lemma 2.1 we obtain


 
(r)
Pi (Vi > r + 1) = Pi Ti < ∞ Pi (Ti < ∞)

then applying the induction hypothesis gives

Pi (Vi > r + 1) = (fi )r fi = (fi )r+1 .


2 RECURRENCE AND TRANSIENCE 19

Theorem 2.1 (Characterisation of recurrent and transient states).


(n)
(i) If Pi (Ti < ∞) = 1, then i is recurrent and ∞
P
n=1 pii = ∞.

(n)
(ii) If Pi (Ti < ∞) < 1, then i is transient and ∞
P
n=1 pii < ∞.

This tells us that any state is either recurrent or transient.


Proof. If fi := Pi (Ti < ∞) = 1 then by Lemma 2.2,

Pi (Vi = ∞) = lim Pi (Vi > n) = lim (fi )n = lim 1n = 1


n→∞ n→∞ n→∞

so i is recurrent, and
∞ ∞ ∞ ∞
(n)
X X X X
k
pii = Ei [Vi ] = Pi (Vi > k) = (fi ) = 1 = ∞.
n=1 k=0 k=0 k=0

On the other hand, if fi := Pi (Ti < ∞) < 1 then by Lemma 2.2,


∞ ∞ ∞
X (n)
X X 1
pii = Ei [Vi ] = Pi (Vi > k) = (fi )k = <∞
n=0 k=0 k=0
1 − fi

so, since the sum converges,

Pi (Vi = ∞) = lim Pi (Vi > r) = 0


r→∞

so i is transient.

Example: Consider the following Markov chain.

3 4 5 6 7

Can we find out whether state 5 is recurrent without knowing the exact probabilities?
Theorem 2.1 allows us to determine this easily. We have that

P(T5 < ∞) = 1 − P5 (T5 = ∞) ⩽ 1 − p57 < 1

since {6, 7} is a closed class.

Theorem 2.2 (Recurrence and transience are class properties). Let C be a communi-
cating class. Then either all states in C are transient, or all states in C are recurrent.
2 RECURRENCE AND TRANSIENCE 20

(r)
Proof. By Theorem 2.1, if i ∈ C is transient then ∞
P
r=0 pii < ∞. For any j ∈ C, since i ↔ j,
(n) (m)
there exist n, m ∈ N such that pij > 0 and pji > 0. For every r > 0 the law of total
probability gives
(r+m+n)
X (n) (r+m) X X (n) (r) (m) (n) (r) (m)
pii = pik pki = pik pkℓ pℓi ⩾ pij pjj pji (6)
k∈I k∈I ℓ∈I

so
(r+m+n)
(r) pii
pjj ⩽ (n) (m)
,
pij pji
which implies
∞ ∞ (r+m+n) ∞
X (r)
X p ii 1 X (r+m+n)
pjj ⩽ (n) (m)
= (n) (m)
pii <∞
r=0 r=0 pij pji pij pji r=0
so j is transient. This also proves the recurrence case.
The equalities in (6) are known as the Chapman-Kolmogorov equations.

Theorem 2.3. Every recurrent class is closed.


Proof. We prove the contrapositive. Let C be a communicating class which is not closed. Then
(m)
there exists i ∈ C which talks to some j ∈
/ C, so there is some m ∈ N such that pij > 0.
Denote by Ai the event {Xn = i for infinitely many n}. Then, since j ∈/ C and i → j then
j ↛ i, so Pi (Xm = j; Ai ) = 0. Hence

Pi (Ai ) = Pi (Xm = j; Ai ) + Pi (Xm ̸= j; Ai ) = Pi (Xm ̸= j; Ai )


(m)
⩽ Pi (Xm ̸= j) = 1 − pij < 1

so i is not recurrent.
Theorem 2.4. Every finite closed class is recurrent.
This is a partial converse to Theorem 2.3.
Proof. Let C be a finite closed class and X0 ∈ C. Since the chain cannot leave C, there is some
state i ∈ C visited infinitely often with nonzero probability. So

0 < P(Xn = i for infinitely many n)


= P(Xn = i for some n)Pi (Xn = i for infinitely many n)

which implies
Pi (Xn = i for infinitely many n) > 0
but this probability can only be 0 or 1 so it is equal to 1. Hence i is recurrent, so C is recurrent
by Theorem 2.2.
Theorem 2.5. Suppose that P is irreducible and recurrent. Then for every j ∈ I we have
P(Tj < ∞) = 1.
2 RECURRENCE AND TRANSIENCE 21

Proof. By the law of total probability


X
P(Tj < ∞) = P(X0 = i)Pi (Tj < ∞).
i∈I
P
Since i∈I P(X0 = i) = 1 then we just need to show that Pi (Tj < ∞) = 1 for all i.
Let m ∈ N. Then
1 = Pj (Xn = j for infinitely many n)
⩽ Pj (Xn = j for some n ⩾ m + 1)
X
= Pj (Xm = k)Pj (Xn = j for some n ⩾ m + 1 | Xm = k)
k∈I
X
= Pj (Xm = k)Pk (Tj < ∞)
k∈I
(m)
X
= pjk Pk (Tj < ∞).
k∈I
P (m)
Since k∈I pjk = 1 then Pk (Tj < ∞) = 1 for all k ∈ I, as desired.
We now investigate recurrence and transience of simple random walks.
Definition 2.3. A component-independent simple random walk (CISRW) on Zd for d ∈ N is
defined by
Xn+1 = Xn + Zn+1
1 d

where Zn+1 = Zn+1 , . . . , Zn+1 , with
(
j 1 with probability pj
Zm =
−1 with probability qj := 1 − pj

j
and the Zm are independent for all j, m.
This can be viewed as ‘moving along the diagonals’.
Proposition 2.1. Let (Xn )n⩾0 be a CISRW on Zd . If there exists j ∈ {1, . . . , d} such that
pj ̸= 1/2, then (Xn )n⩾0 is transient (every i ∈ Zd is transient).

Proof. Without loss of generality assume pj > qj . Since {T0 < ∞} ⊆ {T0j < ∞} for some
component j, where 0 denotes a zero on the jth component, then
P0 (T0 < ∞) ⩽ P0 (T0j < ∞) = P0 (T0j < ∞).
By splitting we obtain
P0 (T0j < ∞) = pj P1 (T0j < ∞) + qj P−1 (T0j < ∞)
and results from gamblers’ ruin in Example 1.7 can be used to show that P1 (T0j < ∞) < 1 and
P−1 (T0j < ∞) = 1. So
P0 (T0j < ∞) < pj + qj = 1
which implies the vector state 0 is transient, so all states are transient by irreducibility.
2 RECURRENCE AND TRANSIENCE 22

For a CISRW with all components symmetric (pj = qj = 1/2 for all j) then

• all states are recurrent for d ⩽ 2;

• all states are transient for d ⩾ 3.

We review the situation for dimensions 1, 2 and 3.


This requires Stirling’s formula:
√  n n
n! ∼ 2πn as n → ∞
e
where an ∼ bn if an /bn → 1 as n → ∞.

Example 2.1 (Symmetric SRW on Z). (Note that recurrence can alternatively be proved
(n)
using results from gamblers’ ruin.) Claim: ∞
P
n=0 p00 = ∞.

(n) P∞ (2n)
Proof. We have ∞
P
n=0 p00 = n=0 p00 because we can only return to 0 in an even number of
steps. Now    2n
(2n) 2n n n (2n)! 1
p00 = p q =
n (n!)2 2
and by Stirling’s formula
√  2n
(2n) 4πn(2n/e)2n 1 1
p00 ∼ =√
2πn(n/e)2n 2 πn
(2n)
so there exists N ∈ N such that p00 > √1 for all n ⩾ N . So
2 πn

∞ ∞    2n ∞
X (2n)
X 2n 1 X 1
p00 = ⩾ √ = ∞.
n=0 n=0
n 2 n=N
2 πn

This shows that 0 is recurrent, so by irreducibility all states are recurrent.

(n)
Example 2.2 (Symmetric SRW on Z2 ). Claim: ∞
P
n=0 p00 = ∞.
(2n)
Proof. Similarly this is equal to ∞
P
n=0 p00 . By component independence we have

(2n)
p00 = P(component 1 returns in 2n steps)P(component 2 returns in 2n steps)
"    #2 
2n 2 √ 2n
!2
2n 1 (2n)! 1 4πn(2n/e) 1 1
= = 2 4n
∼ 2n 4n
= ,
n 2 (n!) 2 2πn(n/e) 2 πn

(2n) 1
so there exists N ∈ N such that p00 > 2πn
for all n ⩾ N . So
∞ ∞
X (2n)
X 1
p00 ⩾ = ∞.
n=0 n=N
2πn
2 RECURRENCE AND TRANSIENCE 23

(n)
Example 2.3 (Symmetric SRW on Z3 ). Claim: ∞
P
n=0 p00 < ∞.

(2n)
Proof. Similarly this is equal to ∞
P
n=0 p00 and each component returns to 0 independently. So
"    #3 
2n 3 √ !3
(2n) 2n 1 (2n)! 1 4πn(2n/e)2n 1 1
p00 = = 2
∼ = .
n 2 (n!) 26n 2πn(n/e)2n 26n (πn)3/2

This time since we want to prove convergence we bound above: there exists N ∈ N such that
(2n)
p00 < (πn)2 3/2 for all n ⩾ N . So

∞ N −1 ∞
X (2n)
X (2n)
X 2
0⩽ p00 ⩽ p00 + < ∞.
n=0 n=0 n=N
(πn)3/2

So all states are transient.


3 BRANCHING PROCESSES 24

3 Branching Processes
In this chapter we study the long term properties of a transient stochastic process known as
the branching process. As motivation, we describe two models in epidemiology, the SIR and
SIS.

3.1 Disease models


Consider a closed population of N individuals and define

Sn = number of susceptibles at time n


In = number of infectives at time n
Rn = number of removed/recovered at time n.

We consider the random variable Xn := (Sn , In ). (Since the population is closed then Rn =
N − Sn − In so we do not need to consider Rn separately.)
The dynamics of the process are as follows:

• If a susceptible comes into contact with an infective in a given time step, they become
infected.

• At every time step, every susceptible individual avoids each infective individual with
probability p, independently of all other interactions. So

P(ith susceptible at time n avoids infection by time n + 1 | In ) = pIn .

• The recovery/removal times for infectives are i.i.d.

TI ∼ Geometric(γ), where γ is the recovery rate.

The time must be geometric for (Xn )n⩾0 to be a Markov chain.

Example 3.1 (SIS). In this model there is no recovery/removal – any recovered infectives
become susceptible. Let

An+1 = number of susceptibles at time n who avoided infection over the last time step
Bn+1 = number of infectives at time n who recovered over the last time step

so that
Sn+1 | Sn ∼ Binomial(Sn , pIn ) + Binomial(In , γ)
where Sn+1 = An+1 + Bn+1 , An+1 ∼ Binomial(Sn , pIn ), and Bn+1 ∼ Binomial(In , γ).
What are the transition probabilities? Here we only need Xn = Sn , since In is immediately
3 BRANCHING PROCESSES 25

determined. For v, w ∈ {0, . . . , N }

P(Sn+1 = v | Sn = w) = P(An+1 + Bn+1 = v | Sn = w)


Xw
= P(An+1 = k; Bn+1 = v − k | Sn = w)
k=0
w
X
= P(An+1 = k | Sn = w) · P(Bn+1 = v − k | Sn = w)
k=0
w  
N − w v−k
 
X w
= In k In w−k
(p ) (1 − p ) · γ (1 − γ)N −w−(v−k) .
k=0
k v−k

Additionally P(Sn+1 = N | Sn = N ) = 1, since if Sn = N then there are no infectives to infect


anyone so everyone remains susceptible. Hence there are two communicating classes: {N },
which is an absorbing state, and {0, . . . , N − 1}, which is transient.

Example 3.2 (SIR). In this model we have all three categories S, I, R and no return to
susceptibility: here R could represent either death (removal) or immunity (recovery). Here
Xn = (Sn , In ) and we have

Sn+1 | Sn , In ∼ Binomial(Sn , pIn )


In+1 ∼ Binomial(In , 1 − γ) + (Sn − Sn+1 )

where the first term accounts for infectives who do not recover, and the second accounts for
people who become newly infective over the last time step.
The transition probabilities are given by

P(Xn+1 = (v, x) | Xn = (w, y))


= P(Sn+1 = v | Xn = (w, y)) · P(In+1 = x | Xn = (w, y); Sn+1 = v)
   
w y
= y v y w−v
(p ) (1 − p ) · (1 − γ)x−(w−v) γ y−(x−(w−v)) · 1{y⩾x−(w−v)⩾0}
v x − (w − v)

since there are w−v new infectives, so the number of infectives who do not recover is x−(w−v).
The states {(k, 0) : k ∈ {0, . . . , N }} are all absorbing, because there are no infectives to infect
any susceptibles. All other states are transient and form single communicating classes.

Definition 3.1. The basic reproduction number R0 is the average number of infectives caused
by an infective individual during an early stage of an epidemic (typically in a large population).

When N is large we have N ≈ S0 ≫ I0 = 1. Then during the initial infective’s infectious


period we have " T #
X
R0 = E Cj
j=1

where T ∼ Geometric(γ) is the recovery time of the initial infective and Cj is the number of
people they infect in the jth time step. Now Cj ∼ Binomial(N, 1 − p), and we can approximate
3 BRANCHING PROCESSES 26

R0 in the following way, using the tower property.


" " T ## " T #
X X N (1 − p)
R0 = E E Cj T =E E[Cj | T ] ≈ E[T · N · (1 − p)] = .
j=1

j=1
γ

3.2 The branching process


Let Xn represent the number of individuals in a population at time n. At each time step, each
individual in the population gives birth to a random number of offspring which make up the
next generation. So
Xn
i.i.d.
X
Xn+1 = Zin , Zin ∼ Z
i=1
where Z is the common offspring distribution, and Zin denotes the number of offspring from the
ith individual at time n. So (Xn )n⩾0 is a Markov chain with state space N, and 0 an absorbing
state.
Define the following PGFs
G(s) := E[sZ ]
Fn (s) := E[sXn | X0 = 1].
Proposition 3.1. We have

| ◦ ·{z
Fn (s) = G · · ◦ G}(s) = G(Fn−1 (s)).
n times

Proof. We have

E[sXn+1 · 1{Xn =k} | X0 = 1]
X
Xn+1
Fn+1 (s) = E[s | X0 = 1] =
k=0

X
= P(Xn = k | X0 = 1)E[sXn+1 | Xn = k]
k=0
∞ Pk
Zjn
X
= P(Xn = k | X0 = 1)E[s j=1 | Xn = k].
k=0

By independence of offspring, this is equal to



X k
Y ∞
X
P(Xn = k | X0 = 1) E[sZ ] = P(Xn = k | X0 = 1)(G(s))k = Fn (G(s)). (7)
k=0 j=1 k=0

Now
F0 (s) = E[sX0 | X0 = 1] = s
so (7) implies that
F1 (s) = F0 (G(s)) = G(s)
and repeating this n − 1 more times gives
Fn (s) = Fn−1 (G(s)) = · · · = G
| ◦ ·{z
· · ◦ G}(s).
n times
3 BRANCHING PROCESSES 27

Proposition 3.2 (Mean of a branching process). Let µ = E[Z]. Then

E[Xn | X0 = 1] = µn .
Proof. By the tower property,

E[Xn | X0 = 1] = E[E[Xn | Xn−1 ; X0 = 1] | X0 = 1]


" " Xn−1 # #
X
=E E Zjn−1 Xn−1 ; X0 = 1 X0 = 1


j=1
" Xn−1 #
X
=E E[Zjn−1 | Xn−1 ; X0 = 1] X0 = 1


j=1

= E[Xn−1 E[Z] | X0 = 1]
= µE[Xn−1 | X0 = 1].

Repeating this n − 1 more times gives

E[Xn | X0 = 1] = µn E[X0 | X0 = 1] = µn .

3.3 The extinction probability


Extinction is the event {Xn = 0 for some n}. By independence of offspring
k
Y
P(Xn = 0 for some n | X0 = k) = P(Xn = 0 for some n | X0 = 1)
i=1

= (P(Xn = 0 for some n | X0 = 1))k

so we only need to consider the extinction probability for one individual’s descendants. Now


!
[
P(Xn = 0 for some n | X0 = 1) = P {Xm+j = 0 for all j ⩾ 0} X0 = 1

m=1

= lim P(Xm+j = 0 for all j ⩾ 0 | X0 = 1)
m→∞
= lim P(Xm = 0 | X0 = 1)
m→∞
= lim Fm (0)
m→∞
= lim G(Fm−1 (0)).
m→∞

Proposition 3.3. Let


xn = P(Xn = 0 | X0 = 1) = Fn (0).
Then x1 = G(0), and xn+1 = G(xn ) for n ⩾ 2.
Proof. We have
x1 = P(X1 = 0 | X0 = 1) = P(Z = 0) = G(0),
and the iterative formula follows from Proposition 3.1.
3 BRANCHING PROCESSES 28

We can now determine the extinction probability, but first we cover two side cases:
• If G(0) = P(Z = 0) = 0, then xn = 0 for all n, so the population cannot die out and the
extinction probability is 0.

• If G(0) = P(Z = 0) = 1, then every individual has no offspring, so there is immediate


extinction at timestep 1 so the extinction probability is 1.
Theorem 3.1 (Extinction probability). The extinction probability is the minimal non-
negative root of
α = G(α).

Proof. We have proved the cases for G(0) = 0 or 1 above; now assume G(0) ∈ (0, 1). Since

X
G(s) = P(Z = 0) + sk P(Z = k)
k=1

then ∞
X
G′ (s) = ksk−1 P(Z = k) > 0
k=1

so G is strictly increasing on (0, 1). Now, let x0 = 0 (consistent with the definition) then
x1 = G(0) > 0 = x0 , so by Proposition 3.3

G(x1 ) > G(x0 ) ⇔ x2 > x1

and iterating repeatedly shows that the sequence {xn }n⩾0 is strictly increasing, but it is also
bounded above by 1, so it converges to some limit α. By Proposition 3.1 and continuity of G
we have
α = lim xn = lim G(xn−1 ) = G(α).
n→∞ n→∞

Now suppose β = G(β) where β is also non-negative. Then β ⩾ 0 = x0 , so

G(β) ⩾ G(x0 ) ⇔ β ⩾ G(x0 ) = x1

and iterating gives β ⩾ xn for all n ⩾ 0, so

β ⩾ lim xn = α
n→∞

and hence α is the minimal root.


Example 3.3. Let Z ∼ Geometric(p) which is parametrised by
(
(1 − p)i p if i ∈ N,
P(Z = i) =
0 otherwise.
Then ∞
X p
G(s) = E[sZ ] = sk (1 − p)k p = .
k=0
1 − s(1 − p)
1. What is the extinction probability if X0 = 1?
3 BRANCHING PROCESSES 29

2. What is the extinction probability if X0 = M ?

Solution:

1. We have
p
α = G(α) ⇔ α=
1 − α(1 − p)
⇔ p = α − α2 (1 − p)
α p
⇔ 0 = α2 − +
1−p 1−p
p
⇔ α = 1 or ,
1−p
so we choose the minimal root
(
1 if p ⩾ 12 ,
α= p
1−p
if p < 12 .

2. This is just the previous answer raised to the power of M , by independence of offspring.

Remark. Observe that 1 is always a solution to the equation α = G(α).


Solving the extinction probability is not always easy. We often do not care about the exact
probabilities, but just whether there is any chance that the process will become extinct. For
this we use the following theorem.

Theorem 3.2 (Mean criterion for extinction). Suppose that G(0) > 0, and µ := E[Z] =
G′ (1), with Z finite almost surely. Then

(i) if µ ⩽ 1, extinction is certain;

(ii) if µ > 1, extinction is not certain.

Remark. Since the branching process is a Markov chain, then certain extinction means P1 (T0 <
∞) = 1. Not certain extinction means P1 (T0 < ∞) < 1.
Proof. In the proof of Theorem 3.1 we showed that G is strictly increasing, and positive. Since
P(Z < ∞) = 1, then G(1) = 1. Also

X
′′
G (s) = j(j − 1)sj−2 P(Z = j) ⩾ 0
j=2

for s ∈ [0, 1], so G is convex on [0, 1]. We have two cases:


Case 1 : if µ = G′ (1) ⩽ 1, then G does not cross the line ‘y = x’ before 1, so G(s) ̸= s for any
s ∈ (0, 1). Hence the minimal non-negative root is α = 1, so extinction is certain.
Case 2 : if µ = G′ (1) > 1, then G does cross the line ‘y = x’ before 1, so the minimal
non-negative root of α = G(α) is less than 1. In this case extinction is not certain.
3 BRANCHING PROCESSES 30

Example 3.4.

(i) If Z ∼ Geometric(p) as in Example 3.3 then


1−p
µ = E[Z] = .
p

Now µ > 1 ⇔ p < 1/2, so extinction is not certain if p < 1/2 and is certain if p ⩾ 1/2, as
shown in the previous solution.

(ii) If Z ∼ Poisson(λ), then E[Z] = λ. Extinction is certain if λ ⩽ 1 and is not certain if


λ > 1.
4 INVARIANT DISTRIBUTIONS 31

4 Invariant Distributions
In this chapter we aim to understand long-term behaviour of Markov chains in recurrent cases.

4.1 Invariant measures and distributions


Definition 4.1. A measure λ = (λi : i ∈ I) with non-negative entries is said to be invariant
for P if λP = λ.
P We say π = (πi : i ∈ I) is an invariant distribution for P if π is an invariant measure and
i∈I πi = 1.

Example 4.1. Let  


1/2 1/2 0
P = 1/2 1/2 0 .
0 0 1
The measure λ = (1, 1, 1) is an invariant for P . We can normalise this to obtain π =
(1/3, 1/3, 1/3) as an invariant distribution. We emphasise that it is not unique: for exam-
ple, (0, 0, 1) is also an invariant distribution.

Theorem 4.1. Let (Xn )n⩾0 ∼ Markov(π, P ) where π is an invariant distribution for P . Then
(Xm+n )n⩾0 ∼ Markov(π, P ) for all m ⩾ 0.

Proof. By the Markov property, (Xm+n )n⩾0 is Markov with stochastic matrix P . Then

P(Xm = i) = (πP m )i = (πP m−1 )i = · · · = πi

so the initial distribution is also π.


(n)
Theorem 4.2. Let I be finite and suppose that for some i ∈ I we have pij → πj as n → ∞
for every j ∈ I. Then π := (πj : j ∈ I) is an invariant distribution.

Proof. Firstly, π is a distribution because


(n)
X X X (n)
πj = lim pij = lim pij = 1
n→∞ n→∞
j∈I j∈I j∈I

(n)
and since pij ∈ [0, 1] for all j then so is πj . Now

(n) (n−1) (n−1)


X X X
πj = lim pij = lim pik pkj = lim pik pkj = πk pkj = (πP )j
n→∞ n→∞ n→∞
k∈I k∈I k∈I

hence π is invariant.
Example 4.2. Recall the gene mutation model from Example 1.2 where
 
1−α α
P = .
β 1−β
4 INVARIANT DISTRIBUTIONS 32

We found that
(n) β α
p11 = + (1 − α − β)n .
α+β α+β
Taking limits as n → ∞ we obtain an invariant distribution
 
β α
π= , .
α+β α+β

We could also obtain this by solving πP = π directly: we have the system

π1 (1 − α) + π2 β = π1
π1 α + π2 (1 − β = π2
π1 + π2 = 1

and the answer follows. In this case, the invariant distribution is unique.
In Example 4.1 all states are recurrent and P is not irreducible. Observe that any π =
(x, x, 1 − 2x) for x ∈ [0, 1] gives an invariant distribution. The key here is irreducibility, which
we aim to prove next.
Definition 4.2. The expected time spent in state i between consecutive visits to state k is
"T −1 #
k

1{Xn =i}
X
γik := Ek
n=0

where Tk is the first passage time of state k.


Note that since we include n = 0 in the sum, γkk is always equal to 1.
Theorem 4.3. Let P be irreducible and recurrent. Then for every k ∈ I

(i) γ k := (γik : i ∈ I) is an invariant measure for P , and

(ii) 0 < γik < ∞ for all i ∈ I.


Proof. (i): The event {n ⩽ Tk } only depends on X0 , . . . , Xn−1 , so by the Markov property at
time n − 1 we have
Pk (Xn−1 = i, Xn = j; n ⩽ Tk ) = Pk (Xn−1 = i)Pk (Xn = j; n ⩽ Tk | Xn−1 = i)
= Pk (Xn−1 = i)Pk (Xn = j | Xn−1 = i)Pk (n ⩽ Tk | Xn−1 = i)
= Pk (Xn−1 = i; n ⩽ Tk )pij . (8)
Since P is recurrent, then under Pk we have Pk (Tk < ∞) = 1 and X0 = XTk = k. So in
Definition 4.2, summing from 1 to Tk gives the same result as summing from 0 to Tk − 1. Hence
"T # "∞ # ∞
k

1{Xn =j} = E 1{Xn =j;n⩽Tk } =


X X X
k
γj = E Pk (Xn = j; n ⩽ Tk )
n=1 n=1 n=1
∞ X
X ∞
XX
= Pk (Xn = j; Xn−1 = i; n ⩽ Tk ) = Pk (Xn = j; Xn−1 = i; n ⩽ Tk ).
n=1 i∈I i∈I n=1
4 INVARIANT DISTRIBUTIONS 33

Now using (8) we can write



" ∞
#
1{Xm =i;m⩽Tk −1}
X X X X
γjk = pij Pk (Xn−1 = i; n ⩽ Tk ) = pij Ek
i∈I n=1 i∈I m=0
"T −1 #
k

1{Xm =i} =
X X X
= pij Ek pij γik = (γ k P )j
i∈I m=0 i∈I

so γ k = γ k P .
(n) (m)
(ii): Since P is irreducible then for every i ∈ I there exist m, n such that pik , pki > 0. Then
(m) (m)
X
γik = γjk pji ⩾ γkk pki > 0,
j∈I

and
Tk
1{Xn =i} < ∞
X
γik =
n=1

since Tk < ∞.
Theorem 4.4 (Uniqueness of measure). Let P be irreducible and λ an invariant measure
for P , with λk = 1 for some k. Then λ ⩾ γ k . If P is also recurrent, then λ = γ k .
Proof. For each j ∈ I, we have
X X X X
λj = λi0 pi0 j = λi0 pi0 j + pkj = λi1 pi1 i0 pi0 j + pkj + pki0 pi0 j .
i0 ∈I i0 ̸=k i0 ,i1 ̸=k i0 ̸=k

Iterating after n steps, we obtain


X X X
λj = λin pin in−1 · · · pi0 j + pkj + pki0 pi0 j + · · · + pkin−1 · · · pi0 j
i0 ,...,in ̸=k i0 ̸=k i0 ,...,in−1 ̸=k
| {z } | {z } | {z }
⩾0 Pk (X1 =j;Tk ⩾1) Pk (Xn =j;Tk ⩾n)
n
" n
#
1{Xm =j;Tk ⩾m} n→∞
X X
⩾ Pk (Xm = j; Tk ⩾ m) = Ek −→ γjk
m=1 m=1

which proves that λ ⩾ γ k .


If P is also recurrent, then γ k is invariant by Theorem 4.3. Therefore

µ := λ − γ k
(n)
is also an invariant measure. By irreducibility, for every i ∈ I there exists n with pik > 0, so
(n) (n)
X
0 = µk = µj pjk ⩾ µi pik ⩾ 0
j∈I

which implies µi = 0 for all i and hence λ = γ k .


4 INVARIANT DISTRIBUTIONS 34

Definition 4.3. A state i ∈ I is positive recurrent if it is recurrent and

mi := Ei [Ti ] < ∞.

A state which is recurrent but not positive recurrent is said to be null recurrent.
So we always expect to visit positive recurrent states in finite time.
Theorem 4.5 (Invariance equivalences). Let P be irreducible. The following are equivalent:

(i) every state is positive recurrent;

(ii) some state i is positive recurrent;

(iii) P has an invariant distribution π.

Moreover, πi = 1/mi .
While not explicitly stated, Theorem 4.4 implies that π is unique in (iii).
Proof. (i) ⇒ (ii) is obvious.
(ii) ⇒ (iii): The state i is positive recurrent so it is also recurrent. By Theorem 4.3, γ i is an
invariant measure. Then
"T −1
i −1
# " #
i X TX
1{Xk =j} = Ei 1{Xk =j}
X X X
γji = Ei
j∈I j∈I k=0 j∈I k=0
"T −1 #
Xi

= Ei 1 = Ei [Ti ] = mi < ∞
k=0

so γ i can be normalised to obtain an invariant distribution π = γ i /mi .


(n)
(iii) ⇒ (i): Since P is irreducible for any k ∈ I there exists n such that pik > 0, so
(n)
X
πk = πj pjk > 0.
j∈I

Set λi = πi /πk . Then λ is an invariant measure with λk = 1, so by Theorem 4.4 we have λ ⩾ γ k


and hence X πi
X 1
mk = γik ⩽ = (9)
i∈I i∈I
πk πk
which is finite, so k is positive recurrent. Since k was arbitrary then every state is positive
recurrent.
For the final statement, since positive recurrence implies recurrence then λ = γ k by Theorem 4.4,
so (9) becomes equality, which implies mk = 1/πk .
Remark. If we have irreducibility and πP = π can be solved for a distribution π, then all states
are positive recurrent and the expected return time to a state i can be computed easily by
mi = 1/πi .
Furthermore, if P is irreducible and recurrent but π cannot be normalised, then all states
must be null recurrent.
4 INVARIANT DISTRIBUTIONS 35

Example 4.3 (Finite state space). Let


 
1/2 1/2 0
P = 1/3 1/3 1/3 .
0 1/2 1/2

This is irreducible. Solving πP = π for a distribution gives the following system


1 1
π 1 + π2 = π1
2 3
1 1 1
π1 + π 2 + π3 = π2
2 3 2
1 1
π 2 + π3 = π3
3 2
π1 + π 2 + π3 =1

from which we obtain π = (2/7, 3/7, 2/7). This implies all states are positive recurrent, hence
π is unique.
Note that the system πP = π will always include a redundant equation: the distribution
condition is needed to determine π uniquely.

Example 4.4 (SRW on Z). Assume p ∈ (0, 1), so that we have irreducibility, and p ̸= q.
If there was an invariant distribution then all states would be positive recurrent and therefore
recurrent, which is a contradiction. By Proposition 2.1 the SRW is transient. We can confirm
this by solving for an invariant measure: if λP = λ then for all i ∈ Z
1 p
λi = pλi−1 + qλi+1 ⇒ λi+1 − λi + λi−1 = 0
q q
from which we obtain  i
p
λi = A + B .
q
So λ is an invariant measure for all A, B ⩾ 0 but it cannot be normalised, so there is no
invariant distribution.
Now consider the symmetric case, p = q = 1/2. Solving λP = λ gives

λi = A + iB.

For this to be a measure we need B = 0, so λi = A gives an invariant measure for all A ⩾


0. Assume A ̸= 0 and divide by A to obtain an invariant measure (1 : i ∈ Z). Since the
chain is irreducible, then by Theorem 4.4 this is the unique measure with an entry equal to 1.
Furthermore a symmetric SRW on Z is recurrent so λ = γ k .
This means in between consecutive visits to a state k, even though all states are recurrent,
we only expect to visit each state once! This also cannot be normalised, so every state is null
recurrent.
Example 4.5 (Success run chain). Consider a basic model for a length of an unbeaten run
for a sports team. Assume the team cannot draw, only win or lose, and the probability of
4 INVARIANT DISTRIBUTIONS 36

winning the (i + 1)th game having won the previous ith game is pi . Then

p i
 if j = i + 1,
pij = qi := 1 − pi if j = 0,

0 otherwise.

So  
q0 p0 0 0 0 · · ·
q1 0 p1 0 0 · · ·
P = q 0 0 p .
 
 2 2 0 · · ·
.. .. ..

..
. . . .
This is irreducible. If πP = π then

X
π0 = πk q k ,
k=0
k−1
!
Y
πk = πk−1 pk−1 = · · · = π0 pj .
j=0

So the sum of the entries is



" ∞ k−1
#
X X Y
πk = π0 1 + pj .
k=0 k=1 j=0

Whether we can normalise π (i.e. set the sum to 1) depends on the pj .


Case 1 : If pi = p for all i, then set
" ∞
#
X p
1 = π0 1 + p k = π0 +
k=1
1−p

so π0 = 1 − p and πk = (1 − p)pk , a geometric distribution.


Case 2 : If ∞
Q P∞ Qk−1
j=1 pj > 0, then k=1 j=0 = ∞ so there is no invariant distribution.

Case 3 : If ∞
P Qk−1 P∞ Qk−1
k=1 j=0 pj < ∞ then we can set C = k=1 j=0 pj , so that

k−1
1 1 Y
π0 = , πk = pj .
1+C 1 + C j=0

4.2 Convergence to equilibrium


If we start in an invariant distribution, then we stay in the invariant distribution. Now we
consider what happens in the long run if we start in an arbitrary distribution.
We do not always converge to an invariant distribution:
4 INVARIANT DISTRIBUTIONS 37

Example 4.6. Consider  


0 1
P =
1 0
on the state space I = {1, 2}. We have a unique invariant distribution π = (1/2, 1/2). But if
the initial distribution is δ1 = (1, 0), then
(
1 if n is even,
Pδ1 (Xn = 1) =
0 if n is odd

so limn→∞ Pδ1 (Xn = 1) does not exist.

Definition 4.4. The periodicity of a state i ∈ I is


(n)
d := gcd{n > 0 : pii > 0}.

Definition 4.5. A state i ∈ I is aperiodic if it has periodicity 1.


(n)
Proposition 4.1. A state i ∈ I is aperiodic if and only if there exists N ∈ N with pii > 0 for
all n ⩾ N .
Proof. The reverse implication is an exercise. The forward implication is difficult and non-
examinable.
Lemma 4.1. Suppose P is irreducible and i is aperiodic. Then for every j, k ∈ I there exists
(n)
N ∈ N such that pjk > 0 for all n ⩾ N .

That is, aperiodicity is a class property.


(r) (s)
Proof. Since P is irreducible there exist r, s with pji , pik > 0. Since i is aperiodic there exists
(n)
N such that pii > 0 for all n > N . So for sufficiently large n, the Chapman-Kolmogorov
equations give
(r+n+s) (r) (n) (s)
pjk = pji pii pik > 0.

Theorem 4.6 (Convergence to equilibrium). Let P be irreducible and aperiodic with in-
variant distribution π. Let λ be any distribution. If (Xn )n⩾0 ∼ Markov(λ, P ) then

P(Xn = j) → πj as n → ∞.

In particular
(n)
pij → πj as n → ∞
for all i, j ∈ I.
Proof. Non-examinable.
Example 4.7. Let Xn be the weather type on the nth day:
(
1 if it is dry,
Xn =
2 if it is wet.
4 INVARIANT DISTRIBUTIONS 38

Suppose (Xn )n⩾0 is Markov(λ, P ) where


 
2/3 1/3
P = .
1/5 4/5

Solving πP = π gives π = (3/8, 5/8).


Suppose it is wet on the 0th day. By Theorem 4.6 the probability that it will be wet on a
day in the long run is 5/8.

4.3 Time reversal


The Markov property is very symmetrical with reverse time: just swap the meaning of ‘past’
and ‘future’. However, convergence to equilibrium generally does not provide a symmetrical
notion when starting in an arbitrary distribution. In the case where we start with an invariant
distribution, we have a sense of stability in the chain in both directions.

Theorem 4.7. Let P be irreducible with invariant distribution π. Suppose that (Xn )0⩽n⩽N ∼
Markov(π, P ) and set Yn = XN −n . Then (Yn )0⩽n⩽N ∼ Markov(π, Pb), where Pb = (p̂ij : i, j ∈ I)
is given by
πj p̂ji = πi pij ,
and Pb is also irreducible with invariant distribution π.
Proof. By irreducibility of P , we have πi > 0 for all i ∈ I.
First, we show that Pb is a stochastic matrix:
X 1 X
p̂ji = πi pij
i∈I
πj i∈I

and since πP = π then the RHS sum is equal to πj , so the above is equal to 1.
Secondly, we show that π is invariant for Pb:
X X X
(π Pb)i = πj p̂ji = πi pij = πi pij = πi .
j∈I j∈I j∈I

Thirdly, we show that Y is Markov(π, Pb):

P(Y0 = i0 , . . . , YN = iN ) = P(X0 = iN , . . . , XN = i0 ) = πiN piN iN −1 piN −1 iN −2 · · · pi1 i0 .

Using that π is invariant for Pb we can write

P(Y0 = i0 , . . . , YN = iN ) = πiN −1 p̂iN −1 iN piN −1 iN −2 · · · pi1 i0

and applying this repeatedly gives

P(Y0 = i0 , . . . , YN = iN ) = πi0 p̂i0 i1 · · · p̂iN −1 iN .


4 INVARIANT DISTRIBUTIONS 39

Finally, we show that Pb is irreducible. Let i, j ∈ I. By irreducibility of P , there exist


i0 , . . . , in with i0 = i and in = j such that pi0 i1 · · · pin−1 in > 0, so
πin−1 πi πi
p̂in in−1 · · · p̂i1 i0 = pin−1 in pin−2 in−1 n−2 · · · pi0 i1 0
π in πin−1 πi 1
π i0
= pi i · · · pin−1 in
πi n 0 1
>0

and since i, j are arbitrary then Pb is irreducible.

Definition 4.6. The chain (Yn )0⩽n⩽N defined above is called the time reversal of (Xn )0⩽n⩽N .

Definition 4.7. A stochastic matrix P and measure λ are said to be in detailed balance if

λi pij = λj pji for all i, j ∈ I.

Lemma 4.2. If P and λ are in detailed balance, then λ is invariant for P .


Proof. For every j ∈ I, X X
(λP )j = λi pij = λj pji = λj .
i∈I i∈I

Definition 4.8. Suppose (Xn )n⩾0 is Markov(λ, P ) where P is irreducible. We say that (Xn )n⩾0
is reversible if for every N ⩾ 1, the chain (XN −n )0⩽n⩽N is also Markov(λ, P ).

Theorem 4.8. Let (Xn )n⩾0 ∼ Markov(λ, P ) where P is irreducible. The following are equiva-
lent:

(i) (Xn )n⩾0 is reversible;

(ii) P and λ are in detailed balance.

Proof. If (Xn )n⩾0 is reversible, then

λj pji
pij = p̂ij = for all i, j ∈ I
λi
so P and λ are in detailed balance.
Conversely, if P and λ are in detailed balance then by Lemma 4.2 λ is invariant for P , and
by Theorem 4.7,
λj pji λi pij
p̂ij = = = pij for all i, j ∈ I.
λi λi
Example 4.8 (A non-reversible chain). Let
 
0 2/3 1/3
P = 1/3 0 2/3 .
2/3 1/3 0
4 INVARIANT DISTRIBUTIONS 40

Solving πP = π gives π = (1/3, 1/3, 1/3). By Theorem 4.7, we find


π2
p̂12 = p21 = p21 ̸= p12 ,
π1
so the chain is not reversible.
Detailed balance can often be used to find invariant distributions more easily than solving
directly.

Example 4.9 (Reflected gamblers’ ruin). Consider the gamblers’ ruin problem, but the
gambler has probability p of recovering once going broke, and q of remaining broke. Assume
p ∈ (0, 1/2).

p p p
q 0 1 2 ···
q q q

Solving for a measure λ in detailed balance gives, for all i ⩾ 0,


 i+1
p p
λi p = λi+1 q ⇒ λi+1 = λi = · · · = λ0 .
q q

Since p < 1/2, we can normalise this, so we set


∞ ∞  k
X X p λ0
1= λk = λ0 = .
k=0 k=0
q 1 − p/q

Therefore, λ0 = 1 − p/q and an invariant distribution is given by


   k
p p
πk = 1 − .
q q

Example 4.10 (Random walk on a graph). Let us represent a friend network via a finite,
connected graph. We have 8 people represented by vertices 1–8. We put an edge between i and
j if they have each other in their contacts.

1 3

8 4

7 5

6
4 INVARIANT DISTRIBUTIONS 41

Person 1 starts a rumour by telling one person in their contacts, chosen at random with
uniform probability, that every proof in ST202 is examinable. The person who receives this
does the same to their contacts, and so on. What is the long-run probability πi that at a given
time step, person i is the latest person to receive the rumour?
Let Vi represent how many people person i has in their contacts. Then
(
1
if (i, j) is an edge,
pij = Vi
0 otherwise.

We can solve detailed balance to find an invariant distribution π. By symmetry, we have pij ̸= 0
if and only if pji ̸= 0, so we only need to consider cases where i and j have each other in their
contacts:
1 1
λi pij = λj pji ⇒ λi = λj .
Vi Vj
A solution to this is given by λi = Vi . Since this distribution is finite, we can normalise it to
obtain
Vi
πi = P .
j∈I Vj

Since this graph is connected, the chain is irreducible, so π is unique.

4.4 The Ergodic Theorem


The Ergodic Theorem for Markov chains is concerned with the long run proportion of time
spent in each state, whenever we begin with an arbitrary distribution. One case is given by the
following version of the Strong Law of Large Numbers:

Theorem 4.9 (Strong Law of Large Numbers). Let Y1 , Y2 , . . . be a sequence of i.i.d. non-
negative random variables with E[Yi ] = µ < ∞. Then
n
!
1X
P Yj → µ as n → ∞ = 1.
n j=1

However, in general we do not have independence, so we instead consider the number of


visits to a state before time n:
n−1
1{Xk =i} .
X
Vi (n) :=
k=0

Thus Vi (n)/n is the proportion of time spent in state i before time n.


4 INVARIANT DISTRIBUTIONS 42

Theorem 4.10 (Ergodic Theorem). Let P be irreducible and λ be any distribution. If


(Xn )n⩾0 ∼ Markov(λ, P ), then
 
Vi (n) 1
P → as n → ∞ = 1.
n mi

Moreover, if (Xn )n⩾0 is positive recurrent, then for any f : I → R such that Eπ (|f (X)|) < ∞,
we have !
n−1
1X
P f (Xk ) → f as n → ∞ = 1
n k=0
where X
f := Eπ (f (X)) = πi f (i).
i∈I

Proof. Non-examinable.
Example 4.11. A student on ST202 is known to buy large quantities of milk at Rootes Grocery
Store, causing trouble for their milk supplies. Let Xn denote their attendance to lectures on
the nth day, with the course assumed to be life-long, where
(
0 if they do not attend,
Xn =
1 if they attend

and  
1/5 4/5
P = .
1/3 2/3
If they attend, milk is restocked at Rootes with probability 9/10. Otherwise it is restocked
with probability 3/10. In the long run, what is the probability that milk needs to be restocked
on a given day?
Define a function (
3
10
if X = 0,
f (X) = 9
10
if X = 1
which represents the probability that milk is restocked. By solving πP = π, we obtain an
invariant distribution  
5 12
π= , .
17 17
Since P is irreducible, then the existence of an invariant distribution implies positive recurrence
by Theorem 4.5.
So we can apply the Ergodic Theorem to obtain
n−1
1X n→∞ 5 12 123
f (Xi ) −→ Eπ (f (X)) = f (0) + f (1) = ≈ 0.72.
n i=0 17 17 170
4 INVARIANT DISTRIBUTIONS 43

4.5 Markov Chain Monte Carlo


Throughout this chapter we have started with a given Markov chain and deduced an invariant
distribution from it. Markov Chain Monte Carlo reverse engineers this process: we begin with
a desired distribution which cannot be sampled from, and construct a Markov chain whose
invariant distribution is as desired. Running this Markov chain for long enough will eventually
provide samples according to the desired distribution. Moreover, by the Ergodic Theorem the
long run averages converge.
Let us start with a distribution π = (πi : i ∈ Z), which we would like to draw samples from.
For simplicity assume πi > 0 for each i.
We then have a proposal distribution q = (qi : i ∈ Z), usually centred around the current
state. This needs to be something we can draw from. For example,

1
 2 if i = 1,

qi = 21 if i = −1,

0 otherwise.

We set a starting value j ∈ Z and set X0 = j. The following algorithm is known as the
Metropolis-Hastings algorithm. For each n ⩾ 1:

1. Generate z ∼ q.

2. Set Q = Xn−1 + z.

3. Calculate  
πQ q−z
αXn−1 ,Q := min 1, .
πXn−1 qz

4. With probability αXn−1 ,Q , set Xn = Q, otherwise set Xn = Xn−1 .

The process essentially says that the next step of the chain is set to be the proposed variable
with probability αXn−1 ,Q . If the desired distribution is higher at the proposed step than at the
previous step, then the ratio will be greater than 1 so the probability of moving to the proposed
step is 1. If the desired distribution is lower at the proposed step, we are less likely to move to
it: the probability in this case is less than 1. The aim is that repeating the algorithm enough
times generates a Markov chain with convergence to π, assuming irreducibility.
The Ergodic Theorem allows us to establish properties of π, such as cdf and expectation.
(For more detail, see Problem Sheet 5, Question 3.)
A SUMMARY 44

A Summary
A.1 Basics of Markov chains
Definition (Stochastic matrix). A matrix P = (pij : i, j ∈ I) is a stochastic matrix if
(i) 0 ⩽ pij ⩽ 1 ∀ i, j ∈ I
P
(ii) ∀ i ∈ I, we have j∈I pij = 1

Definition (Markov chain). Let λ be a distribution and P be a stochastic matrix. We say


that (Xn )n⩾0 is a Markov chain with initial distribution λ and stochastic matrix P if
(i) X0 ∼ λ

(ii) P(Xn+1 = in+1 | X0 = i0 , . . . , Xn = in ) = pin in+1 ∀ i0 , . . . , in+1 ∈ I


We denote this by (Xn )n⩾0 ∼ Markov(λ, P ).

Theorem. A discrete-time stochastic process (Xn )n⩾0 is Markov(λ, P ) if and only if

P(X0 = i0 , . . . , XN = iN ) = λi0 pi0 i1 · · · piN −1 iN ∀ i0 , . . . , iN ∈ I.

Define the unit mass distribution: δi := (δij : j ∈ I), where


(
1 if i = j,
δij =
0 otherwise.

Theorem (Markov property). Let (Xn )n⩾0 ∼ Markov(λ, P ). Then, conditional on the event
{Xm = i}, we have (Xm+n )n⩾0 ∼ Markov(δi , P ) and is independent of the random variables
X0 , . . . , X m .

(n)
Theorem (n-step probabilities). Let (Xn )n⩾0 ∼ Markov(λ, P ) and denote by pij =: (P n )ij
the (i, j)th entry of P n . Then
(n)
(i) Pi (Xn = j) = pij

(ii) Pλ (Xn = j) := P(Xn = j | X0 ∼ λ) = (λP n )j

(iii) Pi (Xn = j) = P(Xm+n = j | Xm = i)


In practice, these are computed using the difference equation method (Example 1.3), or
the eigenvalue method (Example 1.4).

Definition (Communicating states). Let i, j ∈ I and P be a stochastic matrix. We say


(n)
that i talks to j, denoted by i → j, if ∃ n ⩾ 0 such that pij > 0.
(n)
We say that i communicates with j, denoted by i ↔ j, if ∃ m, n ⩾ 0 such that pij > 0 and
(m)
pji > 0.
A SUMMARY 45

Definition (Communicating class). A communicating class C ⊆ I is a maximal collection


of states such that ∀ i, j ∈ C we have i ↔ j and ∀ i ∈ C, ∄ k ∈ C c such that i ↔ k.

Definition (Irreducible, closed class, absorbing state).


(i) A Markov chain is irreducible if i ↔ j ∀ i, j ∈ I.

(ii) A class C is closed if ∀ i ∈ C, i → j ⇒ j ∈ C.

(iii) A state i is absorbing if {i} is a closed class.

Definition (Simple random walk). Let p ∈ [0, 1]. A simple random walk (SRW) on Z is a
Markov chain with state space I = Z, where ∀ i, j ∈ I,

p
 if j = i + 1,
pij = q := 1 − p if j = i − 1,

0 otherwise.

If p = q = 1/2, we call the process a simple symmetric random walk.

Definition (Stopping time). Let (Xn )n⩾0 ∼ Markov(λ, P ). A random variable T : Ω →


N ∪ {∞} is a stopping time for (Xn )n⩾0 if for every n ⩾ 0 the event {T = n} only depends on
X0 , . . . , X n .

Definition (Hitting time). Let (Xn )n⩾0 ∼ Markov(λ, P ) and A ⊆ I. The hitting time of A,
denoted H A , is a stopping time given by

H A (ω) := inf{n ⩾ 0 : Xn (ω) ∈ A}

where we take inf ∅ = ∞.

Definition (Hitting probability, expected hitting time). Let (Xn )n⩾0 ∼ Markov(λ, P )
and A ⊆ I. The hitting probability of A from a state i ∈ I is given by

hA A
i := Pi (H < ∞)

and the expected hitting time of A from i is given by


X
kiA := Ei [H A ] = nPi (H A = n)
n⩽∞

Theorem (Hitting probabilities). The vector of hitting probabilities hA = (hAi : i ∈ I) for


A ⊆ I is the minimal non-negative solution of the system of linear equations

hA
i = X
1 if i ∈ A,
A
hA
i = pij hj if i ∈
/ A.
j∈I
A SUMMARY 46

Example 1.7 (Gambler’s ruin).

p p
1 0 1 2 ···
q q q

Case 1: q > p. We have h0i = 1 ∀ i ⩾ 0.


Case 2: q < p. We have h00 = 1, and h0i = (q/p)i ∀ i ⩾ 1.
Case 3: q = p = 1/2. We have h0i = 1 ∀ i ⩾ 0.

Theorem (Expected hitting times). The vector of expected hitting times k A = (kiA : i ∈ I)
for A ⊆ I is the minimal non-negative solution to the system of linear equations

kiA = 0 X if i ∈ A,
A
kiA = 1 + pij kj if i ∈
/ A.
j ∈A
/

Theorem (Strong Markov property). Let (Xn )n⩾0 ∼ Markov(λ, P ) and let T be a stopping
time for (Xn )n⩾0 . Then, conditional on {XT = i} and {T < ∞}, we have (XT +n )n⩾0 ∼
Markov(δi , P ) and is independent of X0 , . . . , XT .

A.2 Recurrence and transience


Definition (Recurrent and transient states). A state i ∈ I is recurrent if

P(Xn = i for infinitely many n) = 1.

It is transient if
P(Xn = i for infinitely many n) = 0.

Definition (Passage times, excursion times). The first passage time of state i is the
stopping time
Ti (ω) := inf{n ⩾ 1 : Xn (ω) = i}
and inductively we define the rth passage time for r = 1, 2, 3, . . . by the stopping time
n o
(r) (r−1)
Ti (ω) := inf n > Ti : Xn (ω) = i ,

(0)
and Ti (ω) = 0.
The rth excursion time is defined by
(
(r) (r−1) (r−1)
(r) Ti − Ti if Ti < ∞,
Si :=
∞ otherwise.
A SUMMARY 47

(r−1) (r)
Lemma. For r ⩾ 2, conditional on {Ti < ∞}, the time period Si is independent of all
(r−1)
{Xm : m ⩽ Ti } and
 
(r) (r−1)
P Si = n Ti < ∞ = Pi (Ti = n).

Lemma. Let ∞
1{Xn =i} ,
X
Vi := fi := Pi (Ti < ∞).
n=0
For r = 0, 1, 2, . . . we have
Pi (Vi > r) = (fi )r ,
i.e. Vi ∼ Geometric(1 − fi ).

Theorem (Characterisation of recurrent and transient states).


(n)
(i) If Pi (Ti < ∞) = 1, then i is recurrent and ∞
P
n=1 pii = ∞.

(n)
(ii) If Pi (Ti < ∞) < 1, then i is transient and ∞
P
n=1 pii < ∞.

Theorem (Recurrence and transience are class properties). Let C be a communicating


class. Then either all states in C are transient, or all states in C are recurrent.

Theorem. Every recurrent class is closed.

Theorem. Every finite closed class is recurrent.

Theorem. Suppose that P is irreducible and recurrent. Then ∀ j ∈ I we have P(Tj < ∞) = 1.

Definition (Component-independent simple random walk). A component-independent


simple random walk (CISRW) on Zd for d ∈ N is defined by
Xn+1 = Xn + Zn+1
1 d

where Zn+1 = Zn+1 , . . . , Zn+1 , with
(
j 1 with probability pj
Zm =
−1 with probability qj := 1 − pj
j
and the Zm are independent for all j, m.

Proposition. Let (Xn )n⩾0 be a CISRW on Zd . If ∃ j ∈ {1, . . . , d} such that pj ̸= 1/2, then
(Xn )n⩾0 is transient (every i ∈ Zd is transient).

For a CISRW with all components symmetric (pj = qj = 1/2 ∀ j ∈ {1, . . . , d}) then
• all states are recurrent for d ⩽ 2
• all states are transient for d ⩾ 3
A SUMMARY 48

A.3 Branching processes


Epidemiology – SIR/SIS. Consider a closed population of N individuals and define

Sn = number of susceptibles at time n


In = number of infectives at time n
Rn = number of removed/recovered at time n.

We consider the random variable Xn := (Sn , In ). We have the following dynamics:


• If a susceptible comes into contact with an infective in a given time step, they become
infected.
• At every time step, every susceptible individual avoids each infective individual with
probability p, independently of all other interactions. So

P(ith susceptible at time n avoids infection by time n + 1 | In ) = pIn .

• The recovery/removal times for infectives are i.i.d.

TI ∼ Geometric(γ), where γ is the recovery rate.

The time must be geometric for (Xn )n⩾0 to be a Markov chain.

Example 3.1 (SIS). No recovery/removal. Two communicating classes {0, . . . , N − 1} and


{N }.

Example 3.2 (SIR). All components included. The states {(k, 0) : k ∈ {0, . . . , N }} are all
absorbing. All other states are transient and form single communicating classes.

Definition (Basic reproduction number). The basic reproduction number R0 is the average
number of infectives caused by an infective individual during an early stage of an epidemic
(typically in a large population).
When N is large we have the approximation
N (1 − p)
R0 ≈ .
γ

The branching process. Let Xn represent the number of individuals in a population at time
n. At each time step, each individual in the population gives birth to a random number of
offspring which make up the next generation. So
Xn
i.i.d.
X
Xn+1 = Zin , Zin ∼ Z
i=1

where Z is the common offspring distribution, and Zin denotes the number of offspring from the
ith individual at time n. So (Xn )n⩾0 is a Markov chain with state space N, and 0 an absorbing
state.
A SUMMARY 49

Define the following PGFs

G(s) := E[sZ ]
Fn (s) := E[sXn | X0 = 1].

Proposition. We have
Fn (s) = |G ◦ ·{z
· · ◦ G}(s) = G(Fn−1 (s)).
n times

Proposition (Mean of a branching process). Let µ = E[Z]. Then

E[Xn | X0 = 1] = µn .

Proposition. Let
xn = P(Xn = 0 | X0 = 1) = Fn (0).
Then x1 = G(0), and xn+1 = G(xn ) for n ⩾ 2.

Theorem (Extinction probability). The extinction probability is the minimal non-negative


root of
α = G(α).

Theorem (Mean criterion for extinction). Suppose that G(0) > 0, and µ := E[Z] = G′ (1),
with Z finite almost surely. Then
(i) if µ ⩽ 1, extinction is certain;
(ii) if µ > 1, extinction is not certain.

A.4 Invariant distributions


Definition (Invariant measure, invariant distribution). A measure λ = (λi : i ∈ I) with
non-negative entries is said to be invariant for P if λP = λ.
P We say π = (πi : i ∈ I) is an invariant distribution for P if π is an invariant measure and
i∈I πi = 1.

Theorem. Let (Xn )n⩾0 ∼ Markov(π, P ) where π is an invariant distribution for P . Then
(Xm+n )n⩾0 ∼ Markov(π, P ) ∀ m ⩾ 0.

(n)
Theorem. Let I be finite and suppose that ∃ i ∈ I such that pij → πj as n → ∞ ∀ j ∈ I.
Then π := (πj : j ∈ I) is an invariant distribution.

Definition (Expected time in a state). The expected time spent in state i between consec-
utive visits to state k is "T −1 #
k

1{Xn =i}
X
γik := Ek
n=0

where Tk is the first passage time of state k.


A SUMMARY 50

Theorem. Let P be irreducible and recurrent. Then ∀ k ∈ I,

(i) γkk = 1

(ii) γ k := (γik : i ∈ I) is an invariant measure for P

(iii) 0 < γik < ∞ ∀ i ∈ I

Theorem (Uniqueness of measure). Let P be irreducible and λ an invariant measure for


P , with λk = 1 for some k. Then λ ⩾ γ k . If P is also recurrent, then λ = γ k .

Definition (Positive recurrence). A state i ∈ I is positive recurrent if it is recurrent and

mi := Ei [Ti ] < ∞.

A state which is recurrent but not positive recurrent is said to be null recurrent.

Theorem (Invariance equivalences). Let P be irreducible. The following are equivalent:

(i) every state is positive recurrent

(ii) some state i is positive recurrent

(iii) P has an invariant distribution π

Moreover, πi = 1/mi .

Definition (Periodicity). The periodicity of a state i ∈ I is


(n)
d := gcd{n > 0 : pii > 0}.

Definition (Aperiodic). A state i ∈ I is aperiodic if it has periodicity 1.

(n)
Proposition. A state i ∈ I is aperiodic if and only if ∃ N ∈ N with pii > 0 ∀ n ⩾ N .

Lemma (Aperiodicity is a class property). Suppose P is irreducible and i is aperiodic.


(n)
Then ∀ j, k ∈ I, ∃ N ∈ N such that pjk > 0 ∀ n ⩾ N .

Theorem (Convergence to equilibrium). Let P be irreducible and aperiodic with invariant


distribution π. Let λ be any distribution. If (Xn )n⩾0 ∼ Markov(λ, P ) then

P(Xn = j) → πj as n → ∞.

In particular
(n)
pij → πj as n → ∞ ∀ i, j ∈ I.
A SUMMARY 51

Theorem. Let P be irreducible with invariant distribution π. Suppose that (Xn )0⩽n⩽N ∼
Markov(π, P ) and set Yn = XN −n . Then (Yn )0⩽n⩽N ∼ Markov(π, Pb), where Pb = (p̂ij : i, j ∈ I)
is given by
πj p̂ji = πi pij ,
and Pb is also irreducible with invariant distribution π.

Definition (Time reversal). The chain (Yn )0⩽n⩽N defined above is called the time reversal
of (Xn )0⩽n⩽N .

Definition (Detailed balance). A stochastic matrix P and measure λ are said to be in


detailed balance if
λi pij = λj pji ∀ i, j ∈ I.

Lemma. If P and λ are in detailed balance, then λ is invariant for P .

Definition (Reversible chain). Suppose (Xn )n⩾0 is Markov(λ, P ) where P is irreducible. We


say that (Xn )n⩾0 is reversible if ∀ N ⩾ 1, the chain (XN −n )0⩽n⩽N is also Markov(λ, P ).

Theorem. Let (Xn )n⩾0 ∼ Markov(λ, P ) where P is irreducible. Then (Xn )n⩾0 is reversible if
and only if P and λ are in detailed balance.

Theorem (Ergodic Theorem). Let P be irreducible and λ be any distribution. Let


n−1
1{Xk =i}
X
Vi (n) :=
k=0

be the number of visits to state i before time n. If (Xn )n⩾0 ∼ Markov(λ, P ), then
 
Vi (n) 1
P → as n → ∞ = 1.
n mi

Moreover, if (Xn )n⩾0 is positive recurrent, then for any f : I → R such that Eπ (|f (X)|) < ∞,
we have !
n−1
1X
P f (Xk ) → f as n → ∞ = 1
n k=0
where X
f := Eπ (f (X)) = πi f (i).
i∈I

You might also like