MAST30001 Stochastic Modelling
Lecturer: Nathan Ross
Administration
LMS - announcements, grades, course documents
Lectures/Practicals
Student-staff liaison committee (SSLC) representative
Slide 1
Modelling
We develop an imitation of the system. It could be, for example,
- a small replica of a marina development,
- a set of equations describing the relations between stock prices,
- a computer simulation that reproduces a complex system (think: the paths of planets in the solar system).
We use a model
- to understand the evolution of a system,
- to understand how outputs relate to inputs, and
- to decide how to influence a system.
Slide 2
Why do we model?
We want to understand how a complex system works. Real-world
experimentation can be
- too slow,
- too expensive,
- possibly too dangerous, or
- unable to deliver insight.
The alternative is to build a physical, mathematical or
computational model that captures the essence of the system that
we are interested in (think: NASA).
Slide 3
Why a stochastic model?
We want to model such things as
- traffic in the Internet,
- stock prices and their derivatives,
- waiting times in healthcare queues,
- reliability of multicomponent systems,
- interacting populations, and
- epidemics,
where the effects of randomness cannot be ignored.
Slide 4
Good mathematical models
- capture the non-trivial behaviour of a system,
- are as simple as possible,
- replicate empirical observations,
- are tractable, that is, they can be analysed to derive the quantities of interest, and
- can be used to help make decisions.
Slide 5
Stochastic modelling
Stochastic modelling is about the study of random experiments.
For example,
- toss a coin once, toss a coin twice, toss a coin infinitely-many times,
- the lifetime of a randomly selected battery (quality control),
- the operation of a queue over the time interval [0, ∞) (service),
- the changes in the US dollar - Australian dollar exchange rate from 2006 onwards (finance),
- the positions of all iPhones that make connections to a particular telecommunications company over the course of one hour (wireless tower placement),
- the network friend structure of Facebook (ad revenue).
Slide 6
Stochastic modelling
We study a random experiment in the context of a probability
space (Ω, F, P). Here,
- the sample space Ω is the set of all possible outcomes of our random experiment,
- the class of events F is a set of subsets of Ω. We view these as events we can see or measure, and
- P is a probability measure defined on the elements of F.
Slide 7
The sample space
We need to think about the sets of possible outcomes for the
random experiments. For those discussed above, these could be
- {H, T}; {(H, H), (H, T), (T, H), (T, T)}; the set of all infinite sequences of Hs and Ts,
- [0, ∞),
- the set of piecewise-constant functions from [0, ∞) to Z+,
- the set of continuous functions from [0, ∞) to R+,
- ∪_{n≥0} {((x1, y1), ..., (xn, yn))}, giving the locations of the phones when they connected,
- the set of simple networks with number of vertices equal to the number of users: edges connect friends.
Slide 8
Review of basic notions of set theory
- A ⊂ B: A is a subset of B, or, if A occurs, then B occurs.
- Union of sets (events): at least one occurs.
  A ∪ B = {ω : ω ∈ A or ω ∈ B} = B ∪ A.
  A1 ∪ A2 ∪ ⋯ ∪ An = ∪_{i=1}^n Ai.
- Intersection of sets (events): both occur.
  A ∩ B = {ω : ω ∈ A and ω ∈ B} = B ∩ A = AB.
  A1 ∩ A2 ∩ ⋯ ∩ An = ∩_{i=1}^n Ai.
- Complement of a set/event: the event doesn't occur.
  Aᶜ = {ω : ω ∉ A}.
- ∅: the empty set or impossible event.
Slide 9
The class of events F
For discrete sample spaces, F is typically the set of all subsets.
Example: Toss a coin once, F = {∅, {H}, {T}, {H, T}}.
For continuous state spaces, the situation is more complicated:
Slide 10
The class of events F
Let S be the circle of radius 1.
We say two points on S are in the same family if you can get
from one to the other by taking steps of arclength 1 around
the circle.
Each family chooses a single member to be its head.
If X is a point chosen uniformly at random from the circle,
what is the chance that X is the head of its family?
Slide 11
The class of events F
Let
A = {X is the head of its family},
Ai = {X is i steps clockwise from the head of its family},
Bi = {X is i steps counterclockwise from the head of its family}.
By uniformity, P(A) = P(Ai) = P(Bi) for all i, BUT, by the
law of total probability,
1 = P(A) + Σ_{i=1}^∞ (P(Ai) + P(Bi)),
and no single value of P(A) can satisfy this when all the terms are equal!
The issue is that the event A is not one we can see or measure, so
it should not be included in F.
Slide 12
The class of events F
These kinds of issues are technical to resolve and are dealt with in
later probability or analysis subjects which use measure theory.
Slide 13
The probability measure P
The probability measure P on (Ω, F) is a set function on F
satisfying
P1. P(A) ≥ 0 for all A ∈ F [probabilities measure long run %s or
certainty]
P2. P(Ω) = 1 [there is a 100% chance something happens]
P3. Countable additivity: if A1, A2, ... are disjoint events in F,
then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) [think about it in terms of
frequencies]
Slide 14
How do we specify P?
The modelling process consists of
- defining the values of P(A) for some basic events A ∈ F,
- deriving P(B) for the other, unknown, more complicated events B ∈ F from the axioms above.
Example: Toss a fair coin 1000 times. Any length-1000 sequence
of Hs and Ts has chance 2^(−1000).
- What is the chance there are more than 600 Hs in the sequence?
- What is the chance that the first time the proportion of heads exceeds the proportion of tails occurs after toss 20?
Slide 15
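The first question can be answered exactly, since Python's arbitrary-precision integers make the binomial sum Σ_{k=601}^{1000} C(1000, k) / 2^1000 tractable. A small sketch (the helper name is mine):

```python
from math import comb

# Exact probability of more than `threshold` heads in n fair tosses:
# every length-n sequence has chance 2^(-n), so we count sequences.
def prob_more_than_600(n=1000, threshold=600):
    favourable = sum(comb(n, k) for k in range(threshold + 1, n + 1))
    return favourable / 2**n

p = prob_more_than_600()  # an extremely small number: 600 is about 6.3 sd above the mean 500
```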
Properties of P
- P(∅) = 0.
- P(Aᶜ) = 1 − P(A).
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Slide 16
Conditional probability
Let A, B ∈ F be events with P(B) > 0. Supposing we know that
B occurred, how likely is A given that information? That is, what
is the conditional probability P(A|B)?
For a frequency interpretation, consider the situation where we
have n trials and B has occurred n_B times. What is the relative
frequency of A in these n_B trials? The answer is
n_AB / n_B = (n_AB / n) / (n_B / n) ≈ P(A ∩ B) / P(B).
Hence, we define
P(A|B) = P(A ∩ B) / P(B).
We need a more sophisticated definition if we want to define the
conditional probability P(A|B) when P(B) = 0.
Slide 17
Example:
Tickets are drawn consecutively and without replacement from a
box of tickets numbered 1 to 10. What is the chance the second
ticket is even-numbered given the first is
- even?
- labelled 3?
Slide 18
Bayes' formula
Law of Total Probability: let B1, B2, ..., Bn be mutually exclusive
events with A ⊂ ∪_{j=1}^n Bj; then
P(A) = Σ_{j=1}^n P(A|Bj)P(Bj).
With the same assumptions as for the Law of Total Probability,
P(Bj|A) = P(Bj ∩ A) / P(A) = P(A|Bj)P(Bj) / Σ_{k=1}^n P(A|Bk)P(Bk).
Slide 19
Example:
A disease affects 1/1000 newborns and shortly after birth a baby is
screened for this disease using a cheap test that has a 2% false
positive rate (the test has no false negatives). If the baby tests
positive, what is the chance it has the disease?
Slide 20
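Plugging the slide's numbers into Bayes' formula with B1 = {disease}, B2 = {no disease} and A = {positive test}; the variable names are mine:

```python
# P(D | +) = P(+|D)P(D) / (P(+|D)P(D) + P(+|not D)P(not D))
p_disease = 1 / 1000        # prior: 1/1000 newborns affected
p_pos_given_disease = 1.0   # the test has no false negatives
p_pos_given_healthy = 0.02  # 2% false positive rate

posterior = (p_pos_given_disease * p_disease) / (
    p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
)
print(round(posterior, 4))  # 0.0477
```

Despite a positive result, the chance of disease is under 5%, because the disease is so rare that false positives dominate.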
Independent events
Events A and B are said to be independent if
P(A ∩ B) = P(A)P(B).
If P(B) ≠ 0 or P(A) ≠ 0, then this is the same as P(A|B) = P(A)
or P(B|A) = P(B), respectively.
Events A1, ..., An are independent if, for any subset {i1, ..., ik} of
{1, ..., n},
P(Ai1 ∩ ⋯ ∩ Aik) = P(Ai1) ⋯ P(Aik).
Slide 21
Random variables
A random variable (rv) on a probability space (Ω, F, P) is a
function X : Ω → R.
Usually, we want to talk about the probabilities that the values of
random variables lie in sets of the form (a, b) = {x : a < x < b}.
The smallest σ-algebra of subsets of R that contains these sets is
called the set B(R) of Borel sets, after Émile Borel (1871-1956).
The probability that X ∈ (a, b) is the probability of the subset
{ω : X(ω) ∈ (a, b)}. In order for this to make sense, we need this
set to be in F; we require this condition for all a < b, and then we
say the function X is measurable with respect to F.
So X is measurable with respect to F if {ω : X(ω) ∈ B} ∈ F for
all Borel sets B ⊂ R.
Slide 22
Distribution Functions
The function FX(t) = P(X ≤ t) = P({ω : X(ω) ∈ (−∞, t]}) that
maps R to [0, 1] is called the distribution function of the random
variable X.
Any distribution function F
F1. is non-decreasing,
F2. is such that F(x) → 0 as x → −∞ and F(x) → 1 as x → ∞,
and
F3. is right-continuous, that is, lim_{h→0+} F(t + h) = F(t) for all t.
Slide 23
Distribution Functions
We say that
- the random variable X is discrete if it can take only
countably-many values, with P(X = xi) = pi > 0 and
Σ_i pi = 1. Its distribution function FX(t) is commonly a step
function.
- the random variable X is continuous if FX(t) is absolutely
continuous, that is, if there exists a function fX(t) that maps
R to R+ such that FX(t) = ∫_{−∞}^t fX(u) du.
- A mixed random variable has some points that have positive
probability and also some continuous parts.
Slide 24
Examples of distributions
Examples of discrete random variables: binomial, Poisson,
geometric, negative binomial, discrete uniform
http://en.wikipedia.org/wiki/Category:
Discrete_distributions
Examples of continuous random variables: normal,
exponential, gamma, beta, uniform on an interval (a, b)
http://en.wikipedia.org/wiki/Category:
Continuous_distributions
Slide 25
Random Vectors
A random vector X = (X1, ..., Xd) is a measurable mapping of
(Ω, F) to R^d, that is, for each Borel set B ⊂ R^d,
{ω : X(ω) ∈ B} ∈ F.
The distribution function of a random vector is
FX(t) = P(X1 ≤ t1, ..., Xd ≤ td), t = (t1, ..., td) ∈ R^d.
It follows that
P(s1 < X1 ≤ t1, s2 < X2 ≤ t2)
= F(t1, t2) − F(s1, t2) − F(t1, s2) + F(s1, s2).
Slide 26
Independent Random Variables
The random variables X1, ..., Xd are called independent if
FX(t) = FX1(t1) ⋯ FXd(td) for all t = (t1, ..., td).
Equivalently,
- the events {X1 ∈ B1}, ..., {Xd ∈ Bd} are independent for all Borel sets B1, ..., Bd ⊂ R,
- or, in the absolutely-continuous case, fX(t) = fX1(t1) ⋯ fXd(td) for all t = (t1, ..., td).
Slide 27
Revision Exercise
For bivariate random variables (X, Y) with density functions
- f(x, y) = 2x + 2y − 4xy for 0 < x < 1, 0 < y < 1, and
- f(x, y) = 4 − 8x − 8y + 16xy for 0 < x < 1, 0 < y < 1, 0 < x + y < 1,
do the following:
- check whether f is a true density,
- find the marginal probability density functions fX(x) and fY(y), and
- find the probability density function of Y conditional on the value of X.
Slide 28
Expectation of X
For a discrete, continuous or mixed random variable X that takes
on values in the set SX, the expectation of X is
E(X) = ∫_{SX} x dFX(x).
The integral on the right hand side is a Lebesgue-Stieltjes integral.
It can be evaluated as
E(X) = Σ_i xi P(X = xi), if X is discrete, and
E(X) = ∫ x fX(x) dx, if X is absolutely continuous.
In second year, we required that the integral be absolutely
convergent. In this course we will allow the expectation to be
infinite, provided that we never get in a situation where we have
∞ − ∞.
Slide 29
Expectation of g (X )
For a measurable function g that maps SX to some other set SY,
Y = g(X) is a random variable taking values in SY and
E(Y) = E(g(X)) = ∫_{SX} g(x) dFX(x).
We can also evaluate E(Y) by calculating its distribution function
FY(y) and then using the expression
E(Y) = ∫_{SY} y dFY(y).
Slide 30
Properties of Expectation
- E(aX + bY) = aE(X) + bE(Y).
- If X ≤ Y, then E(X) ≤ E(Y).
- If X ≡ c, then E(X) = c.
Slide 31
Moments
- The kth moment of X is E(X^k).
- The kth central moment of X is E[(X − E(X))^k].
- The variance V(X) of X is the second central moment, V(X) = E(X²) − (E(X))².
- V(cX) = c²V(X).
- If X and Y have finite means and are independent, then E(XY) = E(X)E(Y).
- If X and Y are independent (or uncorrelated), then V(X ± Y) = V(X) + V(Y).
Slide 32
Conditional Probability
The conditional probability of event A given X is a random
variable (since it is a function of X). We write it as P(A|X).
- For a real number x, if P(X = x) > 0, then
  P(A|X = x) = P(A ∩ {X = x}) / P({X = x}).
- If P(X = x) = 0, then
  P(A|X = x) = lim_{ε→0+} P(A ∩ {X ∈ (x − ε, x + ε)}) / P({X ∈ (x − ε, x + ε)}).
Slide 33
Conditional Distribution
- The conditional distribution function FY|X(y|X) of Y
evaluated at the real number y is given by P({Y ≤ y}|X),
where P({Y ≤ y}|X = x) is defined on the previous slide.
- If (X, Y) is absolutely continuous, then the conditional density
of Y given that X = x is fY|X(y|x) = f(X,Y)(x, y)/fX(x).
Slide 34
Conditional Expectation
The conditional expectation is E(Y|X) = ψ(X), where
ψ(x) = E(Y|X = x)
     = Σ_j yj P(Y = yj | X = x), if Y is discrete, and
     = ∫_{SY} y fY|X(y|x) dy, if Y is absolutely continuous.
Slide 35
Properties of Conditional Expectation
- Linearity: E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X).
- Monotonicity: if Y1 ≤ Y2, then E(Y1|X) ≤ E(Y2|X).
- E(c|X) = c.
- E(E(Y|X)) = E(Y).
- For any measurable function g, E(g(X)Y|X) = g(X)E(Y|X).
- E(Y|X) is the best predictor of Y from X in the mean square
sense. This means that, for all random variables Z = g(X),
the expected quadratic error E((g(X) − Y)²) is minimised
when g(X) = E(Y|X) (see Borovkov, page 57).
Slide 36
Exercise
Let Ω = {a, b, c, d}, with P({a}) = 1/2, P({b}) = P({c}) = 1/8 and
P({d}) = 1/4.
Define random variables
Y(ω) = 1 if ω = a or b, and 0 if ω = c or d,
and
X(ω) = 2 if ω = a or c, and 5 if ω = b or d.
Compute E(X), E(X|Y) and E(E(X|Y)).
Slide 37
Example
The number of storms, N, in the upcoming rainy season is
distributed according to a Poisson distribution with a parameter
value Λ that is itself random. Specifically, Λ is uniformly
distributed over (0, 5). The distribution of N is called a mixed
Poisson distribution.
1. Find the probability there are at least two storms this season.
2. Calculate E(N|Λ) and E(N²|Λ).
3. Derive the mean and variance of N.
Slide 38
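Part 1 can be checked numerically by conditioning on Λ and integrating against its uniform density: P(N ≥ 2) = 1 − (1/5)∫_0^5 (1 + λ)e^{−λ} dλ, and the integral has the closed form 2 − 7e^{−5}. A sketch comparing the closed form with a midpoint-rule evaluation:

```python
from math import exp

# Closed form: P(N >= 2) = 1 - (2 - 7 e^{-5}) / 5
p_at_least_two = 1 - (2 - 7 * exp(-5)) / 5

# Midpoint-rule check of (1/5) * integral_0^5 (1 + lam) e^{-lam} d lam
n = 100_000
h = 5 / n
integral = sum((1 + (i + 0.5) * h) * exp(-(i + 0.5) * h) * h for i in range(n))
p_check = 1 - integral / 5
```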
Exercise
The joint density of X and Y is given by
fX,Y(x, y) = e^{−x/y} e^{−y} / y,  x > 0, y > 0.
Calculate E[X|Y] and then calculate E[X].
Slide 39
Limit Theorems (Borovkov 2.9)
The Law of Large Numbers (LLN) states that if X1, X2, ... are
independent and identically-distributed with mean μ, then
X̄n := (1/n) Σ_{j=1}^n Xj → μ
as n → ∞.
In the strong form, this is true almost surely, which means that it
is true on a set A of sequences x1, x2, ... that has probability one.
In the weak form, this is true in probability, which means that, for
all ε > 0,
P(|X̄n − μ| > ε) → 0
as n → ∞.
Slide 40
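A minimal seeded simulation illustrating the LLN for fair coin flips (mean μ = 1/2); the sample size and seed are illustrative choices of mine:

```python
import random

# Sample mean of n Bernoulli(1/2) variables; by the LLN it should
# be close to 0.5 for large n.
rng = random.Random(42)
n = 100_000
running_sum = 0
for _ in range(n):
    running_sum += rng.random() < 0.5
sample_mean = running_sum / n
```

With n = 100000 the standard deviation of the sample mean is about 0.0016, so deviations beyond 0.01 are extremely unlikely.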
Limit Theorems (Borovkov 2.9)
The Central Limit Theorem (CLT) states that if X1, X2, ... are
independent and identically-distributed with mean μ and variance
σ², then for any x,
P((X̄n − μ)/(σ/√n) < x) → Φ(x) := ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt
as n → ∞.
That is, a suitably-scaled variation from the mean approaches a
standard normal distribution as n → ∞.
Slide 41
Limit Theorems (Borovkov 2.9)
The Poisson Limit Theorem states that if X1, X2, ... are
independent Bernoulli random variables with
P(Xi = 1) = 1 − P(Xi = 0) = pi, then X1 + X2 + ⋯ + Xn is
well-approximated by a Poisson random variable with parameter
λ = p1 + ⋯ + pn.
Specifically, with W = X1 + X2 + ⋯ + Xn, then, for any Borel set
B ⊂ R,
P(W ∈ B) ≈ P(Y ∈ B)
where Y ~ Po(λ).
There is, in fact, a bound on the accuracy of this approximation:
|P(W ∈ B) − P(Y ∈ B)| ≤ Σ_{i=1}^n pi² / max(1, λ).
Slide 42
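The bound can be checked exactly on the event B = {0}, since P(W = 0) = Π(1 − pi) and P(Y = 0) = e^{−λ}. A sketch with some illustrative (made-up) success probabilities:

```python
from math import exp, prod

ps = [0.01, 0.02, 0.05, 0.03, 0.04]  # illustrative p_i values
lam = sum(ps)

p_w_zero = prod(1 - p for p in ps)  # P(W = 0), exact
p_y_zero = exp(-lam)                # P(Y = 0), Poisson approximation
bound = sum(p * p for p in ps) / max(1, lam)

error = abs(p_w_zero - p_y_zero)
```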
Example
Suppose there are three ethnic groups, A (20%), B (30%) and C
(50%), living in a city with a large population. Suppose 0.5%, 1%
and 1.5% of people in A, B and C respectively are over 200cm tall.
If we know that, of 300 people selected, 50, 50 and 200 are from A,
B and C, what is the probability that at least four will be over
200cm?
Slide 43
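Treating each selected person as an independent Bernoulli trial, the Poisson Limit Theorem suggests approximating the count of tall people by Po(λ) with λ = 50(0.005) + 50(0.01) + 200(0.015) = 3.75; a sketch of that approximation (the exact answer would use the product of three binomials):

```python
from math import exp

lam = 50 * 0.005 + 50 * 0.01 + 200 * 0.015  # 3.75

# P(W >= 4) ~ 1 - P(Y <= 3) for Y ~ Po(lam):
# P(Y = k) = e^{-lam} lam^k / k!
p_le_3 = exp(-lam) * sum(lam**k / f for k, f in [(0, 1), (1, 1), (2, 2), (3, 6)])
p_at_least_4 = 1 - p_le_3
```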
Stochastic Processes (Borovkov 2.10)
A collection of random variables {Xt, t ∈ T} (or {X(t), t ∈ T})
on a common probability space (Ω, F, P) is called a stochastic process.
The index variable t is often called time.
- If T = {1, 2, ...} or {..., −2, −1, 0, 1, 2, ...}, the process is a discrete time process.
- If T = R or [0, ∞), the process is a continuous time process.
- If T = R^d, then the process is a spatial process, for example temperature at t ∈ T ⊂ R², which could be, say, the University campus.
Slide 44
Examples of Stochastic Processes
A process Xt is called a Gaussian process if every finite collection
(Xt1, ..., Xtk) has a multivariate normal distribution.
Different processes can be modelled by making different
assumptions about the dependence between the Xt for different t.
Standard Brownian Motion is a Gaussian process where
- for any 0 ≤ s1 < t1 ≤ s2 < ⋯ ≤ sk < tk, the increments X(t1) − X(s1), ..., X(tk) − X(sk) are independent, and
- V(X(t1) − X(s1)) = t1 − s1 for all s1 < t1.
http://www.ms.uky.edu/~mai/java/stat/brmo.html
Brownian motion
Slide 45
Examples of Stochastic Processes
Xt is the number of sales of an item up to time t.
{Xt , t 0} is called a counting process.
Slide 46
Examples of Stochastic Processes
Xt is the number of people in a queue at time t.
Slide 47
Interpretations
We can think of Ω as consisting of the set of sample paths
ω = {Xt : t ∈ T}, that is, a set of sequences if T is discrete or a
set of functions if T is continuous. Each ω has a value at
each time point t ∈ T. With this interpretation,
- for a fixed ω, we can think of t as a variable and X(t) as a deterministic function (realization, trajectory, sample path) of the process,
- if we allow ω to vary, we get a collection of trajectories,
- for fixed t, with ω varying, we see that Xt(ω) is a random variable,
- if both ω and t are fixed, then Xt(ω) is a real number.
Slide 48
Examples of Stochastic Processes
If Xt is a counting process:
- for fixed ω, Xt(ω) is a non-decreasing step function of t,
- for fixed t, Xt(ω) is a non-negative integer-valued random variable,
- for s < t, Xt − Xs is the number of events that have occurred in the interval (s, t].
If Xt is the number of people in a queue at time t, then
{Xt : t ≥ 0} is a stochastic process where, for each t, Xt(ω) is a
non-negative integer-valued random variable, but it is NOT a
counting process because, for fixed ω, Xt(ω) can decrease.
Slide 49
Finite-Dimensional Distributions
Knowing just the one-dimensional (individual) distributions of Xt
for all t is not enough to describe a stochastic process.
To specify the complete distribution of a stochastic process
{Xt, t ∈ T}, we need to know the finite-dimensional distributions,
that is, the family of joint distribution functions
F_{t1,t2,...,tk}(x1, ..., xk) of Xt1, ..., Xtk for all k ≥ 1 and
t1, ..., tk ∈ T.
Slide 50
Discrete-Time Markov Chains
We are frequently interested in applications where we have a
sequence X1, X2, ... of outputs (which we model as random
variables) in discrete time. For example,
- DNA: A (adenine), C (cytosine), G (guanine), T (thymine).
- Texts: Xj takes values in some alphabet, for example {A, B, ..., Z, a, ...}.
Applications include
- developing and testing compression software,
- cryptology: codes, encoding and decoding, and
- attributing manuscripts.
Slide 51
Independence?
Is it reasonable to assume that neighbouring letters are
independent?
- Take a text T = {a1 ⋯ an} of length n.
- Let nℓ = #{i ≤ n : ai = ℓ} and nℓj = #{i ≤ n − 1 : ai ai+1 = ℓj}.
- Assuming that T is random, we expect nℓ/n ≈ P(letter = ℓ) and nℓj/n ≈ P(two letters = ℓj).
- If letters were independent, we would have P(two letters = ℓj) = P(letter = ℓ)P(letter = j), so we would expect that nℓj/n ≈ (nℓ/n)(nj/n).
- However, take ℓ = j = a: P(letter = a) ≈ 0.08, but "aa" is very rare.
We conclude that assuming independence does not lead to a good
model for text.
Slide 52
The Markov Property
The Markov property embodies a natural first generalisation of the
independence assumption. It assumes a kind of one-step
dependence or memory. Specifically, for all Borel sets B,
P(Xn+1 ∈ B | Xn = xn, Xn−1 = xn−1, ...) = P(Xn+1 ∈ B | Xn = xn).
Slide 53
Discrete-Time Markov Chains
A random sequence {Xn, n ≥ 0} with a countable state space
(without loss of generality {1, 2, ...}) forms a DTMC if
P(Xn+1 = k | Xn = j, Xn−1 = xn−1, ..., X0 = x0) = P(Xn+1 = k | Xn = j).
This enables us to write
P(Xn+1 = k | Xn = j) = pjk(n).
Furthermore, we commonly assume that the transition probabilities
pjk(n) do not depend on n, in which case the DTMC is called
homogeneous (more precisely, time homogeneous) and we write
pjk(n) = pjk.
Slide 54
Discrete-Time Markov Chains
For a homogeneous DTMC, we define the transition matrix to be a
matrix with rows and columns corresponding to the states of the
process and whose (j, k)th entry is pjk. So

P =
[ p11  p12  ...  p1m ]
[ p21  p22  ...  p2m ]
[ ...                ]
[ pm1  pm2  ...  pmm ]
Slide 55
Discrete-Time Markov Chains
We can associate a directed graph with a DTMC by letting the
nodes correspond to states and putting in an arc j → k if pjk > 0.
Slide 56
Discrete-Time Markov Chains
For a transition matrix of a DTMC:
- each entry is ≥ 0,
- each row sums to 1.
Any square matrix having these two properties is called a
stochastic matrix.
Slide 57
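These two properties can be checked mechanically, and a stochastic matrix is all that is needed to simulate one step of the chain. A small sketch (the helper names and example matrix are mine):

```python
import random

def is_stochastic(P, tol=1e-12):
    # Stochastic matrix: nonnegative entries, each row summing to 1.
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1) < tol
        for row in P
    )

def step(P, state, rng):
    # Sample the next state (0-indexed) from row `state` of P
    # by inverting the cumulative row probabilities.
    u, acc = rng.random(), 0.0
    for k, p in enumerate(P[state]):
        acc += p
        if u < acc:
            return k
    return len(P[state]) - 1  # guard against floating-point round-off

P = [[0.9, 0.1], [0.5, 0.5]]
```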
Discrete-Time Markov Chains
Examples:
I
- If the {Xn} are independent and identically-distributed random variables with P(Xi = k) = pk, what is the transition matrix of the DTMC?
- A communication system transmits the digits 0 and 1. At each time point, there is a probability p that the digit will not change and probability 1 − p that it will change.
Slide 58
Discrete-Time Markov Chains
- Suppose that whether or not it rains tomorrow depends on
previous weather conditions only through whether or not it is
raining today. Suppose also that if it rains today, then it will rain
tomorrow with probability p, and if it does not rain today, then it
will rain tomorrow with probability q. If we say that the process is
in state 0 when it rains and state 1 when it does not rain, then the
above is a two-state Markov chain.
- A simple random walk. Let a sequence of random variables
{Xn} ⊂ Z be defined by Xn+1 = Xn + Yn+1, where {Yn} are
independent and identically-distributed random variables with
P(Yn = 1) = p, P(Yn = −1) = 1 − p.
Slide 59
Discrete-Time Markov Chains
The n-step transition probabilities P(Xm+n = j | Xm = i) of a
homogeneous DTMC do not depend on m. For n = 1, 2, ..., we
denote them by
pij^(n) = P(Xm+n = j | Xm = i).
It is also convenient to use the notation
pij^(0) := 1 if j = i, and 0 if j ≠ i.
Slide 60
Discrete-Time Markov Chains
The Chapman-Kolmogorov equations show how we can calculate
the pij^(n) from the pij. For n = 1, 2, ... and any r = 1, 2, ..., n,
pij^(n) = Σ_k pik^(r) pkj^(n−r).
Slide 61
Discrete-Time Markov Chains
If we define the n-step transition matrix P^(n) as the matrix whose
(j, k)th entry is pjk^(n), then the Chapman-Kolmogorov equations
can be written in the matrix form
P^(n) = P^(r) P^(n−r)
with P^(1) = P. By mathematical induction, it follows that
P^(n) = P^n,
the nth power of P.
Slide 62
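The matrix form is easy to verify numerically; a plain-Python sketch using an illustrative 2-state matrix of my choosing:

```python
# Check P^(n) = P^(r) P^(n-r) = P^n for a small stochastic matrix.
def matmul(A, B):
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def matpow(P, n):
    out = P
    for _ in range(n - 1):
        out = matmul(out, P)
    return out

P = [[0.9, 0.1], [0.5, 0.5]]
P3 = matpow(P, 3)                         # P^(3) = P^3
lhs = matmul(matpow(P, 1), matpow(P, 2))  # P^(1) P^(2)
```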
Discrete-Time Markov Chains
How do we determine the distribution of a DTMC?
We have
- the initial distribution π0, where π0(j) = P(X0 = j) for all j, and
- the transition matrix P.
In principle, we can use these and the Markov property to derive
the finite dimensional distributions, although the calculations are
frequently intractable.
For k ≥ 1 and t1 < ⋯ < tk ∈ Z+,
P(Xt1 = x1, Xt2 = x2, ..., Xtk = xk)
= [π0 P^{t1}]_{x1} [P^{t2−t1}]_{x1 x2} ⋯ [P^{tk−tk−1}]_{x(k−1) xk}.
Slide 63
Discrete-Time Markov Chains
Example
Suppose P(X0 = 1) = 1/3, P(X0 = 2) = 0, P(X0 = 3) = 1/2,
P(X0 = 4) = 1/6 and

P =
[ 1/4  0    1/4  1/2 ]
[ 1/4  1/4  1/4  1/4 ]
[ 0    0    2/3  1/3 ]
[ 1/2  0    1/2  0   ]

- Find the distribution of X1,
- calculate P(Xn+2 = 2 | Xn = 4), and
- calculate P(X3 = 2, X2 = 3, X1 = 1).
Slide 64
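This example can be worked exactly with rational arithmetic; a sketch, with 0-indexed lists standing in for the slide's 1-indexed states:

```python
from fractions import Fraction as F

# pi1 = pi0 P gives the distribution of X1; (P^2)[4][2] gives
# P(X_{n+2} = 2 | X_n = 4).
pi0 = [F(1, 3), F(0), F(1, 2), F(1, 6)]
P = [
    [F(1, 4), F(0), F(1, 4), F(1, 2)],
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
    [F(0), F(0), F(2, 3), F(1, 3)],
    [F(1, 2), F(0), F(1, 2), F(0)],
]

pi1 = [sum(pi0[i] * P[i][j] for i in range(4)) for j in range(4)]
P2 = [[sum(P[i][k] * P[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
p_42 = P2[3][1]  # row/column offsets: slide state 4 -> index 3, state 2 -> index 1
```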
Discrete-Time Markov Chains
Fundamental questions that we quite often want to ask are
- What proportion of time does the chain spend in each state in the long run?
- Does this question even make sense?
The answer depends on the classification of states.
Slide 65
Discrete-Time Markov Chains
Here are some definitions.
- State k is accessible from state j, denoted by j → k, if there
exists an n ≥ 1 such that pjk^(n) > 0. That is, there exists a
path j = i0, i1, i2, ..., in = k such that pi0i1 pi1i2 ⋯ pi(n−1)in > 0.
- If j → k and k → j, then states j and k communicate, denoted by j ↔ k.
- State j is called non-essential if there exists a state k such that j → k but k ↛ j.
- State j is called essential if j → k implies that k → j.
- A state j is an absorbing state if pjj = 1. An absorbing state is essential, but essential states do not have to be absorbing.
Slide 66
Discrete-Time Markov Chains
Example
Draw a transition diagram and then classify the states of a DTMC
with transition matrix

P =
[ 0    0.5  0.5  0   ]
[ 0.5  0    0    0.5 ]
[ 0    0    0.5  0.5 ]
[ 0    0    0.5  0.5 ]
Slide 67
Discrete-Time Markov Chains
A state j such that j ↛ j is called ephemeral. Ephemeral
states usually don't add anything to a DTMC model, and we are
going to assume that there are no such states.
With this assumption, the communication relation ↔ has the
properties
- j ↔ j (reflexivity),
- j ↔ k if and only if k ↔ j (symmetry), and
- if j ↔ k and k ↔ i, then j ↔ i (transitivity).
A relation that satisfies these properties is known as an equivalence
relation.
Slide 68
Discrete-Time Markov Chains
Consider a set S whose elements can be related to each other via
an equivalence relation ~. Then S can be partitioned into a
collection of disjoint subsets S1, S2, S3, ..., SM (where M might be
infinite) such that j, k ∈ Sm implies that j ~ k.
So the state space of a DTMC is partitioned into communicating
classes by the communication relation ↔.
Slide 69
Discrete-Time Markov Chains
An essential state cannot be in the same communicating class as a
non-essential state. This means that we can divide the sets in the
partition S1, S2, S3, ..., SM into a collection S1ⁿ, S2ⁿ, ..., S(Mn)ⁿ of
non-essential communicating classes and a collection
S1ᵉ, S2ᵉ, ..., S(Me)ᵉ of essential communicating classes.
If the DTMC starts in a state from a non-essential communicating
class Smⁿ, then once it leaves, it can never return. On the other
hand, if the DTMC starts in a state from an essential
communicating class Smᵉ, then it can never leave it.
Definition:
If a DTMC has only one communicating class, that is, all states
communicate with each other, then it is called an irreducible
DTMC.
Slide 70
Discrete-Time Markov Chains
Example
Classify the states of the DTMC with

P =
[ 0.5   0.5   0     0    ]
[ 0.5   0.5   0     0    ]
[ 0.25  0.15  0.45  0.15 ]
[ 0     0     0     1    ]
Slide 71
Discrete-Time Markov Chains
Exercise
Classify the states of the DTMC with (+ denotes a positive entry)

P =
[ 0  0  +  0  0  0  + ]
[ 0  +  0  +  0  0  + ]
[ +  0  0  0  0  0  0 ]
[ 0  0  0  +  0  0  0 ]
[ 0  +  0  0  0  0  0 ]
[ 0  +  0  0  +  +  0 ]
[ 0  0  +  0  0  0  + ]
Slide 72
Discrete-Time Markov Chains
Now let's revisit the random walk example where
Xn+1 = Xn + Yn+1, with {Yn} independent and
identically-distributed, X0 = 0, P(Yn = 1) = p and
P(Yn = −1) = 1 − p = q. This DTMC is irreducible and so all
states are essential. However,
- if p > q, then E(Xn − X0) = n(p − q) > 0, so Xn will drift to infinity, at least in expectation,
- for each fixed state j, with probability one, the DTMC will visit j only finitely many times.
So being essential does not capture a state's long-run behaviour
in this case: we need a further classification of the states.
Slide 73
Recurrence and Transience of States
Let X0 = j and let Ti(j) be the time between the (i − 1)st and ith
return to state j. Then T1(j), T2(j), ... are independent and
identically distributed random variables.
Slide 74
Recurrence and Transience of States
Our further classification relies on calculating the probability that
the DTMC returns to a state once it has left.
Let
fj = P(Xn = j for some n > 0 | X0 = j) = P(T1(j) < ∞ | X0 = j).
The state j is said to be recurrent if fj = 1 and transient if fj < 1.
Slide 75
Characterizing Recurrence
If the DTMC starts in a recurrent state j then, with probability
one, it will eventually re-enter j. At this point, the process will
start anew (by the Markov property) and it will re-enter again with
probability one. So the DTMC will (with probability one) visit j
infinitely-many times.
If the DTMC starts in a transient state j, then there is a probability
1 − fj > 0 that it will never return. So, letting Nj be the number of
visits to state j after starting there, we see that Nj has a geometric
distribution.
Specifically, for n ≥ 0,
P(Nj = n | X0 = j) = P(T1(j) < ∞, ..., Tn(j) < ∞, Tn+1(j) = ∞).
This is equal to (1 − fj)fj^n, which implies that
E(Nj | X0 = j) = fj/(1 − fj).
Slide 76
Characterizing Recurrence
If
qj = Σ_{n=1}^∞ E[I(Xn = j) | X0 = j] = Σ_{n=1}^∞ pjj^(n),
then qj = E[Σ_{n=1}^∞ I(Xn = j) | X0 = j] = E(Nj | X0 = j) = fj/(1 − fj),
so
fj = qj/(1 + qj).
It follows that state j is recurrent if and only if qj = ∞.
[Read: j is recurrent if and only if the expected number of returns
to state j is infinite.]
Slide 77
Communication classes are either recurrent or transient
Now assume that state j is recurrent and j ↔ k. There must exist
s and t such that pjk^(s) > 0 and pkj^(t) > 0. Then
pjj^(s+n+t) = P(Xs+n+t = j | X0 = j)
            ≥ P(Xs+n+t = j, Xs+n = k, Xs = k | X0 = j)
            = pjk^(s) pkk^(n) pkj^(t).
Similarly pkk^(s+n+t) ≥ pkj^(t) pjj^(n) pjk^(s) and so, for n > s + t,
pkk^(n) ≥ α pjj^(n−s−t),
where α = pjk^(s) pkj^(t). So the series Σ_{n=1}^∞ pkk^(n) must
diverge because Σ_{n=1}^∞ pjj^(n) diverges, and we conclude that
state k is also recurrent.
Slide 78
If the Markov chain is irreducible, then all states are either
recurrent or transient, and so it's appropriate to refer to the chain
itself as either recurrent or transient.
Slide 79
The Random Walk
Let Xn+1 = Xn + Yn+1 where {Yn : n ≥ 1} are independent and
identically-distributed random variables with P(Yn = 1) = p and
P(Yn = −1) = 1 − p = q.
We can compute the m-step transition probabilities from state j to
itself by observing that these probabilities are zero if m is odd and
equal to
(2n choose n) p^n q^n
if m = 2n.
Slide 80
The Random Walk
Stirling's formula, n! ≈ √(2πn) n^n e^{−n}, gives us the fact that
pjj^(2n) ≈ (4pq)^n / √(πn),
and the series Σ_{n=1}^∞ pjj^(2n)
- diverges if p = q = 1/2, so the DTMC is recurrent,
- converges if p ≠ q (compare to a geometric series, since then 4pq < 1), so the DTMC is transient.
Slide 81
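A sketch comparing the exact return probabilities with the Stirling approximation (the helper names are mine):

```python
from math import comb, pi, sqrt

# Return probabilities p_jj^(2n) = C(2n, n) p^n q^n for the simple
# random walk, versus the approximation (4pq)^n / sqrt(pi n).
def return_prob(n, p):
    q = 1 - p
    return comb(2 * n, n) * p**n * q**n

def stirling_approx(n, p):
    q = 1 - p
    return (4 * p * q) ** n / sqrt(pi * n)

exact = return_prob(50, 0.5)
approx = stirling_approx(50, 0.5)  # relative error is roughly 1/(8n)
```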
Periodicity
The Polya random walk illustrates another phenomenon that can
occur in DTMCs: periodicity.
Definition:
State j has period d if the set {n : pjj^(n) > 0} is non-empty and
has greatest common divisor d. If d > 1, we say that state j is
periodic; if state j has period 1, then we say that it is aperiodic.
Slide 82
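For a small chain, the period can be computed directly from the definition by scanning matrix powers up to a cutoff; a sketch (the helper names and cutoff are mine, and the cutoff must be large enough for the chain at hand):

```python
from math import gcd

def matmul(A, B):
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def period(P, j, n_max=50):
    # gcd of {n <= n_max : p_jj^(n) > 0}; gcd(0, n) == n starts the fold.
    d = 0
    Pn = P
    for n in range(1, n_max + 1):
        if Pn[j][j] > 0:
            d = gcd(d, n)
        Pn = matmul(Pn, P)
    return d

# A small example: returns to state 0 can only happen in an even
# number of steps, so the period is 2.
P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
```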
Discrete-Time Markov Chains
Examples
- The Polya random walk has period d = 2 for all states j.
- What is the period of the DTMC with

  P =
  [ 0  0.5  0.5 ]
  [ 1  0    0   ]
  [ 1  0    0   ]

- Find the period for the DTMC with

  P =
  [ 0  0  0.5  0.5 ]
  [ 1  0  0    0   ]
  [ 0  1  0    0   ]
  [ 0  1  0    0   ]
Slide 83
States in a communicating class have same period
Assume that state j has period dj and j ↔ k. Then, as before,
there must exist s and t such that pjk^(s) > 0 and pkj^(t) > 0. We know
straightaway that dj divides s + t, since it is possible to go from j
to itself in s + t steps.
Now take a path from k to itself in r steps. If we concatenate our
path from j to k in s steps, this r-step path, and our path from
k to j in t steps, we have an (s + r + t)-step path from j to
itself. So dj divides s + r + t, which means that dj divides r. So
dj divides the period dk of k.
Now we can switch j and k in the argument to conclude that dk
divides dj, which means that dj = dk, and all states in the same
communicating class have a common period.
Slide 84
Discrete-Time Markov Chains
The arguments on the preceding slides bring us to the following
theorem, which discusses some solidarity properties of states in the
same communicating class.
Theorem:
In any communicating class Sr of a DTMC with state space S, the
states are
- either all recurrent or all transient, and
- either all aperiodic or all periodic with a common period d > 1.
If the states of Sr are periodic with period d > 1, then
Sr = Sr^(1) ∪ Sr^(2) ∪ ⋯ ∪ Sr^(d), where the DTMC passes from
the subclass Sr^(i) to Sr^(i+1) (indices taken mod d) with
probability one at a transition.
Slide 85
Discrete-Time Markov Chains
Example: a transition diagram of a chain with period d = 4 (figure).
Slide 86
Discrete-Time Markov Chains
How do we analyse a DTMC?
- Draw a transition diagram.
- Consider the accessibility of states, then divide the state space into essential and non-essential states.
- Define the communicating classes, and divide them into recurrent and transient communicating classes.
- Decide whether the classes are periodic.
Slide 87
Discrete-Time Markov Chains
Exercises
- Analyse the DTMC with

  P =
  [ 0  0.5  0.5 ]
  [ 1  0    0   ]
  [ 1  0    0   ]

- Consider the DTMC with

  P =
  [ 0  0  0.5  0.5 ]
  [ 1  0  0    0   ]
  [ 0  1  0    0   ]
  [ 0  1  0    0   ]
Slide 88
Discrete-Time Markov Chains
Example
Analyse the Markov chain with states numbered 1 to 5 and with
one-step transition probability matrix

P =
[ 1    0    0    0  0 ]
[ 1/2  0    1/2  0  0 ]
[ 0    1/2  1/2  0  0 ]
[ 0    0    0    0  1 ]
[ 0    0    0    1  0 ]
Slide 89
Finite State DTMCs have at least one recurrent state
Recall that a state j is transient if and only if
Σ_{n=1}^∞ pjj^(n) = Σ_{n=1}^∞ E[I(Xn = j) | X0 = j] < ∞.
This means that the DTMC visits j only finitely-many times (with
probability one), given that it starts there.
Let S be the set of states, and let fj,k be the probability that the
DTMC ever visits state k, given that it starts in state j.
Slide 90
Finite State DTMCs have at least one recurrent state
If all states k ∈ S are transient, then it must be the case that

∑_{n=1}^∞ p_jj^(n) + ∑_{k≠j} f_{j,k} ∑_{n=1}^∞ p_kk^(n) < ∞.

However, the left hand side can be decomposed to the expression

∑_{k∈S} ∑_{n=1}^∞ E[I(Xn = k)|X0 = j] = ∑_{n=1}^∞ ∑_{k∈S} E[I(Xn = k)|X0 = j] = ∑_{n=1}^∞ 1 = ∞,

which is a contradiction, and so at least one state must be
recurrent.
It follows that if a finite-state DTMC is irreducible, then all states
are recurrent.
Slide 91
Recurrence in Infinite State DTMC
When a communicating class has infinitely many states, the above
line of argument does not work:

∑_{k≠j} f_{j,k} may be infinite.

And it shouldn't: the random walk with p > 1/2 has all states
transient.
Slide 92
Recurrence in Infinite State DTMC
In order to be able to tell whether a class is recurrent, we need to
be able to calculate the probability of return for at least one state.
Let's label this state 0 and denote by f_{j,0} the probability that the
DTMC ever reaches state 0, given that it starts in state j. Then
we see that the sequence {f_{j,0}} satisfies the equation

f_{j,0} = p_{j0} + ∑_{k≠0} p_{jk} f_{k,0}.    (*)
Slide 93
Solving the equation (*)
We illustrate how to solve this equation with an example.
Example: Consider a random walk on the nonnegative integers:

p_{j,j+1} = p = 1 − p_{j,j−1}, for j > 0,

and

p_{0,1} = p = 1 − p_{0,0}.

(And p_{ij} = 0 otherwise.) Equation (*) says that for j > 1,

f_{j,0} = p f_{j+1,0} + (1 − p) f_{j−1,0}

and, for j = 0, 1,

f_{j,0} = p f_{j+1,0} + (1 − p).
Slide 94
Solving the equation (*)
The first equation is a second-order linear difference equation with
constant coefficients.
These can be solved in a similar way to second-order linear
differential equations with constant coefficients, which you learned
about in Calculus II or accelerated Mathematics II.
Recall that, to solve

a d²y/dt² + b dy/dt + c y = 0,

we try a solution of the form y = y(t) = e^{λt} to obtain the
Characteristic Equation

a λ² + b λ + c = 0.
Slide 95
Solving the equation (*)
If the characteristic equation has distinct roots, λ1 and λ2, the
general solution has the form

y = A e^{λ1 t} + B e^{λ2 t}.

If the roots are coincident, the general solution has the form

y = A e^{λ1 t} + B t e^{λ1 t}.
In both cases, the values of the constants A and B are determined
by the initial conditions.
Slide 96
Solving the equation (*)
The method for solving second-order linear difference equations
with constant coefficients is similar. To solve

a u_{j+1} + b u_j + c u_{j−1} = 0,

we try a solution of the form u_j = m^j to obtain the Characteristic
Equation

a m² + b m + c = 0.
Slide 97
Solving the equation (*)
If this equation has distinct roots, m1 and m2, the general solution
has the form

u_j = A m1^j + B m2^j.

If the roots are coincident, the general solution has the form

u_j = A m1^j + B j m1^j.

The values of the constants A and B need to be determined by
boundary conditions, or other information that we have.
Slide 98
Solving the equation (*)
Back to the Example: The characteristic equation of

f_{j,0} = p f_{j+1,0} + (1 − p) f_{j−1,0}

is

p u² − u + (1 − p) = 0,

which has roots u = 1 and u = (1 − p)/p.
If (1 − p)/p ≠ 1, the general solution for j ≥ 1 is of the form

f_{j,0} = A + B ((1 − p)/p)^j.
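As a quick sanity check (my own illustration, not from the slides), the general solution can be verified numerically: for any constants A and B, f_j = A + B((1 − p)/p)^j satisfies the recursion f_j = p f_{j+1} + (1 − p) f_{j−1}.

```python
def general_solution(j, A, B, p):
    """General solution f_j = A + B*((1-p)/p)**j of the difference equation."""
    return A + B * ((1 - p) / p) ** j

# Check that f_j = p*f_{j+1} + (1-p)*f_{j-1} holds for a range of j,
# for arbitrarily chosen constants A and B.
p, A, B = 0.7, 0.3, 0.5
residuals = [
    general_solution(j, A, B, p)
    - (p * general_solution(j + 1, A, B, p) + (1 - p) * general_solution(j - 1, A, B, p))
    for j in range(1, 20)
]
max_residual = max(abs(r) for r in residuals)
```

The residuals vanish (up to floating-point error) because both roots of the characteristic equation, 1 and (1 − p)/p, satisfy the recursion.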
Slide 99
Solving the equation (*)
If (1 − p)/p > 1, then the general solution is

f_{j,0} = A + B ((1 − p)/p)^j.

Similarly, if (1 − p)/p = 1, the general solution is of the form

f_{j,0} = A + Bj.

In either case, these can only be bounded (and hence probabilities)
if B = 0, and then notice

A = f_{1,0} = p f_{2,0} + (1 − p) = pA + (1 − p),

so A = 1. This makes sense because p ≤ 1/2 and so we have a
neutral or downward drift.
Slide 100
Solving the equation (*)
However, if (1 − p)/p < 1, we need to work harder to obtain the
solution to our problem. Let f_{j,0}(m) be the probability that the
DTMC moves from state j to state 0 in less than or equal to m
steps. Because

∪_{m=1}^∞ {j → 0 in ≤ m steps} = {j → 0 ever},

we find f_{j,0}(m) ↑ f_{j,0}.
Slide 101
Solving the equation (*)
Now let {g_{j,0}} be any nonnegative solution to

g_{j,0} = p_{j0} + ∑_{k≠0} p_{jk} g_{k,0}.

We show by induction that f_{j,0}(m) ≤ g_{j,0} for all m. Clearly this is
true for m = 1. Assume that it is true for m = ℓ. Then

f_{j,0}(ℓ + 1) = p_{j0} + ∑_{k≠0} p_{jk} f_{k,0}(ℓ)
             ≤ p_{j0} + ∑_{k≠0} p_{jk} g_{k,0}
             = g_{j,0}.

It follows that f_{j,0} = lim_{m→∞} f_{j,0}(m) ≤ g_{j,0}, and so {f_{j,0}} is the
minimal nonnegative solution to (*).
Slide 102
Solving the equation (*)
For the random walk with (1 − p)/p < 1, the general solution for
j ≥ 1 was of the form

f_{j,0} = A + B ((1 − p)/p)^j.

The minimal nonnegative solution for j > 0 is

f_{j,0} = ((1 − p)/p)^j,

and f_{0,0} = 2(1 − p).
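The monotone limit f_{j,0}(m) ↑ f_{j,0} also gives a practical way to compute the minimal solution. The sketch below (my own illustration; the truncation level and iteration count are assumptions) iterates equation (*) for the random walk on a truncated state space, starting from the zero function, and converges to ((1 − p)/p)^j when p > 1/2.

```python
def minimal_hitting_probs(p, n_states=200, n_iters=5000):
    """Iterate f_{j,0}(m+1) = p_{j0} + sum_{k!=0} p_{jk} f_{k,0}(m),
    starting from f = 0. For the random walk this is
    f_1 <- p*f_2 + (1-p) and f_j <- p*f_{j+1} + (1-p)*f_{j-1} for j >= 2.
    The iterates increase to the minimal nonnegative solution of (*)
    on the truncated state space {0, 1, ..., n_states}."""
    f = [0.0] * (n_states + 1)
    for _ in range(n_iters):
        new = f[:]
        new[1] = p * f[2] + (1 - p)  # state 1 hits 0 directly w.p. 1-p
        for j in range(2, n_states):
            new[j] = p * f[j + 1] + (1 - p) * f[j - 1]
        f = new
    return f

p = 0.7
f = minimal_hitting_probs(p)
rho = (1 - p) / p  # theoretical minimal solution: f_{j,0} = rho**j for j > 0
```

The truncation error is of order rho^(2·n_states), which is negligible here.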
Slide 103
The Gambler's Ruin Problem
- Denote the initial capital of a gambler by N.
- The gambler will stop playing if he/she wins $M or loses his/her initial stake of $N.
- There is a probability p that the gambler wins $1 and a probability 1 − p that he/she loses $1 on each game.
- We assume that the outcomes of successive plays are independent.
- This is a simple DTMC with a finite state space {−N, . . . , M} and transition probabilities p_{j,j+1} = p and p_{j,j−1} = 1 − p for j ∈ {−N + 1, . . . , M − 1}, and p_{−N,−N} = p_{M,M} = 1.

The gambler would like to know the probability that he/she will
win $M before becoming bankrupt.
Slide 104
The Gambler's Ruin Problem
We use (*) to calculate the probability that the gambler's ruin
DTMC hits −N. For −N + 1 ≤ j ≤ M − 1, we have

f_{j,−N} = p f_{j+1,−N} + (1 − p) f_{j−1,−N}

with f_{−N,−N} = 1 and f_{M,−N} = 0.
When p ≠ 1/2, the general solution to the first equation is again

f_{j,−N} = A + B ((1 − p)/p)^j

(for the same range of j).
Slide 105
The Gambler's Ruin Problem
The upper boundary condition f_{M,−N} = 0 gives us

A = −B ((1 − p)/p)^M,

and the lower boundary condition f_{−N,−N} = 1 gives us

B = ( ((1 − p)/p)^{−N} − ((1 − p)/p)^M )^{−1}.

So the solution is

f_{j,−N} = ( ((1 − p)/p)^j − ((1 − p)/p)^M ) / ( ((1 − p)/p)^{−N} − ((1 − p)/p)^M ).
Slide 106
The Gambler's Ruin Problem
When p = 1/2, the general solution to the first equation is

f_{j,−N} = A + Bj.

The upper boundary condition gives us

A = −BM,

and the lower boundary condition gives us

B = −1/(M + N).

So the solution is

f_{j,−N} = (M − j)/(M + N).
Slide 107
The Gambler's Ruin Problem
The expected gain is E(G) = M − (N + M) f_{0,−N}.
Here are some numbers:
- If N = 90, M = 10 and p = 0.5 then f_{0,−N} = 0.1.
- If N = 90, M = 10 and p = 0.45 then f_{0,−N} ≈ 0.866.
- If N = 90, M = 10 and p = 0.4 then f_{0,−N} ≈ 0.983.
- If N = 99, M = 1 and p = 0.4 then f_{0,−N} ≈ 0.333.
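These numbers can be reproduced directly from the formulas above (a quick check of my own, not part of the slides):

```python
def ruin_prob(N, M, p):
    """Probability f_{0,-N} that the gambler, starting at 0,
    hits -N (ruin) before reaching M."""
    if p == 0.5:
        return M / (M + N)  # (M - j)/(M + N) evaluated at j = 0
    r = (1 - p) / p
    # (r**j - r**M) / (r**(-N) - r**M) evaluated at j = 0
    return (1 - r ** M) / (r ** (-N) - r ** M)

examples = [(90, 10, 0.5), (90, 10, 0.45), (90, 10, 0.4), (99, 1, 0.4)]
probs = [ruin_prob(N, M, p) for N, M, p in examples]
```

Rounding the four values gives 0.1, 0.866, 0.983 and 0.333, matching the slide.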
Slide 108
Long run behaviour of DTMCs
We want to know the proportion of time a DTMC spends in each
state over the long run (if this concept makes sense), which should
be the same as the limiting probabilities lim_{n→∞} p_kj^(n).
- These will be zero for transient states and non-essential states.
- For an irreducible and recurrent DTMC, we will see that these limiting probabilities exist and are even independent of k.
Slide 109
Long run behaviour of DTMCs
Recall that we used Ti(j) to denote the time between the ith and
(i − 1)st return to state j. We then defined state j (and hence its
communicating class) to be
- transient if Ti(j) = ∞ with positive probability, and
- recurrent if Ti(j) < ∞ with probability one.

There is a further classification of recurrent states. Specifically, j is
- null-recurrent if E[Ti(j)] = ∞, and
- positive-recurrent if E[Ti(j)] < ∞.

This classification is important for the calculation of the limiting
probabilities.
Slide 110
Long run behaviour of DTMCs
Examples
- The symmetric random walk with p = q = 1/2. For all j, Ti(j) < ∞ with probability one, but E[Ti(j)] = ∞. That is, all states are null-recurrent.
- A finite irreducible DTMC: E[Ti(j)] < ∞ for all j.
Slide 111
Long run behaviour of DTMCs
In the long run, how often does a DTMC visit a state j?
Let μj = E[T1(j)|X0 = j] < ∞. By the Law of Large Numbers,

T1(j) + T2(j) + · · · + Tk(j) ≈ μj k.

So there are approximately k visits in k μj time-steps, and the
relative frequency of visits to j is 1/μj. This leads us to

Theorem:
If j is an aperiodic state in a positive recurrent
communicating class, so that μj = E[T1(j)|X0 = j] < ∞, then

lim_{n→∞} p_jj^(n) = 1/μj.

In the null recurrent or transient case, where μj = ∞,

lim_{n→∞} p_jj^(n) = 0.
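For a small chain the theorem is easy to check numerically. The sketch below (my own illustration; the chain is an arbitrary two-state example) computes the stationary probability of state 0 from the two-state balance equation π0 p01 = π1 p10, so that 1/μ0 = π0, and compares it with the first column of a high matrix power.

```python
def mat_mult(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_power(P, n):
    """Compute P**n by repeated multiplication (n >= 1)."""
    result = P
    for _ in range(n - 1):
        result = mat_mult(result, P)
    return result

# Two-state chain: the balance equation pi_0 * p01 = pi_1 * p10
# gives pi_0 = p10 / (p01 + p10).
P = [[0.9, 0.1], [0.3, 0.7]]
pi0 = 0.3 / (0.1 + 0.3)  # = 0.75, so the mean return time mu_0 = 1/pi0
Pn = mat_power(P, 200)   # every row of P^n should approach (0.75, 0.25)
```

Both entries of the first column of P^200 agree with π0 = 1/μ0 = 0.75, independently of the starting state, as the theorems on this and the next slide assert.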
Slide 112
Long run behaviour of DTMCs
Why do we need aperiodicity?
P = [ 0 1
      1 0 ].
Slide 113
Long run behaviour of DTMCs
Further:
Theorem:
If i ↔ j are aperiodic and the communicating class
is positive recurrent, with μj = E[T1(j)|X0 = j] < ∞, then

lim_{n→∞} p_ij^(n) = 1/μj.

Note that the right hand side doesn't have an i in it!
Slide 114
Ergodicity and Stationarity
Definition: We call the DTMC ergodic if for all j the limit
πj = lim_{n→∞} p_ij^(n) exists and does not depend on i.
For an ergodic DTMC with limiting distribution π = (π1, π2, . . .),
- ∑_j πj = 1.
- For any initial probability distribution λ0, λ0 P^n → π as n → ∞.
- πP = π and hence πP^n = π.

Any distribution satisfying the last item is called a stationary
distribution for the DTMC.
Slide 115
Ergodicity and Stationarity
Theorem:
A DTMC {Xn} is ergodic if and only if it is
irreducible, aperiodic and positive recurrent.
In this case there is a unique solution π to the system of linear
equations

πP = π with ∑_j πj = 1

(that is, a unique stationary distribution), and moreover πj = 1/μj.
We often test whether an irreducible, aperiodic DTMC is ergodic
by attempting to solve these equations.
Slide 116
Examples
An m × m stochastic matrix P is called doubly-stochastic if all the
column sums are equal to one.
If an aperiodic DTMC has a doubly-stochastic transition matrix,
then we can easily verify that

(1/m, 1/m, . . . , 1/m) P = (1/m, 1/m, . . . , 1/m).

It follows that

π = (1/m, 1/m, . . . , 1/m),

and the stationary distribution is uniform on S.
Find a stationary distribution for

P = [ 0 1
      1 0 ].
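The doubly-stochastic claim is easy to check in code (my own sketch; the 3 × 3 example matrix is an arbitrary doubly-stochastic choice, not from the slides):

```python
def is_doubly_stochastic(P, tol=1e-12):
    """Check that every row sum and every column sum of P equals one."""
    m = len(P)
    rows_ok = all(abs(sum(row) - 1) < tol for row in P)
    cols_ok = all(abs(sum(P[i][j] for i in range(m)) - 1) < tol for j in range(m))
    return rows_ok and cols_ok

# An arbitrary doubly-stochastic example.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
m = len(P)
uniform = [1.0 / m] * m
# Each entry of uniform*P is (1/m) times a column sum, i.e. 1/m again.
image = [sum(uniform[i] * P[i][j] for i in range(m)) for j in range(m)]
```

The computation makes the verification transparent: multiplying the uniform row vector by P produces (1/m) × (column sum) in each coordinate, which is again 1/m.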
Slide 117
Examples
Find a stationary distribution for

P = [ 1/2 1/2  0
      1/2 1/2  0
       0   0   1 ].

Find the stationary distribution for

P = [ 0.8   0.2   0    0
       0     0   0.5  0.5
      0.75  0.25  0    0
       0     0   0.4  0.6 ].
Slide 118
Random walk with one barrier
Give a criterion for ergodicity of the DTMC with state space
{0, 1, 2, } and transition matrix
P = [ q  p  0  0  · · ·
      q  0  p  0  · · ·
      0  q  0  p  · · ·
      ..  ..  ..  ..  .  .  . ].
When the DTMC is ergodic, derive its stationary distribution.
Slide 119
Random walk with one barrier
We saw that this DTMC is irreducible, aperiodic and recurrent
when p ≤ q. Solve the linear equations

(π0, π1, . . .) = (π0, π1, . . .) P

to get πk = (p/q)^k π0.
We also need ∑_{k≥0} πk = 1. The sum on the left hand side is finite
only if p < q, in which case π0 = 1 − (p/q) and so

πk = [1 − (p/q)] (p/q)^k.

So there is a solution to π = πP with ∑_{k≥0} πk = 1, and hence
the DTMC is ergodic, only if p < q, in which case

μk = 1 / [ (p/q)^k (1 − (p/q)) ].
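A direct check (my own sketch, with arbitrary p < q) that πk = [1 − (p/q)](p/q)^k satisfies the stationary equations π = πP for this chain:

```python
p, q = 0.3, 0.7
rho = p / q
K = 50  # number of balance equations to check
pi = [(1 - rho) * rho ** k for k in range(K + 2)]

# Stationary equations for the barrier walk:
#   pi_0 = q*pi_0 + q*pi_1           (column 0 of P), and
#   pi_k = p*pi_{k-1} + q*pi_{k+1}   for k >= 1.
err0 = abs(pi[0] - (q * pi[0] + q * pi[1]))
errs = [abs(pi[k] - (p * pi[k - 1] + q * pi[k + 1])) for k in range(1, K)]
max_err = max([err0] + errs)
```

All equations hold to floating-point precision, confirming the geometric form of the stationary distribution.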
Slide 120
The π distribution
For an irreducible, aperiodic and positive-recurrent DTMC, the
distribution defined by π has a number of interpretations.
It can be seen as
- limiting,
- stationary,
- ergodic.
Slide 121
The π distribution
The limiting interpretation
By definition

πj = lim_{n→∞} p_ij^(n),

and so πj is the limiting probability that the DTMC is in state j.
This means that so long as the DTMC has been going for quite a
long time, the probability that it is in state j will be approximately
πj.
Slide 122
The π distribution
The stationary interpretation
We showed that

πP = π,

and so π has a stationary interpretation. If the DTMC starts with
probability distribution π, it will persist with this distribution
forever.
Furthermore, the DTMC is strictly stationary in the sense that its
finite-dimensional distributions are invariant. For each k,
(Xm, Xm+1, . . . , Xm+k) has the same distribution as
(X0, X1, . . . , Xk), independently of m.
Slide 123
The π distribution
The ergodic interpretation
This means that for sample paths of the DTMC that constitute a set
of probability one, the proportion of time that the process spends
in state j is πj.
This can be formally stated as a Law of Large Numbers. For any
initial distribution, the proportion of time ∑_{i=1}^n I(Xi = j)/n that
the DTMC spends in state j converges to πj with probability one
as n → ∞.
Slide 124
Reducible DTMC
- From non-essential states, the DTMC will eventually leave forever.
- As soon as it enters a class Sr of essential states, it will stay in Sr forever.
- If the stochastic sub-matrix Pr, derived from P by restricting it to the states of Sr, is aperiodic and positive-recurrent, then this subchain is ergodic and

P(Xn = j|X0 ∈ Sr) → πj^(r) for j ∈ Sr,
P(Xn = j|X0 ∈ Sr) → 0 for j ∉ Sr.
Slide 125
Periodic DTMC
If Pr is a periodic subchain with a period d > 1, which is
recurrent with a finite expected recurrence time, then for
0 ≤ k ≤ d − 1, {X_{nd+k} | X0 ∈ Sr} is an ergodic DTMC with
state space Sr^(k).
For any ℓ and k = 0, 1, . . . , d − 1,

P(X_{nd+k} = j | X0 ∈ Sr^(ℓ)) → d πj^(r) as n → ∞, for j ∈ Sr^(ℓ+k (mod d)),
P(X_{nd+k} = j | X0 ∈ Sr^(ℓ)) → 0 for j ∉ Sr^(ℓ+k (mod d)),

so ∑_{j ∈ Sr^(ℓ)} d πj^(r) = 1 for any ℓ.
Slide 126
Discrete-Time Markov Chains
Example
Classify the DTMC with
P = [transition matrix as printed on the slide; the entries visible in the extraction include 0.1, 0.9, 1/4, 1/3, 1/5 and 1/8]
and discuss its properties.
Slide 127
Good Trick
Sometimes we want to model a physical system where the future
depends on part of the past. Consider the following example. A
sequence of random variables {Xn} describes the weather at a
particular location, with Xn = 1 if it is sunny and Xn = 2 if it is
rainy on day n.
Suppose that the weather on day n + 1 depends on the weather
conditions on days n − 1 and n as shown below:

P(Xn+1 = 2 | Xn = Xn−1 = 2) = 0.6
P(Xn+1 = 1 | Xn = Xn−1 = 1) = 0.8
P(Xn+1 = 2 | Xn = 2, Xn−1 = 1) = 0.5
P(Xn+1 = 1 | Xn = 1, Xn−1 = 2) = 0.75
Slide 128
Good Trick
If we put Yn = (Xn−1, Xn), then Yn is a DTMC. The possible
states are 1' = (1, 1), 2' = (1, 2), 3' = (2, 1) and 4' = (2, 2).
We see that {Yn : n ≥ 1} is a DTMC with transition matrix

P = [ 0.8   0.2   0    0
       0     0   0.5  0.5
      0.75  0.25  0    0
       0     0   0.4  0.6 ].
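To take the trick one step further (my own sketch, not part of the slides), we can compute the stationary distribution of the Y-chain numerically and read off the long-run fraction of sunny days as the total mass on states 1' = (1, 1) and 3' = (2, 1), the states in which Xn = 1:

```python
P = [[0.8, 0.2, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.75, 0.25, 0.0, 0.0],
     [0.0, 0.0, 0.4, 0.6]]

# Power iteration: pi P^n converges to the stationary distribution
# because the Y-chain is irreducible and aperiodic (states 1' and 4'
# have self-loops).
pi = [0.25] * 4
for _ in range(2000):
    pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]

# Xn = 1 (sunny) exactly when Yn is in state 1' or 3'.
sunny_fraction = pi[0] + pi[2]
```

Solving πP = π by hand gives π proportional to (3.75, 1, 1, 1.25), so the long-run sunny fraction is 4.75/7 = 19/28, which the iteration reproduces.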
Slide 129