KEMBAR78
Cs229-Notes12 Reinforcement in Control | PDF | Applied Mathematics | Algorithms
0% found this document useful (0 votes)
25 views17 pages

Cs229-Notes12 Reinforcement in Control

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views17 pages

Cs229-Notes12 Reinforcement in Control

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

CS229 Lecture notes

Andrew Ng

Part XIII
Reinforcement Learning and
Control
We now begin our study of reinforcement learning and adaptive control.
In supervised learning, we saw algorithms that tried to make their outputs
mimic the labels y given in the training set. In that setting, the labels gave
an unambiguous “right answer” for each of the inputs x. In contrast, for
many sequential decision making and control problems, it is very difficult to
provide this type of explicit supervision to a learning algorithm. For example,
if we have just built a four-legged robot and are trying to program it to walk,
then initially we have no idea what the “correct” actions to take are to make
it walk, and so do not know how to provide explicit supervision for a learning
algorithm to try to mimic.
In the reinforcement learning framework, we will instead provide our al-
gorithms only a reward function, which indicates to the learning agent when
it is doing well, and when it is doing poorly. In the four-legged walking
example, the reward function might give the robot positive rewards for mov-
ing forwards, and negative rewards for either moving backwards or falling
over. It will then be the learning algorithm’s job to figure out how to choose
actions over time so as to obtain large rewards.
Reinforcement learning has been successful in applications as diverse as
autonomous helicopter flight, robot legged locomotion, cell-phone network
routing, marketing strategy selection, factory control, and efficient web-page
indexing. Our study of reinforcement learning will begin with a definition of
the Markov decision processes (MDP), which provides the formalism in
which RL problems are usually posed.

1
2

1 Markov decision processes


A Markov decision process is a tuple (S, A, {Psa }, γ, R), where:

• S is a set of states. (For example, in autonomous helicopter flight, S


might be the set of all possible positions and orientations of the heli-
copter.)

• A is a set of actions. (For example, the set of all possible directions in


which you can push the helicopter’s control sticks.)

• Psa are the state transition probabilities. For each state s ∈ S and
action a ∈ A, Psa is a distribution over the state space. We’ll say more
about this later, but briefly, Psa gives the distribution over what states
we will transition to if we take action a in state s.

• γ ∈ [0, 1) is called the discount factor.

• R : S × A 7→ R is the reward function. (Rewards are sometimes also


written as a function of a state S only, in which case we would have
R : S 7→ R).

The dynamics of an MDP proceeds as follows: We start in some state s0 ,


and get to choose some action a0 ∈ A to take in the MDP. As a result of our
choice, the state of the MDP randomly transitions to some successor state
s1 , drawn according to s1 ∼ Ps0 a0 . Then, we get to pick another action a1 .
As a result of this action, the state transitions again, now to some s2 ∼ Ps1 a1 .
We then pick a2 , and so on…. Pictorially, we can represent this process as
follows:
a0 a1 a2 a3
s0 −→ s1 −→ s2 −→ s3 −→ ...
Upon visiting the sequence of states s0 , s1 , . . . with actions a0 , a1 , . . ., our
total payoff is given by

R(s0 , a0 ) + γR(s1 , a1 ) + γ 2 R(s2 , a2 ) + · · · .

Or, when we are writing rewards as a function of the states only, this becomes

R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + · · · .

For most of our development, we will use the simpler state-rewards R(s),
though the generalization to state-action rewards R(s, a) offers no special
difficulties.
3

Our goal in reinforcement learning is to choose actions over time so as to


maximize the expected value of the total payoff:
 
E R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + · · ·
Note that the reward at timestep t is discounted by a factor of γ t . Thus, to
make this expectation large, we would like to accrue positive rewards as soon
as possible (and postpone negative rewards as long as possible). In economic
applications where R(·) is the amount of money made, γ also has a natural
interpretation in terms of the interest rate (where a dollar today is worth
more than a dollar tomorrow).
A policy is any function π : S 7→ A mapping from the states to the
actions. We say that we are executing some policy π if, whenever we are
in state s, we take action a = π(s). We also define the value function for
a policy π according to

V π (s) = E R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + · · · s0 = s, π].
V π (s) is simply the expected sum of discounted rewards upon starting in
state s, and taking actions according to π.1
Given a fixed policy π, its value function V π satisfies the Bellman equa-
tions: X
V π (s) = R(s) + γ Psπ(s) (s′ )V π (s′ ).
s′ ∈S

This says that the expected sum of discounted rewards V π (s) for starting
in s consists of two terms: First, the immediate reward R(s) that we get
right away simply for starting in state s, and second, the expected sum of
future discounted rewards. Examining the second term in more detail, we
see that the summation term above can be rewritten Es′ ∼Psπ(s) [V π (s′ )]. This
is the expected sum of discounted rewards for starting in state s′ , where s′
is distributed according Psπ(s) , which is the distribution over where we will
end up after taking the first action π(s) in the MDP from state s. Thus, the
second term above gives the expected sum of discounted rewards obtained
after the first step in the MDP.
Bellman’s equations can be used to efficiently solve for V π . Specifically,
in a finite-state MDP (|S| < ∞), we can write down one such equation for
V π (s) for every state s. This gives us a set of |S| linear equations in |S|
variables (the unknown V π (s)’s, one for each state), which can be efficiently
solved for the V π (s)’s.
1
This notation in which we condition on π isn’t technically correct because π isn’t a
random variable, but this is quite standard in the literature.
4

We also define the optimal value function according to


V ∗ (s) = max V π (s). (1)
π

In other words, this is the best possible expected sum of discounted rewards
that can be attained using any policy. There is also a version of Bellman’s
equations for the optimal value function:
X
V ∗ (s) = R(s) + max γ Psa (s′ )V ∗ (s′ ). (2)
a∈A
s′ ∈S

The first term above is the immediate reward as before. The second term
is the maximum over all actions a of the expected future sum of discounted
rewards we’ll get upon after action a. You should make sure you understand
this equation and see why it makes sense. (A derivation for equation (2) and
the equation (3) below are given in Appendix B)
We also define a policy π ∗ : S 7→ A as follows:
X
π ∗ (s) = arg max Psa (s′ )V ∗ (s′ ). (3)
a∈A
s′ ∈S

Note that π ∗ (s) gives the action a that attains the maximum in the “max”
in Equation (2).
It is a fact that for every state s and every policy π, we have

V ∗ (s) = V π (s) ≥ V π (s).

The first equality says that the V π , the value function for π ∗ , is equal to the
optimal value function V ∗ for every state s. Further, the inequality above
says that π ∗ ’s value is at least a large as the value of any other other policy.
In other words, π ∗ as defined in Equation (3) is the optimal policy.
Note that π ∗ has the interesting property that it is the optimal policy for
all states s. Specifically, it is not the case that if we were starting in some
state s then there’d be some optimal policy for that state, and if we were
starting in some other state s′ then there’d be some other policy that’s opti-
mal policy for s′ . The same policy π ∗ attains the maximum in Equation (1)
for all states s. This means that we can use the same policy π ∗ no matter
what the initial state of our MDP is.

2 Value iteration and policy iteration


We now describe two efficient algorithms for solving finite-state MDPs. For
now, we will consider only MDPs with finite state and action spaces (|S| <
5

∞, |A| < ∞). In this section, we will also assume that we know the state
transition probabilities {Psa } and the reward function R.
The first algorithm, value iteration, is as follows:

Algorithm 1 Value Iteration


1: For each state s, initialize V (s) := 0.
2: for until convergence do
3: For every state, update
X
V (s) := R(s) + max γ Psa (s′ )V (s′ ). (4)
a∈A
s′

This algorithm can be thought of as repeatedly trying to update the


estimated value function using Bellman Equations (2).
There are two possible ways of performing the updates in the inner loop
of the algorithm. In the first, we can first compute the new values for V (s) for
every state s, and then overwrite all the old values with the new values. This
is called a synchronous update. In this case, the algorithm can be viewed as
implementing a “Bellman backup operator” that takes a current estimate of
the value function, and maps it to a new estimate. (See homework problem
for details.) Alternatively, we can also perform asynchronous updates.
Here, we would loop over the states (in some order), updating the values one
at a time.
Under either synchronous or asynchronous updates, it can be shown that
value iteration will cause V to converge to V ∗ . Having found V ∗ , we can
then use Equation (3) to find the optimal policy.
Apart from value iteration, there is a second standard algorithm for find-
ing an optimal policy for an MDP. The policy iteration algorithm proceeds
as follows:

Algorithm 2 Policy Iteration


1: Initialize π randomly.
2: for until convergence do
3: Let V := V π . ▷ typically by linear system solver
4: For each state s, let
X
π(s) := arg max Psa (s′ )V (s′ ).
a∈A
s′
6

Thus, the inner-loop repeatedly computes the value function for the cur-
rent policy, and then updates the policy using the current value function.
(The policy π found in step (b) is also called the policy that is greedy with
respect to V .) Note that step (a) can be done via solving Bellman’s equa-
tions as described earlier, which in the case of a fixed policy, is just a set of
|S| linear equations in |S| variables.
After at most a finite number of iterations of this algorithm, V will con-
verge to V ∗ , and π will converge to π ∗ .2
Both value iteration and policy iteration are standard algorithms for solv-
ing MDPs, and there isn’t currently universal agreement over which algo-
rithm is better. For small MDPs, policy iteration is often very fats and con-
verges with very few iterations. However, for MDPs with large state spaces,
solving for V π explicitly would involve solving a large system of linear equa-
tions, and could be difficult (and note that one has to solve the linear system
multiple times in policy iteration). In these problems, value iteration may be
preferred. For this reason, in practice value iteration seems to be used more
often than policy iteration. For some more discussions on the comparison
and connection of value iteration and policy iteration, please see Section A.

3 Learning a model for an MDP


So far, we have discussed MDPs and algorithms for MDPs assuming that the
state transition probabilities and rewards are known. In many realistic prob-
lems, we are not given state transition probabilities and rewards explicitly,
but must instead estimate them from data. (Usually, S, A and γ are known.)
For example, suppose that, for the inverted pendulum problem (see prob-
lem set 4), we had a number of trials in the MDP, that proceeded as follows:
(1) (1) (1) (1)
(1) a (1) a (1) a (1) a
s0 −→
0
s1 −→
1
s2 −→
2
s3 −→
3
...
(2) (2) (2) (2)
(2) a (2) a (2) a (2) a
s0 −→
0
s1 −→
1
s2 −→
2
s3 −→
3
...
...
Note that value iteration cannot reach the exact V ∗ in a finite number of iterations,
2

whereas policy iteration with an exact linear system solver, can. This is because when
the actions space and policy space are discrete and finite, and once the policy reaches the
optimal policy in policy iteration, then it will not change at all. On the other hand, even
though value iteration will converge to the V ∗ , but there is always some non-zero error in
the learned value function.
7

(j) (j)
Here, si is the state we were at time i of trial j, and ai is the cor-
responding action that was taken from that state. In practice, each of the
trials above might be run until the MDP terminates (such as if the pole falls
over in the inverted pendulum problem), or it might be run for some large
but finite number of timesteps.
Given this “experience” in the MDP consisting of a number of trials,
we can then easily derive the maximum likelihood estimates for the state
transition probabilities:

#times took we action a in state s and got to s′


Psa (s′ ) = (5)
#times we took action a in state s

Or, if the ratio above is “0/0”—corresponding to the case of never having


taken action a in state s before—the we might simply estimate Psa (s′ ) to be
1/|S|. (I.e., estimate Psa to be the uniform distribution over all states.)
Note that, if we gain more experience (observe more trials) in the MDP,
there is an efficient way to update our estimated state transition probabilities
using the new experience. Specifically, if we keep around the counts for both
the numerator and denominator terms of (5), then as we observe more trials,
we can simply keep accumulating those counts. Computing the ratio of these
counts then given our estimate of Psa .
Using a similar procedure, if R is unknown, we can also pick our estimate
of the expected immediate reward R(s) in state s to be the average reward
observed in state s.
Having learned a model for the MDP, we can then use either value it-
eration or policy iteration to solve the MDP using the estimated transition
probabilities and rewards. For example, putting together model learning and
value iteration, here is one possible algorithm for learning in an MDP with
unknown state transition probabilities:

1. Initialize π randomly.

2. Repeat {

(a) Execute π in the MDP for some number of trials.


(b) Using the accumulated experience in the MDP, update our esti-
mates for Psa (and R, if applicable).
(c) Apply value iteration with the estimated state transition probabil-
ities and rewards to get a new estimated value function V .
(d) Update π to be the greedy policy with respect to V .
8

We note that, for this particular algorithm, there is one simple optimiza-
tion that can make it run much more quickly. Specifically, in the inner loop
of the algorithm where we apply value iteration, if instead of initializing value
iteration with V = 0, we initialize it with the solution found during the pre-
vious iteration of our algorithm, then that will provide value iteration with
a much better initial starting point and make it converge more quickly.

4 Continuous state MDPs


So far, we’ve focused our attention on MDPs with a finite number of states.
We now discuss algorithms for MDPs that may have an infinite number of
states. For example, for a car, we might represent the state as (x, y, θ, ẋ, ẏ, θ̇),
comprising its position (x, y); orientation θ; velocity in the x and y directions
ẋ and ẏ; and angular velocity θ̇. Hence, S = R6 is an infinite set of states,
because there is an infinite number of possible positions and orientations
for the car.3 Similarly, the inverted pendulum you saw in PS4 has states
(x, θ, ẋ, θ̇), where θ is the angle of the pole. And, a helicopter flying in 3d
space has states of the form (x, y, z, ϕ, θ, ψ, ẋ, ẏ, ż, ϕ̇, θ̇, ψ̇), where here the roll
ϕ, pitch θ, and yaw ψ angles specify the 3d orientation of the helicopter.
In this section, we will consider settings where the state space is S = Rd ,
and describe ways for solving such MDPs.

4.1 Discretization
Perhaps the simplest way to solve a continuous-state MDP is to discretize
the state space, and then to use an algorithm like value iteration or policy
iteration, as described previously.
For example, if we have 2d states (s1 , s2 ), we can use a grid to discretize
the state space:
Technically, θ is an orientation and so the range of θ is better written θ ∈ [−π, π) than
3

θ ∈ R; but for our purposes, this distinction is not important.


9

Here, each grid cell represents a separate discrete state s̄. We can then ap-
proximate the continuous-state MDP via a discrete-state one (S̄, A, {Ps̄a }, γ, R),
where S̄ is the set of discrete states, {Ps̄a } are our state transition probabil-
ities over the discrete states, and so on. We can then use value iteration or
policy iteration to solve for the V ∗ (s̄) and π ∗ (s̄) in the discrete state MDP
(S̄, A, {Ps̄a }, γ, R). When our actual system is in some continuous-valued
state s ∈ S and we need to pick an action to execute, we compute the
corresponding discretized state s̄, and execute action π ∗ (s̄).
This discretization approach can work well for many problems. However,
there are two downsides. First, it uses a fairly naive representation for V ∗
(and π ∗ ). Specifically, it assumes that the value function is takes a constant
value over each of the discretization intervals (i.e., that the value function is
piecewise constant in each of the gridcells).
To better understand the limitations of such a representation, consider a
supervised learning problem of fitting a function to this dataset:
5.5

4.5

3.5
y

2.5

1.5
1 2 3 4 5 6 7 8
x
10

Clearly, linear regression would do fine on this problem. However, if we


instead discretize the x-axis, and then use a representation that is piecewise
constant in each of the discretization intervals, then our fit to the data would
look like this:
5.5

4.5

3.5
y

2.5

1.5
1 2 3 4 5 6 7 8
x

This piecewise constant representation just isn’t a good representation for


many smooth functions. It results in little smoothing over the inputs, and no
generalization over the different grid cells. Using this sort of representation,
we would also need a very fine discretization (very small grid cells) to get a
good approximation.
A second downside of this representation is called the curse of dimen-
sionality. Suppose S = Rd , and we discretize each of the d dimensions of the
state into k values. Then the total number of discrete states we have is k d .
This grows exponentially quickly in the dimension of the state space d, and
thus does not scale well to large problems. For example, with a 10d state, if
we discretize each state variable into 100 values, we would have 10010 = 1020
discrete states, which is far too many to represent even on a modern desktop
computer.
As a rule of thumb, discretization usually works extremely well for 1d
and 2d problems (and has the advantage of being simple and quick to im-
plement). Perhaps with a little bit of cleverness and some care in choosing
the discretization method, it often works well for problems with up to 4d
states. If you’re extremely clever, and somewhat lucky, you may even get it
to work for some 6d problems. But it very rarely works for problems any
higher dimensional than that.
11

4.2 Value function approximation


We now describe an alternative method for finding policies in continuous-
state MDPs, in which we approximate V ∗ directly, without resorting to dis-
cretization. This approach, called value function approximation, has been
successfully applied to many RL problems.

4.2.1 Using a model or simulator


To develop a value function approximation algorithm, we will assume that
we have a model, or simulator, for the MDP. Informally, a simulator is
a black-box that takes as input any (continuous-valued) state st and action
at , and outputs a next-state st+1 sampled according to the state transition
probabilities Pst at :

There are several ways that one can get such a model. One is to use
physics simulation. For example, the simulator for the inverted pendulum
in PS4 was obtained by using the laws of physics to calculate what position
and orientation the cart/pole will be in at time t + 1, given the current state
at time t and the action a taken, assuming that we know all the parameters
of the system such as the length of the pole, the mass of the pole, and so
on. Alternatively, one can also use an off-the-shelf physics simulation software
package which takes as input a complete physical description of a mechanical
system, the current state st and action at , and computes the state st+1 of the
system a small fraction of a second into the future.4
An alternative way to get a model is to learn one from data collected in
the MDP. For example, suppose we execute n trials in which we repeatedly
take actions in an MDP, each trial for T timesteps. This can be done picking
actions at random, executing some specific policy, or via some other way of
4
Open Dynamics Engine (http://www.ode.com) is one example of a free/open-source
physics simulator that can be used to simulate systems like the inverted pendulum, and
that has been a reasonably popular choice among RL researchers.
12

choosing actions. We would then observe n state sequences like the following:
(1) (1) (1) (1)
(1) a0 (1) a1 (1) a2 aT −1 (1)
s0 −→ s1 −→ s2 −→ · · · −→ sT
(2) (2) (2) (2)
(2) a (2) a (2) a aT −1 (2)
s0 −→ 0
s1 −→
1
s2 −→
2
· · · −→ sT
···
(n) (n) (n) (n)
(n) a (n) a (n) a aT −1 (n)
s0 −→
0
s1 −→
1
s2 −→
2
· · · −→ sT
We can then apply a learning algorithm to predict st+1 as a function of st
and at .
For example, one may choose to learn a linear model of the form
st+1 = Ast + Bat , (6)
using an algorithm similar to linear regression. Here, the parameters of the
model are the matrices A and B, and we can estimate them using the data
collected from our n trials, by picking
X
n X
T −1   2
(i) (i) (i)
arg min st+1 − Ast + Bat .
A,B 2
i=1 t=0

We could also potentially use other loss functions for learning the model.
For example, it has been found in recent work [?] that using k · k2 norm
(without the square) may be helpful in certain cases.
Having learned A and B, one option is to build a deterministic model,
in which given an input st and at , the output st+1 is exactly determined.
Specifically, we always compute st+1 according to Equation (6). Alterna-
tively, we may also build a stochastic model, in which st+1 is a random
function of the inputs, by modeling it as
st+1 = Ast + Bat + ϵt ,
where here ϵt is a noise term, usually modeled as ϵt ∼ N (0, Σ). (The covari-
ance matrix Σ can also be estimated from data in a straightforward way.)
Here, we’ve written the next-state st+1 as a linear function of the current
state and action; but of course, non-linear functions are also possible. Specif-
ically, one can learn a model st+1 = Aϕs (st ) + Bϕa (at ), where ϕs and ϕa are
some non-linear feature mappings of the states and actions. Alternatively,
one can also use non-linear learning algorithms, such as locally weighted lin-
ear regression, to learn to estimate st+1 as a function of st and at . These
approaches can also be used to build either deterministic or stochastic sim-
ulators of an MDP.
13

4.2.2 Fitted value iteration


We now describe the fitted value iteration algorithm for approximating
the value function of a continuous state MDP. In the sequel, we will assume
that the problem has a continuous state space S = Rd , but that the action
space A is small and discrete.5
Recall that in value iteration, we would like to perform the update
Z
V (s) := R(s) + γ max Psa (s′ )V (s′ )ds′ (7)
a s′
= R(s) + γ max Es′ ∼Psa [V (s′ )] (8)
a

(In Section 2, we had written


P the value iteration update with a summation
V (s) := R(s) + γ maxa s′ Psa (s′ )V (s′ ) rather than an integral over states;
the new notation reflects that we are now working in continuous states rather
than discrete states.)
The main idea of fitted value iteration is that we are going to approxi-
mately carry out this step, over a finite sample of states s(1) , . . . , s(n) . Specif-
ically, we will use a supervised learning algorithm—linear regression in our
description below—to approximate the value function as a linear or non-
linear function of the states:

V (s) = θT ϕ(s).

Here, ϕ is some appropriate feature mapping of the states.


For each state s in our finite sample of n states, fitted value itera-
tion will first compute a quantity y (i) , which will be our approximation
to R(s) + γ maxa Es′ ∼Psa [V (s′ )] (the right hand side of Equation 8). Then,
it will apply a supervised learning algorithm to try to get V (s) close to
R(s) + γ maxa Es′ ∼Psa [V (s′ )] (or, in other words, to try to get V (s) close to
y (i) ).
In detail, the algorithm is as follows:

1. Randomly sample n states s(1) , s(2) , . . . s(n) ∈ S.

2. Initialize θ := 0.
5
In practice, most MDPs have much smaller action spaces than state spaces. E.g., a car
has a 6d state space, and a 2d action space (steering and velocity controls); the inverted
pendulum has a 4d state space, and a 1d action space; a helicopter has a 12d state space,
and a 4d action space. So, discretizing this set of actions is usually less of a problem than
discretizing the state space would have been.
14

3. Repeat {

For i = 1, . . . , n {
For each action a ∈ A {
Sample s′1 , . . . , s′k ∼ Ps(i) a (using a model of the MDP).
P
Set q(a) = k1 kj=1 R(s(i) ) + γV (s′j )
// Hence, q(a) is an estimate of R(s(i) )+γEs′ ∼Ps(i) a [V (s′ )].
}
Set y (i) = maxa q(a).
// Hence, y (i) is an estimate of R(s(i) )+γ maxa Es′ ∼Ps(i) a [V (s′ )].
}
// In the original value iteration algorithm (over discrete states)
// we updated the value function according to V (s(i) ) := y (i) .
// In this algorithm, we want V (s(i) ) ≈ y (i) , which we’ll achieve
// using supervised learning (linear regression).
P 2
Set θ := arg minθ 12 ni=1 θT ϕ(s(i) ) − y (i)

Above, we had written out fitted value iteration using linear regression
as the algorithm to try to make V (s(i) ) close to y (i) . That step of the algo-
rithm is completely analogous to a standard supervised learning (regression)
problem in which we have a training set (x(1) , y (1) ), (x(2) , y (2) ), . . . , (x(n) , y (n) ),
and want to learn a function mapping from x to y; the only difference is that
here s plays the role of x. Even though our description above used linear re-
gression, clearly other regression algorithms (such as locally weighted linear
regression) can also be used.
Unlike value iteration over a discrete set of states, fitted value iteration
cannot be proved to always to converge. However, in practice, it often does
converge (or approximately converge), and works well for many problems.
Note also that if we are using a deterministic simulator/model of the MDP,
then fitted value iteration can be simplified by setting k = 1 in the algorithm.
This is because the expectation in Equation (8) becomes an expectation over
a deterministic distribution, and so a single example is sufficient to exactly
compute that expectation. Otherwise, in the algorithm above, we had to
draw k samples, and average to try to approximate that expectation (see the
definition of q(a), in the algorithm pseudo-code).
15

Finally, fitted value iteration outputs V , which is an approximation to



V . This implicitly defines our policy. Specifically, when our system is in
some state s, and we need to choose an action, we would like to choose the
action
arg max Es′ ∼Psa [V (s′ )] (9)
a

The process for computing/approximating this is similar to the inner-loop of


fitted value iteration, where for each action, we sample s′1 , . . . , s′k ∼ Psa to
approximate the expectation. (And again, if the simulator is deterministic,
we can set k = 1.)
In practice, there are often other ways to approximate this step as well.
For example, one very common case is if the simulator is of the form st+1 =
f (st , at ) + ϵt , where f is some deterministic function of the states (such as
f (st , at ) = Ast + Bat ), and ϵ is zero-mean Gaussian noise. In this case, we
can pick the action given by

arg max V (f (s, a)).


a

In other words, here we are just setting ϵt = 0 (i.e., ignoring the noise in
the simulator), and setting k = 1. Equivalent, this can be derived from
Equation (9) using the approximation

Es′ [V (s′ )] ≈ V (Es′ [s′ ]) (10)


= V (f (s, a)), (11)

where here the expectation is over the random s′ ∼ Psa . So long as the noise
terms ϵt are small, this will usually be a reasonable approximation.
However, for problems that don’t lend themselves to such approximations,
having to sample k|A| states using the model, in order to approximate the
expectation above, can be computationally expensive.

A Connections between Policy and Value It-


eration (Optional)
In the policy iteration, line 3 of Algorithm 2, we typically use linear system
solver to compute V π . Alternatively, one can also the iterative Bellman
updates, similarly to the value iteration, to evaluate V π , as in the Procedure
VE(·) in Line 1 of Algorithm 3 below. Here if we take option 1 in Line 2 of
the Procedure VE, then the difference between the Procedure VE from the
16

Algorithm 3 Variant of Policy Iteration


1: procedure VE(π, k) ▷ To evaluate V π
2: Option 1: initialize V (s) := 0; Option 2: Initialize from the current
V in the main algorithm.
3: for i = 0 to k − 1 do
4: For every state s, update
X
V (s) := R(s) + γ Psπ(s) (s′ )V (s′ ). (12)
s′

return V
Require: hyperparameter k.
5: Initialize π randomly.
6: for until convergence do
7: Let V = VE(π, k).
8: For each state s, let
X
π(s) := arg max Psa (s′ )V (s′ ). (13)
a∈A
s′

value iteration (Algorithm 1) is that on line 4, the procedure is using the


action from π instead of the greedy action.
Using the Procedure VE, we can build Algorithm 3, which is a variant of
policy iteration that serves an intermediate algorithm that connects policy
iteration and value iteration. Here we are going to use option 2 in VE to
maximize the re-use of knowledge learned before. One can verify indeed that
if we take k = 1 and use option 2 in Line 2 in Algorithm 3, then Algorithm 3
is semantically equivalent to value iteration (Algorithm 1). In other words,
both Algorithm 3 and value iteration interleave the updates in (13) and (12).
Algorithm 3 alternate between k steps of update (12) and one step of (13),
whereas value iteration alternates between 1 steps of update (12) and one
step of (13). Therefore generally Algorithm 3 should not be faster than value
iteration, because assuming that update (12) and (13) are equally useful and
time-consuming, then the optimal balance of the update frequencies could
be just k = 1 or k ≈ 1.
On the other hand, if k steps of update (12) can be done much faster than
k times a single step of (12), then taking additional steps of equation (12)
in group might be useful. This is what policy iteration is leveraging — the
linear system solver can give us the result of Procedure VE with k = ∞ much
17

faster than using the Procedure VE for a large k. On the flip side, when such
a speeding-up effect no longer exists, e.g.„ when the state space is large and
linear system solver is also not fast, then value iteration is more preferable.

B Derivations for Bellman Equations


Here we give a derivation for the Bellman Equation given in Section 1. Recall
that the value function for a policy π is defined as

V π (s) = E R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + · · · s0 = s, π].

Therefore, we have

V π (s) = E R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + · · · s0 = s, π]
= R(s) + γE [R(s1 ) + γR(s2 ) + · · · ]
= R(s) + γEs1 ∼Psπ(s) [R(s1 ) + γR(s2 ) + · · · ]
= R(s) + γEs1 ∼Psπ(s) [V π (s1 )] ,

Now we derive the Bellman Equation for the optimal value function.

V ∗ (s) = max V π (s)


π
!
X
= max R(s) + γ Psπ(s) (s′ )V π (s′ )
π
s′ ∈S
!
X
= R(s) + max γ Psπ(s) (s′ )V π (s′ )
π
s′ ∈S
X
= R(s) + max γ Psa (s′ ) max V π (s′ )
a π
s′ ∈S
X
= R(s) + max γ Psa (s′ )V ∗ (s′ ).
a∈A
s′ ∈S

Here the fourth equality is because that for MDP, the optimal action at a
later state is independent of actions at previous states, hence the optimal
policy at the current state can be decomposed to an action followed by the
optimal policy at the new state.

You might also like