Deep Reinforcement Learning (Spring 2022)
Assignment 1
March 16, 2022
1 MDPs [15 pts]
A boy is being chased around the school yard by bullies and must choose whether to Fight or
Run.
• There are three states:
– Ok (O), where he is fine for the moment.
– Danger (D), where the bullies are right on his heels.
– Caught (C), where the bullies catch up with him and administer noogies.
• He begins in state O 75% of the time.
• He begins in state D 25% of the time.
The graph of the MDP is given here:
1
1. Fill out the table with the results of value iteration with a discount factor γ = 0.9 [9 pts]:
k V k (O) V k (D) V k (C)
1 2 -1 -5
2
3
2. At k = 2 with γ = 0.9 what policy would you select? Is it necessarily true that this is
the optimal policy? At k = 3 what policy would you select? Is it necessarily true that
this is the optimal policy? [6 pts]
2 Value Iteration Theorem [35 pts]
In this problem, we will deal with contractions and fixed points and prove an important result
from the value iteration theorem. From lecture, we know that the Bellman backup operator
B given below is a contraction with the fixed point as V ∗ , the optimal value function of the
MDP. The symbols have their usual meanings. γ is the discount factor and 0 ≤ γ < 1. In all
parts, ||v|| is the infinity norm of the vector.
X
(BV )(s) = max[R(s, a) + γ p(s0 |s, a)V (s0 )] (1)
a
s0 ∈S
We also saw the contraction operator Bπ which is the Bellman backup operator for a
particular policy given below:
X
(Bπ V )(s) = Ea∼π [R(s, a) + γ p(s0 |s, a)V (s0 )] (2)
s0 ∈S
(a) Recall that ||BV − BV 0 || ≤ γ||V − V 0 || for two random value functions V and V 0 . Prove
that Bπ is also a contraction mapping: ||Bπ V − Bπ V 0 || ≤ γ||V − V 0 ||. [5 pts]
(b) Prove that the fixed point for Bπ is unique. What is the fixed point of Bπ ? [5 pts]
In value iteration, we repeatedly apply the Bellman backup operator B to improve our value
function. At the end of value iteration, we can recover a greedy policy π from the value function
using the equation below:
X
π(s) = arg max[r(s, a) + γ p(s0 |s, a)V (s0 )] (3)
a
s0 ∈S
Suppose we run value iteration for a finite number of steps to obtain a value function V (V
has not necessarily converged to V ∗ ). Say now that we evaluate our policy π obtained using
the formula above to get V π . Note that here and for the rest of Q2, π refers to the
greedy policy.
(c) Is V π always the same as V ? Justify your answer. [5 pts]
2
In lecture, we learned that running value iteration until a certain tolerance can bring us close
to recovering the optimal value function. Let Vn and Vn+1 be the outputs of value iteration at
the nth and n + 1th iterations respectively. Let > 0 and consider the point in value iteration
such that ||Vn+1 − Vn || < (1−γ)
2γ
. Let π be the greedy policy given the value function Vn+1 .
You will now prove that this policy π is -optimal. This result justifies why halting value
iteration when the difference between success iterations is sufficiently small, ensures the decision
policy obtained by being greedy with respect to the value function, is near-optimal.
Precisely if
(1 − γ)
||Vn+1 − Vn || < (4)
2γ
then,
||V π − V ∗ || ≤ (5)
(d) When π is the greedy policy, what is the relationship between B and Bπ ? [2 pts]
(e) Prove that ||V π − Vn+1 || ≤ /2.
Hint: Introduce an in-between term and leverage the triangle inequality. [6 pts]
(f) Prove ||B k V − B k V 0 || ≤ γ k ||V − V 0 || [3 pts]
(g) Prove that ||V ∗ − Vn+1 || ≤ /2. [7pts]
Hints: Note that ||V ∗ − Vn+1 || = ||V ∗ + Vn+2 − Vn+2 − Vn+1 || and you can repeatedly
apply this trick. It may also be useful to leverage part (f) and recall that V ∗ is the fixed
point of the contraction B.
(h) Use the results from parts (e) and (g), to show that ||V π − V ∗ || ≤ [2 pts]
3 Simulation Lemma and Model-based Learning [25 pts]
Consider two finite horizon MDPs M := {S, A, H, rh , Ph , s0 } and M̃ := {S, A, H, rh , P̃h , s0 }.
Consider an arbitrary stochastic stationary policy π : S → ∆(A). In finite horizon MDPs,
state value is also a function of steps the agent have taken so far, and we denote Vhπ (s) as the
expected value following π starting from s at step h. We define V π = V0π (s0 ) as the expected
total reward of π under M, and Ṽ π = Ṽ0π (s0 ) as the expected total reward of π under M̃ . We
denote Pπh as the state-action distribution of π under M at step h.
(a) Prove the following equality:
H−1
X h i
π π π 0 π 0
V − Ṽ = Es,a∼Pπ
h
Es0 ∼Ph (·|s,a) Ṽh+1 (s ) − Es0 ∼P̃h (·|s,a) Ṽh+1 (s )
h=0
π
Here Ṽh+1 (s0 ) is the expected total reward of π under M̃ from time h + 1.[10 Points]
(b) Imagine the following situation: we have M as the true real MDP, and we have M̃ as
some learned model that supposes to approximate the real MDP M. Given M̃, the
natural thing to do is to compute the optimal policy under M̃, i.e.,
π̃ ? = arg max Ṽ π ,
π∈Π
3
where Π ⊂ {π : S → ∆(A)} is a pre-defined policy class. Let us also denote the true
optimal policy π ? under the real model as:
π ? = arg max V π .
π∈Π
A natural question is that what is the performance of π̃ ? under the real model M,
compared to π ? under M? To answer this question, let’s assume r(s, a) ∈ [0, 1] for all
s, a and prove the following inequality:
H−1
Xh i
π? π̃ ?
V −V ≤H E
s,a∼Pπ
h
? kP̃h (·|s, a) − Ph (·|s, a)k1 + E
s,a∼Pπ̃
h
? kP̃h (·|s, a) − Ph (·|s, a)k1
h=0
R
(Hint: |f (x)g(x)|dx ≤ kf (·)k1 kg(·)k∞ ) [15 Points]
4 Frozen Lake MDP [25 Points]
Now you will implement value iteration and policy iteration for the Frozen Lake environment
from OpenAI Gym. We have provided custom versions of this environment in the starter code.
(a) (coding) Read through vi_and_pi.py and implement policy_evaluation,
policy_improvement and policy_iteration. The stopping tolerance (defined as
maxs |Vold (s) − Vnew (s)|) is tol = 10−3 . Use γ = 0.9. Return the optimal value function
and the optimal policy. [10 pts]
(b) (coding) Implement value_iteration in vi_and_pi.py. The stopping tolerance is
tol = 10−3 . Use γ = 0.9. Return the optimal value function and the optimal policy. [10
pts]
(c) (written) Run both methods on the Deterministic-4x4-FrozenLake-v0 and
Stochastic-4x4-FrozenLake-v0 environments. In the second environment, the dynam-
ics of the world are stochastic. How does stochasticity affect the number of iterations
required, and the resulting policy? [5 pts]