Markov Decision Processes
and
Exact Solution Methods:
Value Iteration
Policy Iteration
Linear Programming
Pieter Abbeel
UC Berkeley EECS
Markov Decision Process
Assumption: agent gets to observe the state
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]
Markov Decision Process (S, A, T, R, H)
Given:
S: set of states
A: set of actions
T: S × A × S × {0, 1, …, H} → [0, 1],
T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
R: S × A × S × {0, 1, …, H} → ℝ,
R_t(s, a, s') = reward for (s_{t+1} = s', s_t = s, a_t = a)
H: horizon over which the agent will act
Goal:
Find π: S × {0, 1, …, H} → A that maximizes the expected sum of rewards, i.e.,
max_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ]
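As a concrete reference, here is a minimal sketch (my own illustration, not from the slides) that encodes a small finite MDP with NumPy arrays. The names T and R follow the definitions above; a time-independent transition/reward model is assumed for simplicity.

```python
import numpy as np

# Minimal finite-MDP container, assuming time-independent T and R for simplicity.
n_states, n_actions = 3, 2
H = 10  # horizon

# T[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a); each (s, a) row over s' sums to 1.
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)

# R[s, a, s'] = reward collected on the transition (s, a, s').
R = np.zeros((n_states, n_actions, n_states))
R[:, :, n_states - 1] = 1.0  # e.g., reward for landing in the last state

assert np.allclose(T.sum(axis=2), 1.0)  # each (s, a) defines a distribution over s'
```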
Examples
MDP (S, A, T, R, H):
Cleaning robot
Walking robot
Pole balancing
Games: tetris, backgammon
Server management
Shortest path problems
Model for animals, people
goal: maximize the expected sum of rewards, max_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ]
Canonical Example: Grid World
The agent lives in a grid
Walls block the agent's path
The agent's actions do not always go as planned:
80% of the time, the action North takes the agent North (if there is no wall there)
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have been taken, the agent stays put
Big rewards come at the end
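A minimal sketch (an illustration, not part of the slides) of the noisy transition rule just described: the intended direction with probability 0.8, each perpendicular direction with probability 0.1, and the agent stays put if the move leaves the grid or hits a wall. The grid layout, wall location, and helper names are assumptions for illustration.

```python
# Hypothetical 4x3 grid; coordinates are (row, col); one wall cell.
GRID_ROWS, GRID_COLS = 3, 4
WALLS = {(1, 1)}

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step_distribution(state, action, noise=0.2):
    """Return {next_state: probability} for the noisy gridworld dynamics."""
    left, right = PERPENDICULAR[action]
    outcomes = [(action, 1.0 - noise), (left, noise / 2), (right, noise / 2)]
    dist = {}
    for direction, prob in outcomes:
        dr, dc = MOVES[direction]
        nxt = (state[0] + dr, state[1] + dc)
        # Stay put if the move leaves the grid or hits a wall.
        if not (0 <= nxt[0] < GRID_ROWS and 0 <= nxt[1] < GRID_COLS) or nxt in WALLS:
            nxt = state
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist

print(step_distribution((2, 0), "N"))  # {(1, 0): 0.8, (2, 0): 0.1, (2, 1): 0.1}
```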
Solving MDPs
In an MDP, we want an optimal policy π*: S × {0, 1, …, H} → A
A policy π gives an action for each state for each time
[Figures: the gridworld policy shown for each time step t = 5 = H, 4, 3, 2, 1, 0]
An optimal policy maximizes expected sum of rewards
Contrast: In the deterministic setting, we want an optimal plan, or sequence of actions, from start to a goal
Outline
Optimal Control
=
given an MDP (S, A, T, R, γ, H)
find the optimal policy π*
Exact Methods:
Value Iteration
Policy Iteration
Linear Programming
For now: discrete state-action spaces as they are simpler to get the main
concepts across. Will consider continuous spaces later!
Value Iteration
Algorithm:
Start with V*_0(s) = 0 for all s.
For i = 1, …, H:
Given V*_{i-1}, calculate for all states s ∈ S:
V*_i(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*_{i-1}(s') ]
This is called a value update or Bellman update/back-up
V*_i(s) = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps
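A minimal NumPy sketch of these Bellman back-ups, under the same assumed conventions as the earlier container (time-independent arrays T[s, a, s'] and R[s, a, s']); this is an illustration, not the course's reference implementation.

```python
import numpy as np

def value_iteration(T, R, gamma, H):
    """Finite-horizon value iteration: returns V*_H and the greedy action per step."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)                       # V*_0(s) = 0 for all s
    policies = []
    for i in range(1, H + 1):
        # Q[s, a] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V_{i-1}(s'))
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        policies.append(Q.argmax(axis=1))        # greedy action with i steps to go
        V = Q.max(axis=1)                        # Bellman back-up
    return V, policies
```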
Value Iteration in Gridworld
noise = 0.2, γ = 0.9, two terminal states with R = +1 and -1
[Figures: value estimates shown after each successive iteration of value iteration]
Exercise 1: Effect of discount, noise
Match each desired behavior with the (γ, noise) setting that produces it:
(a) Prefer the close exit (+1), risking the cliff (-10)
(b) Prefer the close exit (+1), but avoiding the cliff (-10)
(c) Prefer the distant exit (+10), risking the cliff (-10)
(d) Prefer the distant exit (+10), avoiding the cliff (-10)
(1) γ = 0.1, noise = 0.5
(2) γ = 0.99, noise = 0
(3) γ = 0.99, noise = 0.5
(4) γ = 0.1, noise = 0
Exercise 1 Solution
(a) Prefer close exit (+1), risking the cliff (-10) --- γ = 0.1, noise = 0
(b) Prefer close exit (+1), avoiding the cliff (-10) --- γ = 0.1, noise = 0.5
(c) Prefer distant exit (+10), risking the cliff (-10) --- γ = 0.99, noise = 0
(d) Prefer distant exit (+10), avoiding the cliff (-10) --- γ = 0.99, noise = 0.5
Value Iteration Convergence
Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite-horizon problem, which satisfies the Bellman equations:
V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
Now we know how to act for infinite horizon with discounted rewards!
Run value iteration till convergence.
This produces V*, which in turn tells us how to act, namely by following:
π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
Note: the infinite-horizon optimal policy is stationary, i.e., the optimal action at a state s is the same action at all times. (Efficient to store!)
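A small sketch (same assumed NumPy arrays as before) of infinite-horizon value iteration run to a tolerance, followed by extraction of the stationary greedy policy π* from V*.

```python
import numpy as np

def value_iteration_inf(T, R, gamma, tol=1e-8):
    """Run Bellman back-ups until convergence; return V* and the greedy policy."""
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # pi*(s) = argmax_a Q(s, a)
        V = V_new
```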
Convergence and Contractions
Define the max-norm: ||U|| = max_s |U(s)|
Theorem: For any two approximations U and V:
||U_{i+1} - V_{i+1}|| ≤ γ ||U_i - V_i||
I.e., any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true value function V*, and value iteration converges to a unique, stable, optimal solution.
Theorem:
||V_{i+1} - V_i|| < ε  ⟹  ||V_{i+1} - V*|| < 2εγ / (1 - γ)
I.e., once the change in our approximation is small, it must also be close to correct.
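For the first theorem, a short derivation (my own filling-in of the standard contraction argument, using |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|) shows why one Bellman back-up B shrinks the max-norm distance by a factor of γ:

```latex
\begin{align*}
\left|(BU)(s) - (BV)(s)\right|
  &\le \max_a \Big| \sum_{s'} T(s,a,s')\big[R(s,a,s') + \gamma U(s')\big]
       - \sum_{s'} T(s,a,s')\big[R(s,a,s') + \gamma V(s')\big] \Big| \\
  &= \gamma \max_a \Big| \sum_{s'} T(s,a,s')\,\big(U(s') - V(s')\big) \Big|
   \;\le\; \gamma \,\|U - V\|_\infty .
\end{align*}
```

Taking the max over s gives ||BU − BV||_∞ ≤ γ ||U − V||_∞.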
Outline
Optimal Control
=
given an MDP (S, A, T, R, γ, H)
find the optimal policy π*
Exact Methods:
Value Iteration
Policy Iteration
Linear Programming
For now: discrete state-action spaces as they are simpler to get the main
concepts across. Will consider continuous spaces later!
Policy Evaluation
Recall value iteration iterates:
V*_i(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*_{i-1}(s') ]
Policy evaluation (for a fixed policy π):
V^π_i(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_{i-1}(s') ]
At convergence:
V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
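A minimal iterative policy-evaluation sketch under the same assumed array conventions; policy is an array giving π(s) for each state.

```python
import numpy as np

def policy_evaluation(T, R, gamma, policy, tol=1e-8):
    """Iterate the fixed-policy Bellman back-up until convergence; return V^pi."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    V = np.zeros(n_states)
    while True:
        T_pi = T[idx, policy]                    # T(s, pi(s), s'), shape (S, S')
        R_pi = R[idx, policy]                    # R(s, pi(s), s'), shape (S, S')
        # V_new(s) = sum_{s'} T(s, pi(s), s') * (R(s, pi(s), s') + gamma * V(s'))
        V_new = (T_pi * (R_pi + gamma * V[None, :])).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```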
Exercise 2
Policy Iteration
Alternative approach:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
Step 2: Policy improvement: update policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
Repeat steps until policy converges
This is policy iteration
It's still optimal!
Can converge faster under some conditions
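A compact policy-iteration sketch (illustrative, reusing the policy_evaluation helper sketched above and the same assumed T and R arrays):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    while True:
        V = policy_evaluation(T, R, gamma, policy)                     # Step 1
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)                                  # Step 2
        if np.array_equal(new_policy, policy):                         # converged
            return policy, V
        policy = new_policy
```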
Policy Evaluation Revisited
Idea 1: modify the Bellman updates to use the fixed policy's action π(s) instead of the max over actions
Idea 2: it's just a linear system, solve with Matlab (or whatever)
variables: V^π(s)
constants: T, R
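A sketch of Idea 2 in NumPy (assumed conventions as before): policy evaluation reduces to solving (I − γ T^π) V^π = R^π, a linear system in the |S| unknowns V^π(s).

```python
import numpy as np

def policy_evaluation_exact(T, R, gamma, policy):
    """Solve the linear system (I - gamma * T_pi) V = R_pi directly."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    T_pi = T[idx, policy]                                   # T(s, pi(s), s'), shape (S, S')
    R_pi = (T[idx, policy] * R[idx, policy]).sum(axis=1)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
```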
Policy Iteration Guarantees
Policy iteration iterates over:
Policy evaluation: compute V^{π_k}, the value of the current policy π_k
Policy improvement: π_{k+1}(s) ← argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_k}(s') ]
Theorem. Policy iteration is guaranteed to converge, and at convergence the current policy and its value function are the optimal policy and the optimal value function!
Proof sketch:
(1) Guaranteed to converge: In every step the policy improves. This means that a given policy can be encountered at most once. This means that after we have iterated as many times as there are different policies, i.e., (number of actions)^(number of states), we must be done and hence have converged.
(2) Optimal at convergence: By definition of convergence, at convergence π_{k+1}(s) = π_k(s) for all states s.
This means for all states s: V^{π_k}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_k}(s') ].
Hence V^{π_k} satisfies the Bellman equation, which means V^{π_k} is equal to the optimal value function V*.
Outline
Optimal Control
=
given an MDP (S, A, T, R, γ, H)
find the optimal policy π*
Exact Methods:
Value Iteration
Policy Iteration
Linear Programming
For now: discrete state-action spaces as they are simpler to get the main
concepts across. Will consider continuous spaces later!
Infinite Horizon Linear Program
Recall, at value iteration convergence we have:
V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]   for all s ∈ S
LP formulation to find V*:
min_V Σ_s μ_0(s) V(s)
subject to: V(s) ≥ Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]   for all s ∈ S, a ∈ A
μ_0 is a probability distribution over S, with μ_0(s) > 0 for all s ∈ S.
Theorem. V* is the solution to the above LP.
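A sketch of this primal LP with scipy.optimize.linprog (an illustration under the same assumed array conventions; linprog minimizes c^T x subject to A_ub x ≤ b_ub, so each constraint is rearranged accordingly):

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma):
    """Solve for V* via the primal LP: min mu0^T V  s.t.  V >= Bellman backup for all (s, a)."""
    n_states, n_actions, _ = T.shape
    mu0 = np.full(n_states, 1.0 / n_states)          # any distribution with mu0(s) > 0

    # Constraint for each (s, a):
    #   sum_{s'} (gamma * T[s,a,s'] - I[s,s']) * V(s') <= -sum_{s'} T[s,a,s'] * R[s,a,s']
    A_ub = (gamma * T - np.eye(n_states)[:, None, :]).reshape(n_states * n_actions, n_states)
    b_ub = -(T * R).sum(axis=2).reshape(n_states * n_actions)

    res = linprog(c=mu0, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    return res.x                                      # V*(s)
```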
Theorem Proof
Dual Linear Program
max_{λ ≥ 0} Σ_{s,a} λ(s, a) Σ_{s'} T(s, a, s') R(s, a, s')                                  (Equation 1)
subject to: Σ_a λ(s', a) = μ_0(s') + γ Σ_{s,a} λ(s, a) T(s, a, s')   for all s' ∈ S          (Equation 2)
Interpretation: λ(s, a) = expected discounted number of times action a is taken in state s
Equation 2: ensures λ has the above meaning
Equation 1: maximize expected discounted sum of rewards
Optimal policy: π*(s) = a with probability λ(s, a) / Σ_{a'} λ(s, a')
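And a matching sketch of the dual LP with scipy.optimize.linprog (same assumptions; the only care needed is that the flattening order of λ(s, a) matches between the objective and the constraints):

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual_lp(T, R, gamma):
    """Solve the dual LP over lambda(s, a) and read off a policy from the visitation weights."""
    n_states, n_actions, _ = T.shape
    mu0 = np.full(n_states, 1.0 / n_states)

    r = (T * R).sum(axis=2)                    # r(s, a) = expected immediate reward
    c = -r.reshape(-1)                         # linprog minimizes, so negate the objective

    # Equality constraint per s':
    #   sum_a lambda(s', a) - gamma * sum_{s,a} lambda(s, a) * T(s, a, s') = mu0(s')
    A_eq = np.zeros((n_states, n_states * n_actions))
    for sp in range(n_states):
        coeff = -gamma * T[:, :, sp]           # coefficient of lambda(s, a), shape (S, A)
        coeff[sp, :] += 1.0                    # adds the sum_a lambda(s', a) term
        A_eq[sp] = coeff.reshape(-1)

    res = linprog(c=c, A_eq=A_eq, b_eq=mu0, bounds=[(0, None)] * (n_states * n_actions))
    lam = res.x.reshape(n_states, n_actions)
    policy = lam.argmax(axis=1)                # deterministic policy: most-visited action per state
    return lam, policy
```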
Outline
Optimal Control
=
given an MDP (S, A, T, R, γ, H)
find the optimal policy π*
Exact Methods:
Value Iteration
Policy Iteration
Linear Programming
For now: discrete state-action spaces as they are simpler to get the main
concepts across. Will consider continuous spaces later!
Today and forthcoming lectures
Optimal control: provides general computational approach to tackle control problems.
  Dynamic programming / Value iteration
    Exact methods on discrete state spaces (DONE!)
    Discretization of continuous state spaces
    Function approximation
    Linear systems
    LQR
    Extensions to nonlinear settings:
      Local linearization
      Differential dynamic programming
  Optimal Control through Nonlinear Optimization
    Open-loop
    Model Predictive Control
  Examples: