Fall 2010 Graduate Course on
Unit - 5
Dynamic Learning
Chapter 6: MDP and Reinforcement
Learning
Reinforcement Learning
October 12, 2010
Byoung-Tak Zhang
School of Computer Science and Engineering &
Cognitive Science and Brain Science Programs
Seoul National University
http://bi.snu.ac.kr/~btzhang/
Overview
• Motivating Applications
– Learning robots
– Web agents
• Markov Decision Processes
– MDP
– POMDP
• Reinforcement Learning
– Adaptive Dynamic Programming (ADP)
– Temporal Difference (TD) Learning
– TD-Q Learning
Motivating Applications
• Generalized model learning for reinforcement learning on a humanoid robot:
http://www.youtube.com/watch?v=mRpX9DFCdwI
• Autonomous spider learns to walk forward by
reinforcement learning:
http://www.youtube.com/watch?v=RZf8fR1SmNY&feature=related
• Reinforcement learning for a robotic soccer goalkeeper:
http://www.youtube.com/watch?v=CIF2SBVY-J0&feature=related
WAIR: Reinforcement Learning Web Agent for Personalized Information Filtering
[Figure: the WAIR agent loop. WAIR observes a state, takes an action (modifying the user profile), filters documents for the user, and receives a reward in the form of the user's relevance feedback on the filtered documents.]
Zhang, B.-T. and Seo, Y.-W., Applied Artificial Intelligence, 15(7):665-685, 2001
Reinforcement Learning is…
… learning from trial-and-error and reward by interaction with an environment.
Markov Decision Processes (MDPs)
• A simple, general framework for modeling
sequential decision making in a stochastic
environment
• Early work by Bellman and Howard in the 1950s
• Currently a popular model in OR and AI
• Used to model problems in robotics, finance, agriculture, etc.
Markov Decision Processes (MDPs)
• The decision-making agent starts in some state and
chooses an action according to its policy
• This determines an immediate reward and causes a
stochastic transition to the next state
• Aim is to find the policy that leads to the highest total
reward over T time steps (finite horizon)
• Because of the Markov property, the policy does not have to remember previous states
The Formal Decision Problem - MDP
• Given <S, A, P, R, T, K>
– S is a finite state set (with start state s0)
– A is a finite action set
– P(s’ | s, a) is a table of transition probabilities
– R(s, a, s’) is a reward function
• Policy π(s, t) = a
• Is there a policy that yields total reward at least K over the finite horizon T?
• We will require that T < |S|
POMDP: Partially Observable MDP
• The agent's observation is a function of the current state
• Now you may need to remember previous
observations in order to act optimally
POMDP
• Given <S, A, P, R, Ω, O, T, K>
– Ω is a finite observation set
– O(o | s, a, s’) is a table of observation probabilities
• Policy π(o1, …, ot) = a
• Does there exist a policy that yields total
reward at least K over horizon T?
DEC-POMDP2
• Now two cooperating (decentralized) agents with different observation functions
DEC-POMDP2
• Given <S, A, P, R, Ω1, Ω2, O, T, K>
– Ω1 and Ω2 are finite observation sets
– O(o1, o2 | s, a1, a2, s’) is a table of observation probabilities
• Local policies π1(o11, …, o1t) = a1, π2(o21, …, o2t) = a2
• Joint policy <π1, π2>
• Does there exist a joint policy that yields total reward at least K over horizon T?
Note: DEC-POMDP1 = POMDP
DEC-MDP2
• DEC-POMDP2 with requirement that the state is
uniquely determined by the tuple of observations:
J : Ω1 × Ω2 → S
Note: DEC-MDP1 = MDP
MDP
• Can use dynamic programming to “back up” values
from the end
V(s) = \max_{a \in A} \Big( R(s, a) + \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big)
Elements of Reinforcement Learning
[Figure: a grid world showing the immediate reward values r(state, action) (100 for actions entering the goal state G, 0 otherwise) and the corresponding V*(state) values (100, 90, 81, ...).]
• Value function: maps states to state values
V(s) \equiv r(t) + \gamma\, r(t+1) + \gamma^2\, r(t+2) + \cdots
Discount factor γ ∈ [0, 1) (here 0.9)
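As a quick check of this definition, here is a minimal Python sketch that computes a discounted return from a reward sequence; the reward list is a made-up example, and γ = 0.9 matches the grid-world figure.

# Minimal sketch: V(s) = r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ...
def discounted_return(rewards, gamma=0.9):
    """Sum of a (finite) reward sequence discounted by gamma per time step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Two zero-reward steps followed by the goal reward of 100 gives 0.9^2 * 100 = 81,
# matching the 81 entry in the V*(state) grid above.
print(discounted_return([0, 0, 100]))  # 81.0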
Elements of MDP
• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S x A → S (deterministic)
or T: S x A x S → [0, 1] (stochastic)
• A reward function R: S x A → Real
or R: S x A x S → Real
• A policy π: S → A (deterministic)
or π: S x A → [0, 1] (stochastic)
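A minimal sketch of how these elements could be represented in Python; the two-state MDP below (state names, actions, probabilities, rewards) is purely illustrative. Later sketches in this section reuse this dictionary convention.

# Hypothetical two-state MDP: stochastic T: S x A x S -> [0, 1],
# reward R: S x A -> Real, deterministic policy pi: S -> A.
S = ["s1", "s2"]
A = ["stay", "move"]

# T[(s, a)] maps next states to probabilities (each distribution sums to 1)
T = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.9, "s2": 0.1},
}

# R[(s, a)] is the immediate reward for taking action a in state s
R = {("s1", "stay"): 0.0, ("s1", "move"): 0.0,
     ("s2", "stay"): 1.0, ("s2", "move"): 0.0}

pi = {"s1": "move", "s2": "stay"}   # a deterministic policy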
Optimality Criteria
Suppose an agent receives a reward r_t at time t. Then optimal behaviour might:
• Maximise the sum of expected future rewards: \sum_t r_t
• Maximise over a finite horizon: \sum_{t=0}^{T} r_t
• Maximise over an infinite horizon: \sum_{t=0}^{\infty} r_t
• Maximise over a discounted infinite horizon: \sum_{t=0}^{\infty} \gamma^t r_t
• Maximise average reward: \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} r_t
Examples of MDPs
• Goal-directed, Indefinite Horizon, Cost Minimization MDP
– <S, A, Pr, C, G, s0>
– Most often studied in planning community
• Infinite Horizon, Discounted Reward Maximization MDP
– <S, A, Pr, R, γ>
– Most often studied in reinforcement learning
• Goal-directed, Finite Horizon, Prob. Maximization MDP
– <S, A, Pr, G, s0, T>
– Also studied in planning community
• Oversubscription Planning: Non absorbing goals, Reward Max. MDP
– <S, A, Pr, G, R, s0>
– Relatively recent model
Assumptions
• First-Order Markovian dynamics (history independence)
– Pr(St+1|At,St,At-1,St-1,..., S0) = Pr(St+1|At,St)
– Next state only depends on current state and current action
• First-Order Markovian reward process
– Pr(Rt|At,St,At-1,St-1,..., S0) = Pr(Rt|At,St)
– Reward only depends on current state and action
– As described earlier, we will assume the reward is specified by a deterministic function R(s)
• i.e. Pr(Rt=R(St) | At,St) = 1
• Stationary dynamics and reward
– Pr(St+1|At,St) = Pr(Sk+1|Ak,Sk) for all t, k
– The world dynamics do not depend on the absolute time
• Full observability
– Though we can’t predict exactly which state we will reach when we execute an
action, once it is realized, we know what it is
Policies (“Plans” for MDPs)
• Nonstationary policy
– π:S x T → A, where T is the non-negative integers
– π(s,t) is action to do at state s with t stages-to-go
– What if we want to keep acting indefinitely?
• Stationary policy
– π:S → A
– π(s) is action to do at state s (regardless of time)
– specifies a continuously reactive controller
• Both of these assume or have the following properties:
– full observability
– history-independence
– deterministic action choice
Value of a Policy
• How good is a policy π?
• How do we measure “accumulated” reward?
• Value function V: S → ℝ associates a value with each state (or with each state and time for a non-stationary π)
• Vπ(s) denotes value of policy at state s
– Depends on immediate reward, but also what you achieve
subsequently by following π
– An optimal policy is one that is no worse than any other policy at any state
• The goal of MDP planning is to compute an optimal policy (the method depends on how we define value)
Finite-Horizon Value Functions
• We first consider maximizing total reward over a finite horizon
• Assumes the agent has n time steps to live
• To act optimally, should the agent use a stationary or non-stationary policy?
• Put another way:
– If you had only one week to live would you act the
same way as if you had fifty years to live?
Finite Horizon Problems
• Value (utility) depends on stage-to-go
– hence so should policy: nonstationary π(s,k)
• V_\pi^k(s) is the k-stage-to-go value function for π
– expected total reward after executing π for k time steps
V_\pi^k(s) = E\Big[ \sum_{t=0}^{k} R^t \mid \pi, s \Big]
           = E\Big[ \sum_{t=0}^{k} R(s^t) \mid a^t = \pi(s^t, k - t),\; s^0 = s \Big]
• Here R^t and s^t are random variables denoting the reward received and the state at stage t, respectively
Computing Finite-Horizon Value
• Can use dynamic programming to compute V_\pi^k(s)
– Markov property is critical for this
(a) V_\pi^0(s) = R(s), \quad \forall s
(b) V_\pi^k(s) = R(s) + \sum_{s'} T(s, \pi(s, k), s') \cdot V_\pi^{k-1}(s')
    [immediate reward + expected future payoff with k−1 stages to go]
• What is the time complexity?
[Figure: backup from V^k to V^{k-1}: the action π(s, k) leads to successor states with probabilities 0.7 and 0.3.]
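A minimal sketch of this recursion, assuming tabular dictionaries for T and for the state reward R(s) as in the slides (all names illustrative); it fills in V_π^0, …, V_π^k bottom-up.

# k-stage-to-go policy evaluation:
#   V_pi^0(s) = R(s)
#   V_pi^k(s) = R(s) + sum_{s'} T(s, pi(s, k), s') * V_pi^{k-1}(s')
# S: list of states, R: dict s -> reward, T: dict (s, a) -> {s': prob},
# pi: function (s, stages_to_go) -> action.
def evaluate_policy_finite_horizon(S, R, T, pi, k):
    V = {s: R[s] for s in S}                        # V_pi^0
    for stage in range(1, k + 1):
        V_prev = V
        V = {s: R[s] + sum(p * V_prev[s2]
                           for s2, p in T[(s, pi(s, stage))].items())
             for s in S}
    return V                                        # V_pi^k

# Time complexity: O(k * |S|^2) with dense transitions (one backup per state per stage).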
Bellman Backup
How can we compute the optimal V_{t+1}(s) given the optimal V_t?
[Figure: from state s, action a1 reaches s1 with probability 0.7 and s4 with probability 0.3; action a2 reaches s2 with probability 0.4 and s3 with probability 0.6. Compute the expectations over V_t, then take the max.]
V_{t+1}(s) = R(s) + \max \{\, 0.7\, V_t(s_1) + 0.3\, V_t(s_4),\;\; 0.4\, V_t(s_2) + 0.6\, V_t(s_3) \,\}
Value Iteration: Finite Horizon Case
• Markov property allows exploitation of DP principle
for optimal policy construction
– no need to enumerate the |A|^{Tn} possible policies
• Value Iteration (Bellman backup):
V^0(s) = R(s), \quad \forall s
V^k(s) = R(s) + \max_a \sum_{s'} T(s, a, s') \cdot V^{k-1}(s')
\pi^*(s, k) = \arg\max_a \sum_{s'} T(s, a, s') \cdot V^{k-1}(s')
• V^k is the optimal k-stage-to-go value function
• π*(s, k) is the optimal k-stage-to-go policy
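A minimal sketch of finite-horizon value iteration under the same tabular assumptions (names illustrative); it returns both V^k and the k-stage-to-go greedy policy.

# Finite-horizon value iteration:
#   V^0(s) = R(s)
#   V^k(s) = R(s) + max_a sum_{s'} T(s, a, s') * V^{k-1}(s')
#   pi*(s, k) = argmax_a sum_{s'} T(s, a, s') * V^{k-1}(s')
def value_iteration_finite(S, A, R, T, horizon):
    V = {s: R[s] for s in S}                        # V^0
    policy = {}                                     # policy[(s, k)] = pi*(s, k)
    for k in range(1, horizon + 1):
        V_prev, V = V, {}
        for s in S:
            backups = {a: sum(p * V_prev[s2] for s2, p in T[(s, a)].items())
                       for a in A}
            best_a = max(backups, key=backups.get)
            policy[(s, k)] = best_a
            V[s] = R[s] + backups[best_a]
    return V, policy                                # V^horizon and pi*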
Value Iteration
[Figure: value iteration unrolled over stages V^3, V^2, V^1, V^0 for states s1–s4, with the transition probabilities 0.7/0.3 and 0.4/0.6 at each stage.]
V^1(s_4) = R(s_4) + \max \{\, 0.7\, V^0(s_1) + 0.3\, V^0(s_4),\;\; 0.4\, V^0(s_2) + 0.6\, V^0(s_3) \,\}
Value Iteration
[Figure: the same diagram; the optimal policy is read off by recording the maximizing action at each stage.]
\pi^*(s_4, t) = \arg\max \{ \ldots \}
Discounted Infinite Horizon MDPs
• Defining value as total reward is problematic with
infinite horizons
– many or all policies have infinite expected reward
– some MDPs are ok (e.g., zero-cost absorbing states)
• “Trick”: introduce discount factor 0 ≤ β < 1
– future rewards are discounted by β per time step
V_\pi(s) = E\Big[ \sum_{t=0}^{\infty} \beta^t R^t \mid \pi, s \Big]
• Note: V_\pi(s) \le E\Big[ \sum_{t=0}^{\infty} \beta^t R^{\max} \Big] = \frac{1}{1-\beta} R^{\max}
• Motivation: economic? failure prob? convenience?
Computing an Optimal Value Function
• Bellman equation for optimal value function
V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') V^*(s')
– Bellman proved this is always true
• How can we compute the optimal value function?
– The MAX operator makes the system non-linear, so the problem is more difficult than policy evaluation
• Notice that the optimal value function is a fixed-point of the Bellman backup operator B
– B takes a value function as input and returns a new value function
B[V](s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') V(s')
Value Iteration
• Can compute the optimal policy using value iteration, just as in finite-horizon problems (just include the discount term)
V^0(s) = 0
V^k(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s') V^{k-1}(s')
• Will converge to the optimal value function as k gets large. Why? (The discounted Bellman backup is a contraction with factor β, so the iterates converge to its unique fixed point V^*.)
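A minimal sketch of the discounted version with a stopping test on the largest change between sweeps; the tolerance and names are illustrative assumptions.

# Discounted value iteration: V^k(s) = R(s) + beta * max_a sum_{s'} T(s,a,s') * V^{k-1}(s')
# Stops when successive iterates differ by less than tol (justified by the beta-contraction).
def value_iteration(S, A, R, T, beta=0.9, tol=1e-6):
    V = {s: 0.0 for s in S}                         # V^0 = 0
    while True:
        V_new = {s: R[s] + beta * max(sum(p * V[s2] for s2, p in T[(s, a)].items())
                                      for a in A)
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new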
Policy Evaluation
• Value equation for a fixed policy:
V_\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s') V_\pi(s')
• How can we compute the value function for a
policy?
– we are given R and Pr
– a simple linear system with n variables (each variable is the value of one state) and n constraints (one value equation for each state)
– Use linear algebra (e.g. matrix inverse)
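A minimal NumPy sketch of the linear-algebra route: solve (I − βT_π)V_π = R directly. The integer state indexing and array shapes are assumptions for illustration.

import numpy as np

# Exact policy evaluation: V_pi = R + beta * T_pi V_pi  <=>  (I - beta * T_pi) V_pi = R.
# R: length-n reward vector, T: n x |A| x n transition array, pi: length-n array of action indices.
def evaluate_policy_exact(R, T, pi, beta=0.9):
    n = len(R)
    T_pi = T[np.arange(n), pi]                     # n x n transition matrix under pi
    return np.linalg.solve(np.eye(n) - beta * T_pi, np.asarray(R, dtype=float))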
Policy Iteration
• Given fixed policy, can compute its value exactly:
V_\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s') V_\pi(s')
• Policy iteration exploits this: iterates steps of policy evaluation
and policy improvement
1. Choose a random policy π
2. Loop:
   (a) Evaluate V_π
   (b) For each s in S, set π'(s) = \arg\max_a \sum_{s'} T(s, a, s') V_\pi(s')   [policy improvement]
   (c) Replace π with π'
   Until no improving action is possible at any state
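A minimal NumPy sketch of this loop, reusing exact policy evaluation for step (a); the indexing conventions are the same assumptions as in the previous sketch.

import numpy as np

# Policy iteration: evaluate the current policy exactly, then improve it greedily,
# stopping when no state's action changes.
def policy_iteration(R, T, beta=0.9):
    n, m, _ = T.shape                                    # n states, m actions
    R = np.asarray(R, dtype=float)
    pi = np.zeros(n, dtype=int)                          # (1) start from an arbitrary policy
    while True:
        T_pi = T[np.arange(n), pi]                       # transitions under pi
        V = np.linalg.solve(np.eye(n) - beta * T_pi, R)  # (a) evaluate V_pi
        pi_new = np.argmax(T @ V, axis=1)                # (b) pi'(s) = argmax_a sum_s' T(s,a,s') V_pi(s')
        if np.array_equal(pi_new, pi):                   # (c) stop when no improvement
            return pi, V
        pi = pi_new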
Policy Iteration Notes
• Each step of policy iteration is guaranteed to
strictly improve the policy at some state when
improvement is possible
• Convergence assured (Howard)
– intuitively: there are no local maxima in value space, and each step must improve the value; since there are finitely many policies, the iteration converges to an optimal policy
• Gives exact value of optimal policy
Value Iteration vs. Policy Iteration
• Which is faster, VI or PI?
– It depends on the problem
• VI takes more iterations than PI, but PI requires
more time on each iteration
– PI must perform policy evaluation on each step, which involves solving a linear system
• Complexity:
– There are at most exp(n) policies, so PI is no worse than exponential time in the number of states
– Empirically O(n) iterations are required
– Still no polynomial bound on the number of PI
iterations (open problem)!
Adaptive Dynamic Programming (ADP)
For each s, initialize V(s) , P(s’|s,a) and R(s,a)
Initialize s to current state that is perceived
Loop forever
{
Select an action a and execute it (using the current model R and P):
a = \arg\max_a \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \Big)
Receive immediate reward r and observe the new state s’
Use the transition tuple <s,a,s’,r> to update the model R and P
For all states s, update V(s) using the update rule:
V(s) = \max_a \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \Big)
s = s’
}
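A minimal Python sketch of this loop. The environment interface env.reset()/env.step(a) is a hypothetical placeholder, and the model updates are simple running averages (the maximum-likelihood estimates described on the next slide).

from collections import defaultdict

def adp_agent(env, S, A, gamma=0.9, n_steps=1000):
    V = {s: 0.0 for s in S}
    R = defaultdict(float)                                   # estimated R(s, a)
    P = {(s, a): defaultdict(float) for s in S for a in A}   # estimated P(s'|s, a)
    n = defaultdict(int)                                     # visit counts for (s, a)
    s = env.reset()
    for _ in range(n_steps):
        # greedy action under the current model:
        # a = argmax_a ( R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') )
        a = max(A, key=lambda act: R[(s, act)] +
                gamma * sum(p * V[s2] for s2, p in P[(s, act)].items()))
        s_next, r = env.step(a)
        # update the model R and P from the transition tuple <s, a, s', r>
        n[(s, a)] += 1
        k = n[(s, a)]
        R[(s, a)] += (r - R[(s, a)]) / k
        for s2 in set(P[(s, a)]) | {s_next}:
            target = 1.0 if s2 == s_next else 0.0
            P[(s, a)][s2] += (target - P[(s, a)][s2]) / k
        # full sweep: V(s) = max_a ( R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') )
        for st in S:
            V[st] = max(R[(st, act)] +
                        gamma * sum(p * V[s2] for s2, p in P[(st, act)].items())
                        for act in A)
        s = s_next
    return V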
How to Learn Model?
• Use the transition tuple <s, a, s’, r> to learn P(s’|s,a) and R(s,a). That’s supervised learning!
– Since the agent observes every transition (s, a, s’, r) directly, take (s, a) and s’ as an input-output example for the transition probability function P(s’|s,a).
– Different supervised learning techniques can be used:
• Neural networks
• Decision trees
– Use r and P(s’|s,a) to learn R(s,a):
R(s, a) = \sum_{s'} P(s' \mid s, a)\, r
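A minimal sketch of the count-based (maximum-likelihood) version of this idea: estimate P(s'|s,a) from transition counts and R(s,a) as the average observed reward. The sample transitions are hypothetical.

from collections import defaultdict

def learn_model(transitions):
    """transitions: iterable of observed tuples (s, a, s_next, r)."""
    n_sa = defaultdict(int)        # visits to (s, a)
    n_sas = defaultdict(int)       # visits to (s, a, s')
    r_sum = defaultdict(float)     # total reward observed for (s, a)
    for s, a, s2, r in transitions:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s2)] += 1
        r_sum[(s, a)] += r
    P = {(s, a, s2): cnt / n_sa[(s, a)] for (s, a, s2), cnt in n_sas.items()}
    R = {(s, a): total / n_sa[(s, a)] for (s, a), total in r_sum.items()}
    return P, R

# Hypothetical data: three transitions from ("s1", "move") and one from ("s2", "stay")
P, R = learn_model([("s1", "move", "s2", 0), ("s1", "move", "s2", 0),
                    ("s1", "move", "s1", 0), ("s2", "stay", "s2", 1)])
print(P[("s1", "move", "s2")], R[("s2", "stay")])   # 0.666..., 1.0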
From ADP to TD Learning
• In each step, ADP updates V for all states s:
V(s) = \max_a \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \Big)
– This is intractable for a large state space!
– Improve this by prioritized sweeping.
• Temporal difference (TD) learning:
– Adjust the estimated utility value V of the current state
based on its immediate reward R(s) and the estimated
value of the next state s’.
V(s) \leftarrow V(s) + \alpha \big( R(s) + \gamma V(s') - V(s) \big)
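A minimal sketch of the TD update applied along one observed trajectory; the learning rate, discount, and the trajectory itself are illustrative assumptions.

from collections import defaultdict

# TD(0): after each observed step (s, r, s'), nudge V(s) toward r + gamma * V(s').
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = defaultdict(float)
trajectory = [("s1", 0.0, "s2"), ("s2", 1.0, "s3"), ("s3", 0.0, "s1")]   # (s, reward, s')
for s, r, s_next in trajectory:
    td0_update(V, s, r, s_next)
print(dict(V))   # small updates toward the observed returns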
TD-Q Learning
• Define Q-value function
V(s) = \max_a Q(s, a)
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s')
        = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')
• Recall: TD learning
V(s) \leftarrow V(s) + \alpha \big( R(s) + \gamma V(s') - V(s) \big)
• Key idea of TD-Q learning
Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)
• Note: compare with the delta rule (neural network learning):
w(t) \leftarrow w(t) + \alpha (d - o)\, x
TD-Q Learning Algorithm
For each pair (s,a), initialize Q(s,a)
Observe the current state s
Loop forever
{
Select an action a and execute it:
a = \arg\max_a Q(s, a)
Receive immediate reward r and observe the new state s’
Update Q(s,a)
Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)
s = s’
}
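A minimal Python sketch of this algorithm. The slide selects actions greedily; the ε-greedy exploration below is a common practical addition (an assumption, not part of the slide), and env.reset()/env.step(a) is a hypothetical environment interface.

import random
from collections import defaultdict

def q_learning(env, A, alpha=0.1, gamma=0.9, epsilon=0.1, n_steps=10000):
    Q = defaultdict(float)                       # Q[(s, a)], implicitly initialized to 0
    s = env.reset()
    for _ in range(n_steps):
        if random.random() < epsilon:            # exploration (assumption, not on the slide)
            a = random.choice(A)
        else:                                    # a = argmax_a Q(s, a)
            a = max(A, key=lambda act: Q[(s, act)])
        s_next, r = env.step(a)
        # Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max_a' Q(s',a') - Q(s,a) )
        target = r + gamma * max(Q[(s_next, act)] for act in A)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q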
Generalization in RL
• So far we assumed that all the functions learned by the agent (V, P, R, Q) are in tabular form, i.e. it is possible to enumerate the state and action spaces.
• Use generalization techniques to deal with large state or action spaces (see the sketch after this list):
– Function approximation techniques
– Neural networks
– Other supervised/unsupervised learning methods
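A minimal sketch of one such generalization technique: Q-learning with a linear function approximator over state-action features, updated by gradient descent (the delta-rule connection noted earlier). The feature function φ(s, a) and its dimensionality are hypothetical.

import numpy as np

# Replace the Q-table with Q(s, a) ≈ w · phi(s, a) and update w with the
# gradient form of the Q-learning rule (cf. w <- w + alpha * (d - o) * x).
def q_approx(w, phi, s, a):
    return float(np.dot(w, phi(s, a)))

def gradient_q_update(w, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    target = r + gamma * max(q_approx(w, phi, s_next, b) for b in actions)
    td_error = target - q_approx(w, phi, s, a)
    # descend the squared TD error: gradient w.r.t. w is -td_error * phi(s, a)
    return w + alpha * td_error * phi(s, a)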
Alternatives
• Function approximation of the Q-table:
– Neural networks
– Decision trees
– Gradient descent methods
• Reinforcement learning variants:
– Relational reinforcement learning
– Hierarchical reinforcement learning
– Intrinsically motivated reinforcement
learning
References and Further Reading
• Sutton, R., Barto, A., (2000) Reinforcement Learning:
an Introduction, The MIT Press
http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A., (1996)
Reinforcement Learning: a Survey, Journal of
Artificial Intelligence Research, 4:237-285
• Barto, A., Mahadevan, S., (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77