Chapter 13: Reinforcement Learning
CS 536: Machine Learning
Littman (Wu, TA)
Administration
Midterms due
Daily Show video
Reinforcement Learning
[Read Chapter 13]
[Exercises 13.1, 13.2, 13.4]
Control learning
Control policies that choose optimal actions
Q learning
Convergence
Control Learning
Consider learning to choose actions, for example:
A robot learning to dock on a battery charger
Learning to choose actions to optimize factory output
Learning to play Backgammon
Problem Characteristics
Note several problem characteristics:
Delayed reward
Opportunity for active exploration
Possibility that the state is only partially observable
Possible need to learn multiple tasks with the same sensors/effectors
One Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Immediate reward:
+100 if win
-100 if lose
0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to the best human players.
The RL Problem
Goal: Learn to choose actions that maximize
r0 + γ r1 + γ² r2 + ..., where 0 ≤ γ < 1
Markov Decision Processes
Assume
finite set of states S; set of actions A
at each discrete time step the agent observes state st in S and chooses action at in A
then receives immediate reward rt, and the state changes to st+1
Markov assumption: rt = r(st, at) and st+1 = δ(st, at) depend only on the current state and action
δ and r may be nondeterministic
δ and r are not necessarily known to the agent
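To make the δ and r notation concrete, here is a small Python sketch of a deterministic MDP written as two lookup tables. The states, actions, and reward values are invented for illustration; this is not the example MDP from the slides.

# A toy deterministic MDP as explicit lookup tables (illustration only).
GAMMA = 0.9  # discount factor, 0 <= gamma < 1

STATES = ["s1", "s2", "goal"]
ACTIONS = ["left", "right"]

# delta: (state, action) -> next state (deterministic transition function)
DELTA = {
    ("s1", "left"): "s1",     ("s1", "right"): "s2",
    ("s2", "left"): "s1",     ("s2", "right"): "goal",
    ("goal", "left"): "goal", ("goal", "right"): "goal",
}

# r: (state, action) -> immediate reward; entering the goal pays +100
REWARD = {(s, a): 0.0 for s in STATES for a in ACTIONS}
REWARD[("s2", "right")] = 100.0

def step(state, action):
    """One time step: the agent receives r(s, a) and moves to delta(s, a)."""
    return REWARD[(state, action)], DELTA[(state, action)]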
Agent's Learning Task
Execute actions in the environment, observe the results, and
learn an action policy π : S → A that maximizes
E[rt + γ rt+1 + γ² rt+2 + ...]
from any starting state in S;
here 0 ≤ γ < 1 is the discount factor for future rewards.
Different Learning Problem
Note something new:
the target function is π : S → A
Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states:
Vπ(s) ≡ rt + γ rt+1 + γ² rt+2 + ...
      ≡ Σi=0..∞ γ^i rt+i
where rt, rt+1, ... are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
π* ≡ argmaxπ Vπ(s), (∀s)
but we have no training examples of
form <s, a>
training examples are of form
<<s, a>, r>
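As a quick numerical illustration of the discounted sum that defines Vπ (a sketch of mine, not part of the slides), the return of a finite reward sequence can be computed directly; truncating the infinite sum at the end of the sequence is the sketch's simplification.

def discounted_return(rewards, gamma=0.9):
    """Sum_i gamma^i * r_i for a finite list of rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: zero reward for three steps, then +100.
print(discounted_return([0, 0, 0, 100]))   # 0.9**3 * 100 ≈ 72.9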
Example MDP (diagram not captured in this transcript)
What to Learn
We might try to have the agent learn the evaluation function Vπ* (which we write as V*).
It could then do a lookahead search to choose the best action from any state s, because
π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))]
A problem:
This works well if the agent knows δ : S × A → S and r : S × A → ℜ.
But when it doesn't, it can't choose actions this way.
Q Function
Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))]
      = argmaxa Q(s, a)
Q is the evaluation function the agent will learn.
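The point that a Q table alone suffices for action selection fits in a few lines of Python; the table entries below are placeholders, not values from the lecture's example.

# pi*(s) = argmax_a Q(s, a): no delta and no r are needed to act.
def greedy_action(Q, state, actions):
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s1", "left"): 66.0, ("s1", "right"): 90.0}   # placeholder values
print(greedy_action(Q, "s1", ["left", "right"]))    # -> "right"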
Training Rule to Learn Q
Note that Q and V* are closely related:
V*(s) = maxa' Q(s, a')
This allows us to write Q recursively as
Q(st, at) = r(st, at) + γ V*(δ(st, at))
          = r(st, at) + γ maxa' Q(st+1, a')
Nice! Let Q̂ denote the learner's current approximation to Q. Use the training rule
Q̂(s, a) ← r + γ maxa' Q̂(s', a')
where s' is the state resulting from applying action a in state s.
Q Learning in Deterministic Case
For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
  Select an action a and execute it
  Receive immediate reward r
  Observe the new state s'
  Update the table entry for Q̂(s, a) via:
    Q̂(s, a) ← r + γ maxa' Q̂(s', a')
  s ← s'
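A runnable sketch of this loop, reusing the toy STATES/ACTIONS/DELTA/REWARD tables and step() from the MDP sketch above; the random exploration policy and the reset to s1 after a fixed number of steps are choices of the sketch, not part of the slide's algorithm.

import random

def q_learning_deterministic(episodes=1000, steps_per_episode=20, gamma=0.9):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialize Q-hat to 0
    for _ in range(episodes):
        s = "s1"                                          # observe current state
        for _ in range(steps_per_episode):
            a = random.choice(ACTIONS)                    # select an action and execute it
            r, s_next = step(s, a)                        # receive reward r, observe new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next                                    # s <- s'
    return Q

Q = q_learning_deterministic()
print(Q[("s2", "right")], Q[("s1", "right")])   # settles near 100.0 and 90.0 (= 0 + 0.9 * 100)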
Updating Q̂
Q̂(s1, aright) ← r + γ maxa' Q̂(s2, a')
              ← 0 + 0.9 max {63, 81, 100} = 90
Notice that if rewards are non-negative, then
(∀ s, a, n)  Q̂n+1(s, a) ≥ Q̂n(s, a)
and
(∀ s, a, n)  0 ≤ Q̂n(s, a) ≤ Q(s, a)
Convergence Proof
Q̂ converges to Q. Consider the case of a deterministic world where each <s, a> is visited infinitely often.
Proof: Define a full interval to be an interval during which each <s, a> is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂n be the table after n updates, and Δn be the maximum error in Q̂n; that is,
Δn = maxs,a |Q̂n(s, a) - Q(s, a)|
Proof Continued
For a table entry Q̂n(s, a) updated on iteration n+1, the error in the revised estimate Q̂n+1(s, a) is
|Q̂n+1(s, a) - Q(s, a)| = |(r + γ maxa' Q̂n(s', a')) - (r + γ maxa' Q(s', a'))|
                        = γ |maxa' Q̂n(s', a') - maxa' Q(s', a')|
                        ≤ γ maxa' |Q̂n(s', a') - Q(s', a')|
                        ≤ γ maxs'',a' |Q̂n(s'', a') - Q(s'', a')|
so |Q̂n+1(s, a) - Q(s, a)| ≤ γ Δn
Note that we used the fact that
|maxa f1(a) - maxa f2(a)| ≤ maxa |f1(a) - f2(a)|
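To make explicit the conclusion the proof is heading toward, here is the final step written out in LaTeX; it simply combines the per-interval contraction above with the assumption that every <s, a> is visited infinitely often.

% After k full intervals the maximum table error has contracted by gamma each time:
\[
  \Delta_{\text{after $k$ full intervals}} \;\le\; \gamma^{k}\,\Delta_{0}
  \;\longrightarrow\; 0 \quad\text{as } k \to \infty \text{ (since } 0 \le \gamma < 1\text{)},
\]
\[
  \text{hence } \hat{Q}(s, a) \to Q(s, a) \text{ for all } s, a.
\]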
Nondeterministic Case
What if the reward and next state are nondeterministic?
We redefine V and Q by taking expected values:
Vπ(s) ≡ E[rt + γ rt+1 + γ² rt+2 + ...]
      ≡ E[Σi=0..∞ γ^i rt+i]
Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
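One way to read these expectations operationally (a sketch, not from the slides) is as Monte Carlo averages over sampled trajectories; sample_episode below is a hypothetical helper that runs the stochastic environment under policy pi from state s and returns the observed rewards.

def estimate_value(sample_episode, pi, s, gamma=0.9, n_episodes=1000):
    """Monte Carlo estimate of V^pi(s) = E[sum_i gamma^i r_{t+i}]."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(pi, s)   # hypothetical: one trajectory's rewards under pi
        total += sum((gamma ** i) * r for i, r in enumerate(rewards))
    return total / n_episodes             # sample mean approximates the expectation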
Nondeterministic Case (continued)
Q learning generalizes to nondeterministic worlds.
Alter the training rule to
Q̂n(s, a) ← (1 - αn) Q̂n-1(s, a) + αn [r + γ maxa' Q̂n-1(s', a')]
where
αn = 1 / (1 + visitsn(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992].
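A sketch of this update rule in Python, with the decaying learning rate computed from a visit counter; the environment interaction itself is left out, so the function only performs the table update from one observed transition (s, a, r, s').

from collections import defaultdict

Q = defaultdict(float)     # Q-hat, 0 for unseen (state, action) pairs
visits = defaultdict(int)  # visits_n(s, a)

def q_update_nondeterministic(s, a, r, s_next, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])                         # alpha_n = 1 / (1 + visits_n(s, a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Q-hat_n(s, a) <- (1 - alpha_n) Q-hat_{n-1}(s, a) + alpha_n [r + gamma max_a' Q-hat_{n-1}(s', a')]
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target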
Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
Q(1)(st, at) ≡ rt + γ maxa Q̂(st+1, a)
Why not 2 steps?
Q(2)(st, at) ≡ rt + γ rt+1 + γ² maxa Q̂(st+2, a)
Or n?
Q(n)(st, at) ≡ rt + γ rt+1 + ... + γ^(n-1) rt+n-1 + γ^n maxa Q̂(st+n, a)
Blend all of these:
Qλ(st, at) ≡ (1 - λ) [Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ...]
Temporal Difference Learning (continued)
Qλ(st, at) ≡ (1 - λ) [Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ...]
Equivalent expression:
Qλ(st, at) ≡ rt + γ [(1 - λ) maxa Q̂(st+1, a) + λ Qλ(st+1, at+1)]
The TD(λ) algorithm uses the above training rule.
Sometimes it converges faster than Q learning (not well understood in the control case).
It converges for learning V for any 0 ≤ λ ≤ 1 [Dayan, 1992].
Tesauro's TD-Gammon uses this algorithm to estimate the value function via self-play.
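A sketch (mine, not the slides') of the Qλ blend computed for the first step of one recorded finite episode: it forms each n-step estimate Q(n) and weights it by (1 - λ)λ^(n-1); truncating the blend when the episode ends is a simplification of the sketch.

def q_lambda_estimate(rewards, next_states, Q, action_set, gamma=0.9, lam=0.5):
    """rewards[k] is r_{t+k}; next_states[k] is s_{t+k+1}; Q is the current Q-hat table."""
    blended = 0.0
    weight = 1.0 - lam          # weight of Q^(n) is (1 - lambda) * lambda^(n-1)
    discounted_rewards = 0.0    # running sum r_t + gamma r_{t+1} + ... + gamma^(n-1) r_{t+n-1}
    for n in range(1, len(rewards) + 1):
        discounted_rewards += (gamma ** (n - 1)) * rewards[n - 1]
        # Q^(n): n observed rewards, then bootstrap with max_a Q-hat at s_{t+n}
        q_n = discounted_rewards + (gamma ** n) * max(
            Q[(next_states[n - 1], a)] for a in action_set)
        blended += weight * q_n
        weight *= lam
    return blended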
Subtleties & Ongoing Research
Replace the Q̂ table with a neural net or other generalizer
Handle the case where the state is only partially observable
Design optimal exploration strategies
Extend to continuous actions and states
Learn and use a model δ̂ : S × A → S
Relationship to dynamic programming and heuristic search