Chapter 13: Reinforcement Learning
CS 536: Machine Learning
Littman (Wu, TA)
Administration
Midterms due
Daily Show video
Reinforcement Learning
[Read Chapter 13]
[Exercises 13.1, 13.2, 13.4]
Control learning
Control policies that choose optimal actions
Q learning
Convergence
Control Learning
Consider learning to choose actions, for example:
A robot learning to dock on a battery charger
Learning to choose actions to optimize factory output
Learning to play Backgammon
Problem Characteristics
Note several problem characteristics:
Delayed reward
Opportunity for active exploration
Possibility that the state is only partially observable
Possible need to learn multiple tasks with the same sensors/effectors
One Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Immediate reward:
+100 if win
-100 if lose
0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to the best human players.
The RL Problem
Goal: Learn to choose actions that maximize
r0 + γ r1 + γ² r2 + ..., where 0 ≤ γ < 1
Markov Decision Processes
Assume
finite set of states S; set of actions A
at each discrete time step the agent observes state st in S and chooses action at in A
then receives immediate reward rt, and the state changes to st+1
Markov assumption: rt = r(st, at) and st+1 = δ(st, at) depend only on the current state and action
δ and r may be nondeterministic
δ and r are not necessarily known to the agent
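To make the δ and r notation concrete, here is a small Python sketch of a deterministic MDP written as two lookup tables. The states, actions, and reward values are invented for illustration; this is not the example MDP from the slides.

# A toy deterministic MDP as explicit lookup tables (illustration only).
GAMMA = 0.9  # discount factor, 0 <= gamma < 1

STATES = ["s1", "s2", "goal"]
ACTIONS = ["left", "right"]

# delta: (state, action) -> next state (deterministic transition function)
DELTA = {
    ("s1", "left"): "s1",     ("s1", "right"): "s2",
    ("s2", "left"): "s1",     ("s2", "right"): "goal",
    ("goal", "left"): "goal", ("goal", "right"): "goal",
}

# r: (state, action) -> immediate reward; entering the goal pays +100
REWARD = {(s, a): 0.0 for s in STATES for a in ACTIONS}
REWARD[("s2", "right")] = 100.0

def step(state, action):
    """One time step: the agent receives r(s, a) and moves to delta(s, a)."""
    return REWARD[(state, action)], DELTA[(state, action)]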
Agent's Learning Task
Execute actions in the environment, observe the results, and
learn an action policy π : S → A that maximizes
E[rt + γ rt+1 + γ² rt+2 + ...]
from any starting state in S;
here 0 ≤ γ < 1 is the discount factor for future rewards.
Different Learning Problem
Note something new:
the target function is π : S → A
Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states:
Vπ(s) ≡ rt + γ rt+1 + γ² rt+2 + ...
      ≡ Σi=0..∞ γ^i rt+i
where rt, rt+1, ... are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
π* ≡ argmaxπ Vπ(s), (∀s)
but we have no training examples of
form <s, a>
training examples are of form
<<s, a>, r>
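As a quick numerical illustration of the discounted sum that defines Vπ (a sketch of mine, not part of the slides), the return of a finite reward sequence can be computed directly; truncating the infinite sum at the end of the sequence is the sketch's simplification.

def discounted_return(rewards, gamma=0.9):
    """Sum_i gamma^i * r_i for a finite list of rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: zero reward for three steps, then +100.
print(discounted_return([0, 0, 0, 100]))   # 0.9**3 * 100 ≈ 72.9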
Example MDP (diagram not captured in this transcript)
What to Learn
We might try to have the agent learn the evaluation function Vπ* (which we write as V*).
It could then do a lookahead search to choose the best action from any state s, because
π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))]
A problem:
This works well if the agent knows δ : S × A → S and r : S × A → ℜ.
But when it doesn't, it can't choose actions this way.
Q Function
Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))]
      = argmaxa Q(s, a)
Q is the evaluation function the agent will learn.
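The point that a Q table alone suffices for action selection fits in a few lines of Python; the table entries below are placeholders, not values from the lecture's example.

# pi*(s) = argmax_a Q(s, a): no delta and no r are needed to act.
def greedy_action(Q, state, actions):
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s1", "left"): 66.0, ("s1", "right"): 90.0}   # placeholder values
print(greedy_action(Q, "s1", ["left", "right"]))    # -> "right"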
Training Rule to Learn Q
Note that Q and V* are closely related:
V*(s) = maxa' Q(s, a')
This allows us to write Q recursively as
Q(st, at) = r(st, at) + γ V*(δ(st, at))
          = r(st, at) + γ maxa' Q(st+1, a')
Nice! Let Q̂ denote the learner's current approximation to Q. Use the training rule
Q̂(s, a) ← r + γ maxa' Q̂(s', a')
where s' is the state resulting from applying action a in state s.
Q Learning in Deterministic Case
For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
  Select an action a and execute it
  Receive immediate reward r
  Observe the new state s'
  Update the table entry for Q̂(s, a) via:
    Q̂(s, a) ← r + γ maxa' Q̂(s', a')
  s ← s'
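A runnable sketch of this loop, reusing the toy STATES/ACTIONS/DELTA/REWARD tables and step() from the MDP sketch above; the random exploration policy and the reset to s1 after a fixed number of steps are choices of the sketch, not part of the slide's algorithm.

import random

def q_learning_deterministic(episodes=1000, steps_per_episode=20, gamma=0.9):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialize Q-hat to 0
    for _ in range(episodes):
        s = "s1"                                          # observe current state
        for _ in range(steps_per_episode):
            a = random.choice(ACTIONS)                    # select an action and execute it
            r, s_next = step(s, a)                        # receive reward r, observe new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next                                    # s <- s'
    return Q

Q = q_learning_deterministic()
print(Q[("s2", "right")], Q[("s1", "right")])   # settles near 100.0 and 90.0 (= 0 + 0.9 * 100)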
Updating Q̂
Q̂(s1, aright) ← r + γ maxa' Q̂(s2, a')
              ← 0 + 0.9 max {63, 81, 100} = 90
Notice that if rewards are non-negative, then
(∀ s, a, n)  Q̂n+1(s, a) ≥ Q̂n(s, a)
and
(∀ s, a, n)  0 ≤ Q̂n(s, a) ≤ Q(s, a)
Convergence Proof
Q̂ converges to Q. Consider the case of a deterministic world where each <s, a> is visited infinitely often.
Proof: Define a full interval to be an interval during which each <s, a> is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂n be the table after n updates, and Δn be the maximum error in Q̂n; that is,
Δn = maxs,a |Q̂n(s, a) - Q(s, a)|
Proof Continued
For a table entry Q̂n(s, a) updated on iteration n+1, the error in the revised estimate Q̂n+1(s, a) is
|Q̂n+1(s, a) - Q(s, a)| = |(r + γ maxa' Q̂n(s', a')) - (r + γ maxa' Q(s', a'))|
                        = γ |maxa' Q̂n(s', a') - maxa' Q(s', a')|
                        ≤ γ maxa' |Q̂n(s', a') - Q(s', a')|
                        ≤ γ maxs'',a' |Q̂n(s'', a') - Q(s'', a')|
so |Q̂n+1(s, a) - Q(s, a)| ≤ γ Δn
Note that we used the fact that
|maxa f1(a) - maxa f2(a)| ≤ maxa |f1(a) - f2(a)|
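To make explicit the conclusion the proof is heading toward, here is the final step written out in LaTeX; it simply combines the per-interval contraction above with the assumption that every <s, a> is visited infinitely often.

% After k full intervals the maximum table error has contracted by gamma each time:
\[
  \Delta_{\text{after $k$ full intervals}} \;\le\; \gamma^{k}\,\Delta_{0}
  \;\longrightarrow\; 0 \quad\text{as } k \to \infty \text{ (since } 0 \le \gamma < 1\text{)},
\]
\[
  \text{hence } \hat{Q}(s, a) \to Q(s, a) \text{ for all } s, a.
\]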
Nondeterministic Case
What if the reward and next state are nondeterministic?
We redefine V and Q by taking expected values:
Vπ(s) ≡ E[rt + γ rt+1 + γ² rt+2 + ...]
      ≡ E[Σi=0..∞ γ^i rt+i]
Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
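One way to read these expectations operationally (a sketch, not from the slides) is as Monte Carlo averages over sampled trajectories; sample_episode below is a hypothetical helper that runs the stochastic environment under policy pi from state s and returns the observed rewards.

def estimate_value(sample_episode, pi, s, gamma=0.9, n_episodes=1000):
    """Monte Carlo estimate of V^pi(s) = E[sum_i gamma^i r_{t+i}]."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(pi, s)   # hypothetical: one trajectory's rewards under pi
        total += sum((gamma ** i) * r for i, r in enumerate(rewards))
    return total / n_episodes             # sample mean approximates the expectation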
Nondeterministic Case (continued)
Q learning generalizes to nondeterministic worlds.
Alter the training rule to
Q̂n(s, a) ← (1 - αn) Q̂n-1(s, a) + αn [r + γ maxa' Q̂n-1(s', a')]
where
αn = 1 / (1 + visitsn(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992].
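A sketch of this update rule in Python, with the decaying learning rate computed from a visit counter; the environment interaction itself is left out, so the function only performs the table update from one observed transition (s, a, r, s').

from collections import defaultdict

Q = defaultdict(float)     # Q-hat, 0 for unseen (state, action) pairs
visits = defaultdict(int)  # visits_n(s, a)

def q_update_nondeterministic(s, a, r, s_next, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])                         # alpha_n = 1 / (1 + visits_n(s, a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Q-hat_n(s, a) <- (1 - alpha_n) Q-hat_{n-1}(s, a) + alpha_n [r + gamma max_a' Q-hat_{n-1}(s', a')]
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target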
Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
Q(1)(st, at) ≡ rt + γ maxa Q̂(st+1, a)
Why not 2 steps?
Q(2)(st, at) ≡ rt + γ rt+1 + γ² maxa Q̂(st+2, a)
Or n?
Q(n)(st, at) ≡ rt + γ rt+1 + ... + γ^(n-1) rt+n-1 + γ^n maxa Q̂(st+n, a)
Blend all of these:
Qλ(st, at) ≡ (1 - λ) [Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ...]
Temporal Difference Learning (continued)
Qλ(st, at) ≡ (1 - λ) [Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ...]
Equivalent expression:
Qλ(st, at) ≡ rt + γ [(1 - λ) maxa Q̂(st+1, a) + λ Qλ(st+1, at+1)]
The TD(λ) algorithm uses the above training rule.
Sometimes it converges faster than Q learning (not well understood in the control case).
It converges for learning V for any 0 ≤ λ ≤ 1 [Dayan, 1992].
Tesauro's TD-Gammon uses this algorithm to estimate the value function via self-play.
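A sketch (mine, not the slides') of the Qλ blend computed for the first step of one recorded finite episode: it forms each n-step estimate Q(n) and weights it by (1 - λ)λ^(n-1); truncating the blend when the episode ends is a simplification of the sketch.

def q_lambda_estimate(rewards, next_states, Q, action_set, gamma=0.9, lam=0.5):
    """rewards[k] is r_{t+k}; next_states[k] is s_{t+k+1}; Q is the current Q-hat table."""
    blended = 0.0
    weight = 1.0 - lam          # weight of Q^(n) is (1 - lambda) * lambda^(n-1)
    discounted_rewards = 0.0    # running sum r_t + gamma r_{t+1} + ... + gamma^(n-1) r_{t+n-1}
    for n in range(1, len(rewards) + 1):
        discounted_rewards += (gamma ** (n - 1)) * rewards[n - 1]
        # Q^(n): n observed rewards, then bootstrap with max_a Q-hat at s_{t+n}
        q_n = discounted_rewards + (gamma ** n) * max(
            Q[(next_states[n - 1], a)] for a in action_set)
        blended += weight * q_n
        weight *= lam
    return blended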
Subtleties & Ongoing Research
Replace the Q̂ table with a neural net or other generalizer
Handle the case where the state is only partially observable
Design optimal exploration strategies
Extend to continuous actions and states
Learn and use a model δ̂ : S × A → S
Relationship to dynamic programming and heuristic search