10 REINFORCEMENT LEARNING
Fall 2023 CS54118 Machine Learning
Credits
1. B2: Machine learning: an algorithmic perspective. 2nd Edition,
Marsland, Stephen. CRC press, 2015
2. https://www.samyzaf.com/ML/rl/qmaze.html
3. http://www.gwydir.demon.co.uk/jo/maze/makemaze/index.htm
4. https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
Assignment
Read:
B2: Chapter 11
Problems:
B2:11.1, 11.2, 11.4
Reinforcement Learning
Information is provided about whether or not the answer is correct, but
not how to improve it.
The reinforcement learner has to try out different strategies and see which
work best.
Trying out ≡ searching, which is a fundamental part of any reinforcement learner
Searches over the state space of possible inputs and outputs in order to try
to maximize a reward.
Reinforcement Learning: Comes from Animal intelligence
You repeat an action that gives you more satisfaction (or reward). Such an
action gets reinforced with the situation that caused it.
An Example
The robot mouse has no idea of the room layout: the obstructions and the open ways.
The robot mouse only knows how to recognize the block with cheese, the end goal.
Restricted Actions
It can move only in one of the four ways.
Approach: explore and learn
Learning Part
Taking action 2 (going right) from cell (3,5) results in punishment.
Punishment can be denoted by some negative value.
Taking action 3 (going down) from cell (3,9) results in just another block, that is, neither hitting an obstruction nor reaching the goal.
This can be represented by a zero value.
Taking action 3 (going down) from cell (9,10) results in reaching the goal.
This can be represented by a high positive value.
This also makes cell (9,10) very valuable, because it has a way to reach the goal.
Taking action 2 (going right) from cell (9,9) results in reaching the valuable cell (9,10).
It can be given a high positive value.
Hence, cell (9,9) also becomes valuable.
How much “history” we want to maintain is also a choice.
After learning sufficiently, the best action for each state will look something like this.
Three things
Current input or the state
Possible things that can be done, or the actions
Aim is to maximize reward
Agent: the one that is doing the acting and learning
Environment: where the agent acts
• State: sensor readings
• May not tell the robot everything
• There may be noise/inaccuracies
• Actions: possible ways in which the robot can drive its motors
• Actions are performed on the environment
• Reward: how well the robot navigates to the goal with minimal crashing
Reward can be delayed or it can be instant
E.g. a robot travelling a maze vs. a robot driving a car
Policy: Choice of action that should be taken
Should be a combination of exploitation and exploration
Policy: Exploitation vs. Exploration
Exploitation
Take the best action learnt so far
Exploration
Give sub-optimal actions a chance
Absorbing state: your goal state, when the solution you were searching
for is found
Also called Terminal or Accepting state.
State space: set of all states that the learner can experience
Action space: set of all actions that the learner can take
EXAMPLE: GETTING LOST
• You know that 𝐹 is the absorbing state, because you are vaguely familiar with it
• You decide to reward yourself with chips only if you reach 𝐹
You don’t have the city map with you; hence, you are not aware of the reward
matrix (rows: current state, columns: action or next state), which is shown in
full later in these notes.
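A minimal Python sketch of how this reward matrix can be written down, using the values given later in these notes (-inf for an impossible move, 0 for an ordinary move, -5 for staying put, 100 for reaching 𝐹); the names STATES and R are our own labels, not part of the example.

import numpy as np

# Reward matrix for the "getting lost" example (rows: current state,
# columns: action/next state). -inf marks an impossible move, 0 an ordinary
# move, -5 staying put, and 100 a move that reaches the absorbing state F.
STATES = ["A", "B", "C", "D", "E", "F"]
NINF = -np.inf

R = np.array([
    [ -5.0,  0.0, NINF, NINF, NINF, NINF],   # from A
    [  0.0, -5.0,  0.0,  0.0, NINF, NINF],   # from B
    [ NINF,  0.0, -5.0,  0.0, NINF, 100.0],  # from C
    [ NINF,  0.0,  0.0, -5.0,  0.0, NINF],   # from D
    [ NINF, NINF, NINF,  0.0, -5.0, 100.0],  # from E
    [ NINF, NINF,  0.0, NINF, NINF,  0.0],   # from F (absorbing)
])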
Carrots and Sticks: The Reward Function
Reward Function
Input:
Current State and Chosen Action
Output: Numerical reward based on them
It is generated by the environment around
the learner
It is not internal to the learner
Two parts
An intermediate part (at every step or so)
A pay-off in the end
The reward tells the learner
what the goal is, not how the
goal should be achieved, which
would be supervised learning.
It is usually a bad idea to include sub-goals (e.g., to speed up learning):
the learner can find ways of achieving the sub-goals without actually
achieving the real goal.
For continual tasks (e.g., learning to walk): there is no terminal state
In general, we want to predict the reward into the infinite future
Solution: Discounting
We discount future rewards depending upon how “far” they are in the future:
R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ⋯ = Σ_{k≥0} γ^k·r_{t+k+1}
where 0 < 𝛾 < 1 is the discounting factor
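A small sketch of the discounting idea, assuming a hypothetical reward sequence and an arbitrary 𝛾 = 0.9; discounted_return is an illustrative helper, not a function from the course material.

def discounted_return(rewards, gamma=0.9):
    """Discounted return: sum over k of gamma**k * r_{t+k+1}, with 0 < gamma < 1."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 100 arriving as the third future reward is discounted by gamma**2:
print(discounted_return([0, 0, 100]))  # 81.0 when gamma = 0.9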
Action Selection
Each action is assigned a value based on
The current state
How it has been rewarded in the past
The reward that action 𝑎 has received on being taken 𝑡 times from state 𝑠: 𝑄𝑠,𝑡(𝑎)
These estimates will eventually converge to the true predictions
Based on the prediction 𝑄𝑠,𝑡(𝑎), there are three methods of choosing 𝑎
Greedy: pick the action with the highest 𝑄𝑠,𝑡(𝑎)
𝜖-Greedy: similar to greedy, but with a small probability 𝜖 we pick some
other action at random
Works better than greedy in practice.
Soft-max: refinement of 𝜖-greedy, where each action is chosen with probability
proportional to exp(𝑄𝑠,𝑡(𝑎)/𝜏)
𝝉 (temperature):
Large 𝜏: all actions have similar probability
Small 𝜏: individual probabilities matter more
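The three selection rules can be sketched in Python as below; q_row is assumed to be the row of 𝑄𝑠,𝑡 estimates for the current state, and the defaults eps = 0.1 and tau = 1.0 are arbitrary illustrative choices.

import numpy as np

def greedy(q_row):
    """Pick the action with the highest current estimate Q_{s,t}(a)."""
    return int(np.argmax(q_row))

def epsilon_greedy(q_row, eps=0.1, rng=np.random.default_rng()):
    """With probability eps pick a random action (explore), otherwise exploit."""
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return greedy(q_row)

def softmax_action(q_row, tau=1.0, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q/tau).
    Large tau: all actions have similar probability; small tau: close to greedy."""
    prefs = np.exp((np.asarray(q_row) - np.max(q_row)) / tau)  # shift by max for stability
    return int(rng.choice(len(q_row), p=prefs / prefs.sum()))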
Policy
Choice of which action to take in each state in order to get optimal
results: 𝜋
It is a mapping from states to actions.
Goal: learn better 𝜋 for each state 𝑠𝑡 as we proceed.
Two things:
How much past information we need to know while getting into the current
state (use of Markov Decision Process)
How we assign a value to the current state
MARKOV DECISION PROCESSES
Is the information in the current state sufficient to compute the reward for
the next move (e.g., chess)?
Such a state is called a Markov state.
OR
Do we need to consider how we got to the current state, i.e., does a sense of
history need to be stored?
A reinforcement learning problem that follows Markov state property
is known as a Markov Decision Process (MDP)
The number of possible states and actions are assumed to be finite.
Markov Chain: States + transition probabilities
Markov Decision Process: Extension of Markov Chain with actions and
rewards
Markov Chain
A simple example of a Markov chain: deciding on the state of your mind
tomorrow given your state of mind today.
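A tiny sketch of sampling such a chain; the three moods and the transition probabilities below are made-up placeholders, not the values from the slide's figure.

import numpy as np

# Hypothetical "state of mind" Markov chain; each row of P must sum to 1.
STATES = ["happy", "neutral", "sad"]
P = np.array([
    [0.6, 0.3, 0.1],   # from happy
    [0.3, 0.4, 0.3],   # from neutral
    [0.2, 0.3, 0.5],   # from sad
])

rng = np.random.default_rng(0)
state = 0  # start in "happy"
for day in range(1, 6):
    state = rng.choice(len(STATES), p=P[state])  # tomorrow depends only on today
    print(f"day {day}: {STATES[state]}")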
Markov Decision Process
• Addition: selecting an action does not guarantee a particular next state.
• We add action nodes coming out of state nodes.
• Multiple state nodes can be connected to an action node.
Transition Diagram
MDP for getting lost example
VALUES
Value denotes the expected future reward the
reinforcement learner is trying to maximize
Two ways:
State-value function 𝑉(𝑠): average the reward across all of the actions that can be taken
Action-value function 𝑄(𝑠, 𝑎): consider each possible action that can be taken from the current state separately.
State-value function is less accurate but easier to compute
Action-value function is more accurate but requires more information to compute
Now, we have two problems
Deciding which action to take, i.e., the policy
Predicting the value function
Optimal policy, 𝜋 ∗ , is the one (not necessarily unique) in which the
value function is the greatest over all possible states.
Optimal value functions can be of two types:
Optimal state value function: 𝑉∗(𝑠) = max_𝜋 𝑉^𝜋(𝑠)
Assumes taking the optimal action in each case
Optimal action value function: 𝑄∗(𝑠, 𝑎) = max_𝜋 𝑄^𝜋(𝑠, 𝑎)
Considers taking action 𝑎 this time, and then following the optimal policy
from then on
Hence, optimal action value = current reward + discounted estimate of the
future reward:
𝑄∗(𝑠_𝑡, 𝑎_𝑡) = 𝐸[𝑟_{𝑡+1}] + 𝛾·max_𝑎 𝑄∗(𝑠_{𝑡+1}, 𝑎)
How do we update values?
Updated Estimate ← Current Estimate +
𝜇(New Estimate – Current Estimate)
Temporal Difference (TD) method
If we remember the previous 𝜆 states and also update their value functions,
then the above algorithm is called 𝑇𝐷(𝜆).
TD(0)
Off-policy decision (Q-learning): bootstrap from the best action available in the next state:
𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝜇(𝑟 + 𝛾·max_{𝑎′} 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎))
On-policy decision (SARSA): bootstrap from the action 𝑎′ actually chosen in the next state:
𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝜇(𝑟 + 𝛾·𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎))
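The two TD(0) updates can be sketched as below, with Q an array indexed by (state, action); the defaults mu = 0.6 and gamma = 0.3 match the worked examples that follow.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, mu=0.6, gamma=0.3):
    """Off-policy TD(0): bootstrap from the best action available in s_next."""
    Q[s, a] += mu * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, mu=0.6, gamma=0.3):
    """On-policy TD(0): bootstrap from the action a_next actually taken in s_next."""
    Q[s, a] += mu * (r + gamma * Q[s_next, a_next] - Q[s, a])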
Recall the Getting Lost Example
Reward Matrix (rows: current state, columns: action or next state; shown in
full later in these notes)
Q-learning Example
For the getting lost example, consider the following trail of steps for one
iteration: A, B, A, B, D, D, B, C, F
And the following initial Q matrix (rows: state, columns: action or next state):
States      A       B       C       D       E       F
A       -0.039  -0.028  -0.026  -0.018   0.013   0.011
B        0.028  -0.029   0.024   0.044   0.013  -0.026
C       -0.018  -0.042  -0.017  -0.002  -0.026   0.047
D        0.013   0.025  -0.004   0.043  -0.015   0.011
E       -0.018   0.011  -0.018  -0.009  -0.038  -0.033
F        0.013  -0.018   0.039  -0.026   0.011  -0.002
Compute the Q-matrix updates during the iteration. Use learning rate 𝜇 = 0.6
and discounting factor 𝛾 = 0.3.
Trail: A, B, A, B, D, D, B, C, F; 𝜇 = 0.6, 𝛾 = 0.3

For 𝐴 → 𝐵: 𝑠 = 𝐴, 𝑎 = 𝐵, 𝑠′ = 𝐵, 𝑟 = 0
𝑄(𝐴, 𝐵) = 𝑄(𝐴, 𝐵) + 0.6(0 + 0.3 × max_{𝑎′} 𝑄(𝐵, 𝑎′) − 𝑄(𝐴, 𝐵))
= −0.028 + 0.6(0.3 × 0.044 + 0.028)
= −0.003
The updated Q matrix is the initial one with 𝑄(𝐴, 𝐵) replaced by −0.003; all
other entries are unchanged.

For 𝐵 → 𝐴: 𝑠 = 𝐵, 𝑎 = 𝐴, 𝑠′ = 𝐴, 𝑟 = 0, and the update proceeds in the same
way, now using max_{𝑎′} 𝑄(𝐴, 𝑎′) from the updated matrix.
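As a cross-check of the computation above, a self-contained sketch that replays the whole trail with the Q-learning update (the state label doubles as the action index, as in the slides); the first line it prints should be Q(A,B) -> -0.003.

import numpy as np

STATES = ["A", "B", "C", "D", "E", "F"]
idx = {s: i for i, s in enumerate(STATES)}
NINF = -np.inf

# Reward matrix and initial Q matrix from the slides (rows: state, cols: action/next state).
R = np.array([
    [ -5.0,  0.0, NINF, NINF, NINF, NINF],
    [  0.0, -5.0,  0.0,  0.0, NINF, NINF],
    [ NINF,  0.0, -5.0,  0.0, NINF, 100.0],
    [ NINF,  0.0,  0.0, -5.0,  0.0, NINF],
    [ NINF, NINF, NINF,  0.0, -5.0, 100.0],
    [ NINF, NINF,  0.0, NINF, NINF,  0.0],
])
Q = np.array([
    [-0.039, -0.028, -0.026, -0.018,  0.013,  0.011],
    [ 0.028, -0.029,  0.024,  0.044,  0.013, -0.026],
    [-0.018, -0.042, -0.017, -0.002, -0.026,  0.047],
    [ 0.013,  0.025, -0.004,  0.043, -0.015,  0.011],
    [-0.018,  0.011, -0.018, -0.009, -0.038, -0.033],
    [ 0.013, -0.018,  0.039, -0.026,  0.011, -0.002],
])

trail = ["A", "B", "A", "B", "D", "D", "B", "C", "F"]
mu, gamma = 0.6, 0.3
for s, s_next in zip(trail[:-1], trail[1:]):
    i, j = idx[s], idx[s_next]                        # action index == next-state index
    Q[i, j] += mu * (R[i, j] + gamma * np.max(Q[j]) - Q[i, j])
    print(f"Q({s},{s_next}) -> {Q[i, j]:.3f}")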
Q matrix after iteration # 10:
[[ -0.0   0.0  -inf  -0.0   0.0   0.0]
 [ -0.0   0.0  38.2  -0.0   0.0  -inf]
 [ -0.0  -0.0  -0.0  -0.0   0.0  99.8]
 [ -0.0   0.0  -0.0  -0.0  -0.0  -inf]
 [ -0.0  -0.0  -0.0   0.0  -0.0  91.0]
 [ -0.0  -0.0  -0.0  -0.0  -0.0   0.0]]

Q matrix after iteration # 100:
[[ -0.0  16.0  -inf  -0.0  -inf  -inf]
 [  5.8   7.7  40.0  -0.0   0.0  -inf]
 [ -0.0  -0.0  34.1   6.2   0.0 100.0]
 [ -0.0  16.0  -0.0   0.8  -0.0  -inf]
 [ -0.0  -0.0  -0.0   0.0  -0.0 100.0]
 [ -0.0  -0.0  -0.0  -0.0  -0.0   0.0]]

Q matrix after iteration # 1000:
[[  1.4  16.0  -inf  -0.0  -inf  -inf]
 [  6.4  11.0  40.0  16.0   0.0  -inf]
 [ -0.0  16.0  35.0  16.0   0.0 100.0]
 [ -0.0  16.0  39.7  10.9  40.0  -inf]
 [ -0.0  -0.0  -0.0  16.0  35.0 100.0]
 [ -0.0  -0.0  -0.0  -0.0  -0.0   0.0]]

(Q matrices after iterations 30 and 500 were also shown.)

Sample run of 1000 iterations of the Q-learning (TD(0)) algorithm
SARSA Learning Example
For the getting lost example, consider the following trail of steps taken by
the agent: D, B, B, B, B, C, C, C, F, C
And the following initial Q matrix (rows: state, columns: action or next state):
States      A       B       C       D       E       F
A       -0.039  -0.028  -0.018   0.011  -0.026   0.011
B        0.028  -0.029   0.024   0.044  -0.018  -0.026
C        0.011  -0.042  -0.017  -0.002  -0.026   0.047
D       -0.018   0.025  -0.004   0.043  -0.015  -0.018
E        0.011  -0.018   0.011  -0.009  -0.038  -0.033
F       -0.018  -0.026   0.039  -0.026   0.011  -0.002
Compute the Q-matrix updates during the iteration. Use learning rate 𝜇 = 0.6
and discounting factor 𝛾 = 0.3.
Trail: D, B, B, B, B, C, C, C, F, C; 𝜇 = 0.6, 𝛾 = 0.3

For 𝐷 → 𝐵: 𝑠 = 𝐷, 𝑎 = 𝐵, 𝑠′ = 𝐵, 𝑟 = 0, 𝑎′ = 𝐵
𝑄(𝐷, 𝐵) = 𝑄(𝐷, 𝐵) + 0.6(0 + 0.3 × 𝑄(𝐵, 𝐵) − 𝑄(𝐷, 𝐵))
= 0.025 + 0.6(0.3 × (−0.029) − 0.025)
= 0.005
The updated Q matrix is the initial one with 𝑄(𝐷, 𝐵) replaced by 0.005; all
other entries are unchanged.

For 𝐵 → 𝐵: 𝑠 = 𝐵, 𝑎 = 𝐵, 𝑠′ = 𝐵, 𝑟 = −5, 𝑎′ = 𝐵, and the update proceeds in
the same way.
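Similarly, a sketch that replays the SARSA trail, stepping over (s, s′, s″) triples so that the bootstrap uses the action actually taken next; the first line it prints should be Q(D,B) -> 0.005.

import numpy as np

STATES = ["A", "B", "C", "D", "E", "F"]
idx = {s: i for i, s in enumerate(STATES)}
NINF = -np.inf

# Reward matrix and the SARSA example's initial Q matrix from the slides.
R = np.array([
    [ -5.0,  0.0, NINF, NINF, NINF, NINF],
    [  0.0, -5.0,  0.0,  0.0, NINF, NINF],
    [ NINF,  0.0, -5.0,  0.0, NINF, 100.0],
    [ NINF,  0.0,  0.0, -5.0,  0.0, NINF],
    [ NINF, NINF, NINF,  0.0, -5.0, 100.0],
    [ NINF, NINF,  0.0, NINF, NINF,  0.0],
])
Q = np.array([
    [-0.039, -0.028, -0.018,  0.011, -0.026,  0.011],
    [ 0.028, -0.029,  0.024,  0.044, -0.018, -0.026],
    [ 0.011, -0.042, -0.017, -0.002, -0.026,  0.047],
    [-0.018,  0.025, -0.004,  0.043, -0.015, -0.018],
    [ 0.011, -0.018,  0.011, -0.009, -0.038, -0.033],
    [-0.018, -0.026,  0.039, -0.026,  0.011, -0.002],
])

trail = ["D", "B", "B", "B", "B", "C", "C", "C", "F", "C"]
mu, gamma = 0.6, 0.3
for s, s_next, a_next in zip(trail[:-2], trail[1:-1], trail[2:]):
    i, j, k = idx[s], idx[s_next], idx[a_next]
    Q[i, j] += mu * (R[i, j] + gamma * Q[j, k] - Q[i, j])   # bootstrap from the next action taken
    print(f"Q({s},{s_next}) -> {Q[i, j]:.3f}")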
Q matrix after iteration # 10:
[[ -4.5   0.0  -0.0  -0.0  -0.0  -0.0]
 [ -0.0   0.0  38.4   0.0  -0.0  -0.0]
 [ -0.0  -0.0  -0.0  -0.0   0.0  99.9]
 [ -0.0  -0.0  -0.0   0.0  -0.0   0.0]
 [ -0.0  -0.0  -0.0  -0.0  -0.0  97.3]
 [ -0.0  -0.0   0.0  -0.0   0.0   0.0]]

Q matrix after iteration # 100:
[[ -4.5  16.0  -0.0  -0.0  -0.0  -0.0]
 [ -0.0   7.7  40.0   0.0  -0.0  -0.0]
 [ -0.0  11.2  31.8   6.6   0.0 100.0]
 [ -0.0  -0.0  40.0  -4.5  -0.0  -inf]
 [ -0.0  -0.0  -0.0  -0.0  -0.0 100.0]
 [ -0.0  -0.0   0.0  -0.0   0.0   0.0]]

Q matrix after iteration # 1000:
[[  1.4  16.0  -0.0  -0.0  -0.0  -0.0]
 [  5.4   4.4  40.0   7.4  -0.0  -0.0]
 [ -0.0   9.1  35.0  11.9   0.0 100.0]
 [ -0.0   7.3  14.5  10.9  38.2  -inf]
 [ -0.0  -0.0  -0.0  14.0  32.8 100.0]
 [ -0.0  -0.0   0.0  -0.0   0.0   0.0]]

(Q matrices after iterations 30 and 500 were also shown.)

Sample run of 1000 iterations of the SARSA algorithm
SARSA vs. Q
Every move gets a reward of -1
Moves that end up on the cliff get a reward of -100
The SARSA solution is far from optimal, but it is safe. The Q-learning
solution is optimal, but occasionally the random search will tip it over the
cliff.
SARSA vs. Q
It is our choice, depending upon what we want.
It depends upon how serious a problem falling off the cliff is.
Getting lost example… again!
Reward matrix used:
[[ -5.0 0.0 -inf -inf -inf -inf]
[ 0.0 -5.0 0.0 0.0 -inf -inf]
[ -inf 0.0 -5.0 0.0 -inf 100.0]
[ -inf 0.0 0.0 -5.0 0.0 -inf]
[ -inf -inf -inf 0.0 -5.0 100.0]
[ -inf -inf 0.0 -inf -inf 0.0]]
Q-learning (TD(0)) algorithm – note the number of -inf entries. Q matrix after
iteration # 1000 of the sample run shown earlier:
[[  1.4  16.0  -inf  -0.0  -inf  -inf]
 [  6.4  11.0  40.0  16.0   0.0  -inf]
 [ -0.0  16.0  35.0  16.0   0.0 100.0]
 [ -0.0  16.0  39.7  10.9  40.0  -inf]
 [ -0.0  -0.0  -0.0  16.0  35.0 100.0]
 [ -0.0  -0.0  -0.0  -0.0  -0.0   0.0]]
SARSA algorithm – note the number of -inf entries. Q matrix after iteration
# 1000 of the sample run shown earlier:
[[  1.4  16.0  -0.0  -0.0  -0.0  -0.0]
 [  5.4   4.4  40.0   7.4  -0.0  -0.0]
 [ -0.0   9.1  35.0  11.9   0.0 100.0]
 [ -0.0   7.3  14.5  10.9  38.2  -inf]
 [ -0.0  -0.0  -0.0  14.0  32.8 100.0]
 [ -0.0  -0.0   0.0  -0.0   0.0   0.0]]
USES OF REINFORCEMENT LEARNING
Most popular is in intelligent robotics
The robot can be left to attempt to solve the task without human intervention.
Also popular at learning how to play games
Modelling Sample Problems
To model sample problems, we need to model
𝑄 matrix (state-action matrix)
Reward
The remaining things, like the choice of policy, choice of algorithm, etc.,
are simply choices whose options are well established.
They don’t need to be modelled.
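A minimal sketch of what these two pieces might look like for a small grid maze of the kind used in the linked tutorials; the 4x4 grid, the action set, and the reward values below are all illustrative assumptions, not taken from those pages.

import numpy as np

# Hypothetical 4x4 maze: 1 = free cell, 0 = obstruction; goal = bottom-right cell.
maze = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
])
n_rows, n_cols = maze.shape
ACTIONS = ["up", "down", "left", "right"]
GOAL = (n_rows - 1, n_cols - 1)

# 1. Q matrix: one value per (state, action); here a state is a (row, col) cell.
Q = np.zeros((n_rows, n_cols, len(ACTIONS)))

# 2. Reward: negative for hitting an edge or obstruction, large positive at the goal.
def reward(next_cell):
    r, c = next_cell
    if not (0 <= r < n_rows and 0 <= c < n_cols) or maze[r, c] == 0:
        return -10.0   # bumped into the edge or an obstruction
    if (r, c) == GOAL:
        return 100.0   # reached the goal
    return 0.0         # ordinary move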
Wading Through Maze
Taken from https://towardsdatascience.com/the-other-type-of-machine-learning-97ab81306ce9
Also see: https://samyzaf.com/ML/rl/qmaze.html
Wading Through Maze - II
Q-Table
Credit: https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
Tic-Tac-Toe