Introduction to
Reinforcement Learning
DAVIDE BACCIU – BACCIU@DI.UNIPI.IT
Preliminaries
The Course (3 CFU – ~24h)
✓RL Fundamentals
✓Markov Decision Processes
✓Planning by Dynamic Programming
✓Model-Free Prediction & Control
✓Value Function Methods
✓Policy Gradient Methods
✓Exploration and Exploitation
✓Deep reinforcement learning
✓Advanced topics and applications: continual learning, meta-learning, inverse RL, RL frameworks, some case studies
Delivery: lectures by me, guest lectures by Maurizio Parton, and student seminars
Course Completion
✓ M.Sc. students
✓Prepare a 15-minute presentation to be given on one of the 2 student lecture dates
✓ 5-minute Q&A on the content of the presentation
✓ Seminar content
✓ Read 3 relevant papers on a topic of interest for the course; summarize their content and compare the methods
✓ Implement 1 RL method from the literature and attempt a validation on a simple application; describe the model, its implementation and the validation results
✓ Ph.D. Students
✓ Read 3 (or more) relevant papers on a topic of interest for the course and summarize their content
in a report (6-10 pages single column, NeurIPS format)
✓ Sketch/propose a novel RL method/application: report your idea in sufficient detail in a short
paper (6-10 pages single column, NeurIPS format)
✓ Implement an RL-based application and validate it: prepare a short presentation to report the
results (10-15 slides describing the model, the implementation and the results)
✓ Contact me and agree on alternative ways (e.g. using RL in your Ph.D. project, …)
Fill in your contact details here
https://docs.google.com/spreadsheets/d/10fcU3SVYb5wOYd-c6uag9kk5spq6O1Q86m0fv2aSWsA/edit?usp=sharing
Resources
The course webpage (elearning.di.unipi.it/course/view.php?id=227)
✓Course calendar & news
✓Slides, video lectures, additional materials
✓Course assignment upload
Reference Book
Richard S. Sutton and Andrew G. Barto, Reinforcement
Learning: An Introduction, Second Edition, MIT Press
(available online)
A Note
Much of the content of this course and its slides is heavily
based on the classic course by David Silver
https://www.davidsilver.uk/teaching/
Introduction
Positioning Reinforcement Learning
What characterizes Reinforcement
Learning (vs other ML tasks)?
✓No supervisor: only a reward signal
✓Delayed asynchronous feedback
✓Time matters (sequential data, continual learning)
✓Agent’s actions affect the subsequent data it receives (inherent
non-stationarity)
Using Reinforcement Learning
✓Learning to maneuver vehicles
✓Learning to control robots (walking, navigation, manipulation)
✓Manage portfolios
✓Play games
✓Discover new molecules
✓End-to-end learning with discrete structures
Game Playing
Navigation
Manipulation
https://www.youtube.com/watch?v=jwSbzNHGflM
Formalizing
Reinforcement Learning
Rewards
✓A reward 𝑅𝑡 is a scalar feedback signal
✓Indicates how well the agent is doing at step t
✓The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
✓All goals can be described by the maximisation of expected
cumulative reward
What is a Reward?
✓Learning to drive a car (+ve reward for getting places safely, -ve reward for crashing)
✓Make a humanoid robot walk (+ve reward for forward motion, -ve reward for
tripping over)
✓Make a robot arm manipulate objects (+ve reward for goal achievement, -ve
reward for object falling)
✓Manage an investment portfolio (+ve reward for each $ in bank)
✓Play games (+ve/-ve reward for increasing/decreasing score)
✓Discover new molecules (+ve reward for synthesizable molecule, -ve reward for
toxic molecule)
Sequential Decision Making
✓Goal: select actions to maximise total future reward
✓Actions may have long term consequences
✓Reward may be delayed
✓It may be better to sacrifice immediate reward to gain more long-term reward
✓Examples:
✓A financial investment (may take months to mature)
✓Refuelling a helicopter (might prevent a crash in several hours)
✓Blocking opponent moves (might help winning chances many moves from now)
Agent and
Environment
✓At each step t the agent:
✓ Executes action At
✓ Receives observation 𝑶𝒕
✓ Receives scalar reward 𝑹𝒕
✓The Environment:
✓ Receives action At
✓ Emits observation 𝑶𝒕+𝟏
✓ Emits scalar reward 𝑹𝒕+𝟏
✓t increments at each environment step
History and State
The history is the sequence of observations, actions, rewards
H_t = O_1, R_1, A_1, …, A_{t-1}, O_t, R_t
✓i.e. all observable variables up to time t
✓i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
✓The agent selects actions
✓The environment selects observations/rewards
State 𝑆𝑡 is the information used to determine what happens next and is a
function of history
𝑆𝑡 = 𝑓(𝐻𝑡 )
Environment State
✓The environment state S_t^e is the environment's private representation at time t
✓Whatever information the environment uses
to generate the next observation/reward
✓The environment state is not usually
visible to the agent (unobservable
environment)
✓ Even if 𝑆𝑡𝑒 is visible, it may contain
irrelevant information
Agent State
✓The agent state S_t^a is the agent's internal representation
✓Whatever information the agent uses
to select next action
✓The agent state is the information
used by reinforcement learning
algorithms
✓Generally speaking, it is a function of the history
S_t^a = f(H_t)
Information (Markov) State
An information state (Markov state) contains all useful information from
the history
Definition (Markov State)
A state S_t is Markov if and only if
P(S_{t+1} | S_1, …, S_t) = P(S_{t+1} | S_t)
✓The future is independent of the past given the present (d-separation)
H_{1:t} → S_t → H_{t+1:∞}
✓The state is a sufficient statistic for the future
✓The environment state S_t^e is Markov
✓The history H_t is Markov
What’s the best (agent) state model?
Fully Observable Environment
✓Full observability ⟹ Agent
directly observes the environment
state
O_t = S_t^a = S_t^e
✓Formally this is a Markov Decision
Process (MDP)
✓Next lecture (and much of the RL
literature)
Partially Observable Environment
✓Partial observability ⟹ Agent indirectly observes the environment
✓A robot with camera vision only may not know absolute location
✓A trading agent only observes current prices
✓A poker player only observes public cards
✓Formally S_t^a ≠ S_t^e and the problem is a Partially Observable Markov Decision Process (POMDP)
✓The agent needs to build its own state representation S_t^a, e.g.
✓History: S_t^a = H_t
✓Beliefs over the environment state: S_t^a = [P(S_t^e = s^1), …, P(S_t^e = s^N)]
✓A dynamic memory (RNN): S_t^a = σ(W_s S_{t-1}^a + W_o O_t), sketched below
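As an illustration of the last option, here is a minimal Python/NumPy sketch of a recurrent agent-state update, assuming tanh as the squashing function σ; the dimensions, the weights W_s and W_o, and their random initialisation are placeholders invented for the example (in practice they would be learned).

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, obs_dim = 8, 4          # assumed sizes for the sketch

# Assumed, randomly initialised parameters; in practice these are learned.
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))
W_o = rng.normal(scale=0.1, size=(state_dim, obs_dim))

def update_agent_state(s_prev, o_t):
    """Recurrent agent state: S_t^a = sigma(W_s S_{t-1}^a + W_o O_t)."""
    return np.tanh(W_s @ s_prev + W_o @ o_t)   # tanh plays the role of sigma

# Usage: fold a stream of observations into a fixed-size agent state.
s = np.zeros(state_dim)
for o in rng.normal(size=(5, obs_dim)):        # 5 dummy observations
    s = update_agent_state(s, o)
```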
Components of a
Reinforcement Learning
Agent
Key Components of an RL Agent
✓Policy: agent’s behaviour function
✓Value function: how good is each state and/or action
✓Model: agent’s representation of the environment
An RL agent may include one or more of the above
Policy
✓A policy 𝜋 is the agent’s behaviour
✓It is a map from state 𝑠 to action 𝑎
✓Deterministic policy: a = 𝜋(𝑠)
✓Stochastic policy: 𝜋 𝑎 𝑠 = 𝑃(𝐴𝑡 = 𝑎|𝑆𝑡 = 𝑠)
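A minimal sketch of both policy types in Python; the two-state, two-action problem and the probability tables are invented for illustration and are not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic policy: a = pi(s), stored as a simple lookup table.
pi_det = {"s0": "left", "s1": "right"}

def act_deterministic(s):
    return pi_det[s]

# Stochastic policy: pi(a|s) = P(A_t = a | S_t = s).
pi_stoch = {"s0": {"left": 0.9, "right": 0.1},
            "s1": {"left": 0.2, "right": 0.8}}

def act_stochastic(s):
    actions, probs = zip(*pi_stoch[s].items())
    return rng.choice(actions, p=probs)        # sample an action from pi(.|s)
```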
Value Function
How “good” is a specific
state/action for an agent?
Value Function
✓The value function 𝑣 is a predictor of future reward
✓Used to evaluate the goodness/badness of states
✓And therefore to select between actions, e.g.
v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ | S_t = s]
i.e. the expected (discounted) future reward when following policy π from state s (see the sketch below)
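To make the expectation concrete, here is a hedged sketch of estimating v_π(s) by Monte Carlo: average the discounted return over episodes generated by following π. The helper sample_episode is an assumed placeholder that returns the reward sequence R_{t+1}, R_{t+2}, … observed from state s under π.

```python
def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mc_value_estimate(sample_episode, s, n_episodes=1000, gamma=0.9):
    """Monte Carlo estimate of v_pi(s): average return over sampled episodes.
    `sample_episode(s)` is an assumed helper returning [R_1, R_2, ...] for one
    episode started in s while following the policy pi."""
    returns = [discounted_return(sample_episode(s), gamma) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```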
Model
✓A model predicts what the environment will do
next
✓Predict the next state s′ following action a:
P^a_{ss′} = P(S_{t+1} = s′ | S_t = s, A_t = a)
✓Predict the next reward:
R^a_s = E[R_{t+1} | S_t = s, A_t = a]
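One simple way to obtain such a model, sketched below under assumptions not stated in the slides, is to estimate P^a_{ss′} and R^a_s empirically from observed transitions (s, a, r, s′): transition probabilities from visit counts, expected rewards from running averages.

```python
from collections import defaultdict

class TabularModel:
    """Empirical model: transition probabilities from counts, rewards from averages."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                  # (s, a) -> sum of rewards
        self.visits = defaultdict(int)                        # (s, a) -> number of visits

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_prob(self, s, a, s_next):   # estimate of P^a_{ss'}
        n = self.visits[(s, a)]
        return self.counts[(s, a)][s_next] / n if n else 0.0

    def expected_reward(self, s, a):           # estimate of R^a_s
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0
```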
A Forever Classic - The Maze Example
✓Rewards: -1 per time-step
✓Actions: N, E, S, W
✓States: agent location (see the code sketch below)
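A minimal code sketch of such an environment; the 2×3 layout, the wall position, and the goal cell are invented for illustration and do not reproduce the maze shown in the figure.

```python
class GridMaze:
    """Tiny grid maze: states are (row, col) cells, reward is -1 per step."""
    ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

    def __init__(self):
        self.rows, self.cols = 2, 3
        self.walls = {(0, 1)}          # assumed wall cell
        self.goal = (0, 2)             # assumed goal cell

    def step(self, state, action):
        dr, dc = self.ACTIONS[action]
        nxt = (state[0] + dr, state[1] + dc)
        # Moves off-grid or into a wall leave the agent where it is.
        if not (0 <= nxt[0] < self.rows and 0 <= nxt[1] < self.cols) or nxt in self.walls:
            nxt = state
        reward = -1                    # -1 per time-step, as in the example
        done = nxt == self.goal
        return nxt, reward, done
```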
Maze Example
(Policy)
Arrows represent
policy 𝜋(𝑠) for each
state 𝑠
Maze Example
(Value Function)
Numbers denote the value v_π(s) for each state s, i.e. (minus) the expected number of time-steps needed to reach the goal
Maze Example
(Model)
✓Agent may have an internal
(imperfect) model of the
environment
✓How actions change the state
✓How much reward from each state
✓Grid layout: transition model P^a_{ss′}
✓Numbers: immediate reward model R^a_s
Characterizing RL Agents (I)
✓Value Based
✓Policy (Implicit)
✓Value Function
✓Policy Based
✓Policy
✓No Value Function
✓Actor Critic
✓Policy
✓Value Function
Characterizing RL Agents (II)
✓Model Free
✓No Model
✓Policy and/or Value Function
✓Model Based
✓Model
✓Policy and/or Value Function
A Taxonomy
Problems within
Reinforcement Learning
Learning and Planning
Two fundamental problems in sequential decision making
✓Reinforcement Learning
✓The environment is initially unknown
✓The agent interacts with the environment
✓The agent improves its policy
✓Planning (reasoning, introspection, search,…)
✓A model of the environment is known
✓The agent performs computations with its model (no external interaction)
✓The agent improves its policy
Atari Example: Reinforcement Learning
✓Rules of the game are
unknown
✓Learn directly from
interactive game-play
✓Pick actions on joystick,
see pixels and scores
Atari Example: Planning
✓Rules of the game are known
✓Agent contains emulator (model)
✓If I take action 𝑎 from state 𝑠:
✓what would the next state be?
✓what would the score be?
✓Plan ahead to find optimal policy
✓e.g. tree search
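A hedged sketch of what planning with a known model can look like: exhaustive depth-limited lookahead over a deterministic emulator. The model(s, a) → (next_state, reward) interface, the depth, and the discount factor are assumptions made for the example, not the actual Atari emulator.

```python
def plan(model, state, actions, depth=3, gamma=0.99):
    """Return (best_value, best_action) by exhaustive depth-limited lookahead.
    `model(s, a)` is an assumed deterministic emulator returning (s', r)."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        s_next, r = model(state, a)
        future, _ = plan(model, s_next, actions, depth - 1, gamma)
        value = r + gamma * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```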
Exploration Vs Exploitation
✓Reinforcement Learning follows a trial-and-error process
✓The agent should discover a good policy
✓From its experiences of the environment
✓Without losing too much reward along the way
✓Exploration finds more information about the environment
✓Exploitation exploits known information to maximise reward
Effective reinforcement learning requires trading off exploration and exploitation (a common implementation, ε-greedy, is sketched below)
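One standard way to implement this trade-off is ε-greedy action selection: with probability ε pick a random action (explore), otherwise pick the action that currently looks best (exploit). A minimal sketch, assuming a dictionary Q of action-value estimates indexed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)                            # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploitation
```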
Examples
✓Restaurant Selection
✓Exploitation - Go to your favourite restaurant
✓Exploration - Try a new restaurant
✓Holiday planning
✓Exploitation – The campsite you have been going to since you were born
✓Exploration – Hitchhike and go with the flow
✓Game Playing
✓Exploitation - Play the move you believe is best
✓Exploration - Play an experimental move
Prediction & Control
✓Prediction: evaluate the future
✓Given a policy
✓Control: optimise the future
✓Find the best policy
Gridworld Example - Prediction
What is the value function for the uniform random policy?
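One way to answer this numerically is iterative policy evaluation, repeatedly backing up v(s) = Σ_a π(a|s) [r + γ v(s′)]. The sketch below assumes the classic 4×4 gridworld from Sutton & Barto (terminal corners, reward −1 per step, γ = 1) rather than the exact gridworld shown in the figure.

```python
import numpy as np

N = 4                                    # 4x4 gridworld, terminal corners (0,0) and (3,3)
terminals = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, move):
    r, c = s[0] + move[0], s[1] + move[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s   # bumping a wall keeps you in place

v = np.zeros((N, N))
for _ in range(1000):                    # sweep until (approximately) converged
    new_v = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue
            # Uniform random policy: each move with probability 1/4, reward -1 per step.
            new_v[r, c] = sum(0.25 * (-1 + v[step((r, c), m)]) for m in moves)
    v = new_v
print(np.round(v, 1))                    # approaches the familiar values (-14, -20, -22, ...)
```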
Gridworld Example - Control
(Figure: the optimal value function v∗ and the optimal policy π∗)
What is the optimal value function over all possible policies?
What is the optimal policy?
Wrap-up
Take home messages
✓Reinforcement learning is a general-purpose framework for decision-making
✓Reinforcement learning is for an agent with the capacity to act and
observe
✓ The state is a sufficient statistic characterizing the future
✓Depends on the history of actions and observations
✓Environment state vs agent state
✓Success is measured by a scalar reward signal
✓The goal is to select actions to maximise future reward (exploit)
✓In order to be effective we should not forget to explore
Next Lecture
Markov Decision Processes (MDP)
✓A formal model for (observable) environments in reinforcement
learning
✓Foundational approach to reinforcement learning
✓Markov property and processes
✓Discounted rewards
✓Bellman expectation equation
✓Extending to partially-observable environments
Short-term Calendar (tentative)
✓L2 – Thursday 01 Apr h. 14-16
✓L3 - Wednesday 7 Apr h. 14-16
✓L4 – Tuesday 27 Apr h. 14-16
✓L5 - Tuesday 04 May h. 14-16
✓L6 - Monday 10 May h. 16-18
✓L7-L9 – Guest lecture by Maurizio Parton