Intro to Reinforcement Learning
prof. Carlo Lucibello
Department of Computing Sciences
Bocconi University
Machine learning paradigms
● Supervised learning: “learn to predict”
● Unsupervised learning: “learn the representation”
● Reinforcement learning: “learn to control”
Supervised Learning Example: diagnosing heart abnormalities
● ML approach: teach the computer program through examples
● Ask cardiologists to label a large number of recorded ECG signals
● The learning algorithm adjusts the program/model so that its predictions agree with the cardiologists’ labels
Supervised Learning
Training set of input-output pairs
Model:
● deterministic prediction
● probabilistic prediction
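The slide’s formulas are not reproduced in the text; as a sketch, in standard notation (symbols chosen here, not taken from the slide):

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n} \qquad\quad \hat{y} = f_\theta(x) \;\;\text{(deterministic)} \qquad\quad p_\theta(y \mid x) \;\;\text{(probabilistic)}$$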
Reinforcement Learning
Games
Trading
Robotics
Reinforcement Learning in nature
Reinforcement Learning basics
Two key elements:
• Learning via trial and error
• No supervision
MAIN INGREDIENTS
● Agent: It’s me, Mario!
● State: Observations about the world
● Actions: Decisions on what to do next
● Rewards: Positive or negative consequences of the actions
● Learning: Learn to choose the action that gets the largest reward!
Basic set-up
● An agent repeatedly interacts and learns from a stochastic environment with the goal of maximizing cumulative rewards.
● “Trial-and-error” method of learning.
● Algorithmic principles motivated by psychology and neuroscience: rewards provide a positive reinforcement for an action.
Key features
Lack of a “supervisor”
● No labels telling us the best action to take
● We only have a reward signal
Delayed feedback
● The effect of an action may not be entirely visible instantaneously, but it may affect the reward signal many steps later
Sequential decisions
● The sequence in which you make your moves will decide the path you take and hence the final outcome
Actions affect observations
● Observations are a function of the agent’s own actions, which the agent may decide based on its past observations
Agent & Environment
State (Level Info)
Action (Up, Down, Left, Right….)
Rewards
Finish the level
Still in game
Game over
Get a coin
What do we want to learn?
Given the current state and the possible rewards, what is the best action we can choose?
Optimal POLICY!
LET'S FORMALIZE ALL THIS MATHEMATICALLY
State -> Action -> Reward -> State (SARS)
State (a vector of numbers)
Action (discrete choices)
Rewards
Finish the level:+100
Still in game: + 0.1 per sec
Game over: -10
Get a coin: +1
Return: cumulative reward
How much should “future” rewards count? We weigh them with a discount factor.
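Written out (the standard discounted return; the symbols are chosen here): with discount factor $\gamma \in [0, 1)$, the return from time $t$ is

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

so rewards far in the future count less than immediate ones.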
HOW CAN WE LEARN?
What do we want to learn?
Given the current state and the possible rewards, what is the best action we can choose?
First approach: Policy Learning
How can we find the best action given a specific state?
We can “build” a function, the so-called POLICY, and try to optimize it.
Deterministic policy: maps each state to an action.
Stochastic policy: maps each state to a probability distribution over actions.
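In symbols (a minimal sketch with notation chosen here, $\theta$ denoting the policy parameters):

$$a = \pi_\theta(s) \;\;\text{(deterministic)} \qquad\qquad a \sim \pi_\theta(a \mid s) \;\;\text{(stochastic)}$$

Policy-learning methods then adjust $\theta$ so as to maximize the expected return.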
Second Approach: Q-Learning
Given a policy, one can compute the average return (total reward) obtained by playing a certain action in a certain state and then continuing to play according to the policy. This is called the Q-function (written out below).
One can then use the Q-function to improve the policy, then compute a new Q-function, and so on.
E.g. the greedy policy derived from the Q-function: in each state, pick the action with the largest Q-value.
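In symbols (standard definitions; the notation is chosen here):

$$Q^\pi(s, a) = \mathbb{E}\big[\, G_t \mid s_t = s,\; a_t = a,\; \text{then follow } \pi \,\big]
\qquad\qquad
\pi'(s) = \arg\max_{a} Q^\pi(s, a)$$

where the second expression is the greedy policy obtained from $Q^\pi$.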
HOW TO LEARN THE Q-FUNCTION?
Q-Learning (Classic Algorithm)
Idea: what if we consider the expected return for each action in the different states?
       A1     A2      A3
S1     12      0     -10
S2      4     10    1000
● Q-Learning helps the agent make decisions by estimating the value of different actions in different states
● It learns from experience which action is best for each state
● You build your policy, then play the game many times and update the Q-function
Q-Table and Q-value
Learning the Q-table (pseudo-code)
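The pseudo-code itself is not reproduced in the text. Below is a minimal Python sketch of tabular Q-learning, assuming a Gymnasium-style environment with discrete states and actions (the FrozenLake notebook linked at the end uses this kind of API); the hyperparameter values are illustrative.

```python
import numpy as np

# Tabular Q-learning sketch; assumes a Gymnasium-style environment with
# discrete observation and action spaces (e.g. FrozenLake).
def q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table: one row per state, one column per action, initialized to zero
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (explore vs. exploit)
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Q-value update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```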
Q-value update

$$Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \,\big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\big]$$

where $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the reward obtained after the step, and $\max_{a'} Q(s', a')$ the maximum value of the Q-function over all possible actions in the new state $s'$.
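As a quick worked example (numbers chosen for illustration, using the table above): suppose the agent is in $S_2$, takes $A_2$ with current value $Q(S_2, A_2) = 10$, receives reward $r = 1$, lands in $S_1$ where the best action has value $12$, and uses $\alpha = 0.1$, $\gamma = 0.9$. Then the target is $1 + 0.9 \times 12 = 11.8$, and the update gives $Q(S_2, A_2) \leftarrow 10 + 0.1 \times (11.8 - 10) = 10.18$.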
DOES IT REALLY WORK?
Real world case
Chess game:
● State: 10^50
● Action: 30-40 (legal ones)
or 2^18 (possible)
● Transitions to a new state are
deterministic, but depend on
the adversary
● Rewards: 0 for each
intermediate step, {-1,0,1} at
the end
Real world case
Go game:
● State: 3^361 (possible) or
10^170 (legal)
● Action: 200 (average)
or 361 (beginning)
● Transitions to a new state are
deterministic, but depend on
the adversary
● Rewards: 0 for each
intermediate step, {-1,0,1} at
the end
Real world case
Texas Hold ‘em:
● State: 10^31×4^8 (9 players)
● Action: 4 (fold, raise, call, check)
● Transitions to a new state are
stochastic and depend on the
adversary
● Rewards: 0 for each step, {−1, 0, 1} at the end
Real world case
Tram in Milano:
● State: 18×30×2
(lines × stops × (at or going to))
● Action: 100^3 (#tram^#state)
HOW DO WE DEAL WITH COMPLEX SETTINGS?
Deep Q-Learning
Constructing this function exactly, even using Q-learning, can be an impossible task due to the large state space.
However, we can approximate it.
And to do this, the best way is to use a neural network…
Deep Q-Learning
Deep Q-Learning (pseudocode)
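The pseudocode slide is not reproduced in the text. Below is a minimal PyTorch sketch of the two core pieces, a Q-network and the loss for one batch of transitions; the architecture, names, and hyperparameters are illustrative assumptions, and details such as the replay buffer and target-network updates are omitted.

```python
import torch
import torch.nn as nn

# Minimal Q-network: maps a state vector to one Q-value per action.
class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Loss for one batch of transitions (s, a, r, s_next, done):
# regress Q(s, a) toward the target r + gamma * max_a' Q_target(s', a').
# `a` is a LongTensor of action indices, `done` a float tensor of 0/1 flags.
def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```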
A trained Q-network playing Super Mario!
AI Olympics
Tag Game - Evolving Strategies
It works!!! (Funny examples)
Tricks and Problems
Exploration vs Exploitation
● Exploration: try new actions to gather information about the environment
● Exploitation: choose actions based on the current knowledge to maximize rewards
A balance between exploration and exploitation is essential!
Exploration vs Exploitation
epsilon-greedy policy
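As a sketch, an epsilon-greedy rule can be written in a few lines of Python (the function name and default value here are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action from a vector of Q-values for the current state."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: uniformly random action
    return int(np.argmax(q_values))              # exploit: best action so far
```

With probability epsilon the agent explores at random; otherwise it exploits its current Q-estimates.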
Stuck in the walls?
Global (or delayed) rewards: do not look at the reward at every instant; look at all the rewards together.
Delayed Rewards vs. Delayed Punishment
Let's train our Neural Networks using PyTorch and Google Colab!
https://tinyurl.com/BocconiRL
Other Links
https://tinyurl.com/BocconiFrozenLake
https://tinyurl.com/FrozenLakeTEO
https://huggingface.co/learn/deep-rl-course/unit0/introduction