Introduction to Reinforcement Learning
What is reinforcement learning?
▪ Sutton and Barto, 1998: “Reinforcement learning is learning what to do - how to map situations to
actions - so as to maximize a numerical reward signal”.
▪ ChatGPT, 2022: “Reinforcement learning is a type of machine learning in which an agent learns to
interact with its environment in order to maximize a reward signal”.
[Figure: agent-environment interaction loop. The agent observes state st, takes action at, receives reward r(st, at), and the environment moves to the next state st+1.]
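This interaction loop can be written in a few lines of code. Below is a minimal sketch assuming a hypothetical environment object exposing reset() and step(action) methods (here returning next state, reward, and a done flag) and an arbitrary policy function; it illustrates the loop only, not any specific library API.

```python
# Minimal sketch of the agent-environment loop (hypothetical `env` object;
# `policy` is any function mapping a state to an action).
def run_episode(env, policy, max_steps=1000):
    """Roll out one episode and return the total collected reward."""
    state = env.reset()                          # observe initial state s0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # agent picks a_t given s_t
        state, reward, done = env.step(action)   # env returns s_{t+1} and r(s_t, a_t)
        total_reward += reward
        if done:                                 # episode terminated
            break
    return total_reward
```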
Example: Multi-agent game
2019: Learning to play hide and seek via
multi-agent reinforcement learning [1].
Recent advances
▪ 2013: Atari. Deep Q-learning for Atari games [2].
▪ 2016: Energy saving. DeepMind AI reduces Google data centre cooling bill by 40% [3].
▪ 2017: AlphaGo/AlphaZero. AI achieving grandmaster level in chess, go, and shogi [4,5].
▪ 2018: OpenAI Five. Training five artificial intelligence agents to play Dota 2 [6].
▪ 2019: AlphaStar. AI achieving grandmaster level in the StarCraft II game [7]. Rubik's Cube: solving Rubik's Cube with a human-like robot hand [8].
▪ 2022: AlphaTensor. Discovering faster matrix multiplication algorithms [9]. ChatGPT: a language model trained to generate human-like responses to text input [10].
(Potential) real world applications
▪ Robotics: teaching a robot how to walk in the wild [11].
▪ Autonomous driving.
▪ Control of power grids [12].
An interdisciplinary field
Many facets of reinforcement learning [13].
Markov decision processes
A Markov decision process is given by a tuple ℳ = (𝒮, 𝒜, P, ρ) where…
▪ 𝒮 is the set of all possible states.
▪ 𝒜 is the set of all possible actions.
▪ P is the transition law with P(s′| s, a) = Pr(st+1 = s′| st = s, at = a).
▪ ρ is the initial state distribution with ρ(s) = Pr(s0 = s).
Markov property
Pr(st+1 | st, st−1, …, s0, at, at−1, …, a0) = Pr(st+1 | st, at) → a stochastic dynamical system!
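To make the tuple ℳ = (𝒮, 𝒜, P, ρ) concrete, here is a minimal Python sketch of a toy two-state MDP; the states, actions, and probabilities are made up purely for illustration.

```python
# A toy MDP M = (S, A, P, rho) written out explicitly (all numbers are
# illustrative). P[s][a] maps a next state s' to Pr(s_{t+1}=s' | s_t=s, a_t=a).
S = ["s0", "s1"]                # state space
A = ["stay", "move"]            # action space

P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.7, "s1": 0.3}},
}

rho = {"s0": 1.0, "s1": 0.0}    # initial state distribution

# Sanity check: every transition distribution sums to 1.
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```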
Example: Deterministic MDP
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∈ 𝒜.
• End up in state st+1 ∼ P( ⋅ | st, at).
Policy
Decision rule
▪ In state s ∈ 𝒮, we take action a ∈ 𝒜 with
probability π(a | s).
▪ π( ⋅ | s) is a probability distribution over 𝒜.
Objective
Reward and discount factor
▪ Reward function r : 𝒮 × 𝒜 → ℝ.
▪ Discount rate γ ∈ (0,1).
Objective function
The goal is to find an optimal policy π* maximizing
J(π) := 𝔼[ ∑_{t=0}^{∞} γ^t r(st, at) | s0 ∼ ρ, π ]
Example: Stochastic MDP (Gridworld)
Time evolution
▪ Start in s0 ∼ ρ.
▪ At each time t:
• Take action at ∼ π( ⋅ | st).
• Get reward r(st, at).
• End up in state st+1 ∼ P( ⋅ | st, at).
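A minimal sketch of this time evolution in Python, reusing the toy two-state MDP from above together with a made-up reward table and policy; it samples one trajectory and accumulates the discounted sum of rewards that the objective J(π) takes the expectation of.

```python
# Sample one trajectory from a toy MDP under a stochastic policy and compute
# its discounted return (all numbers are illustrative).
import random

P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.7, "s1": 0.3}},
}
rho = {"s0": 1.0, "s1": 0.0}
reward = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,    # r(s, a), made up
          ("s1", "stay"): 1.0, ("s1", "move"): 1.0}
policy = {"s0": {"stay": 0.5, "move": 0.5},            # pi(a | s), made up
          "s1": {"stay": 0.9, "move": 0.1}}

def sample(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def discounted_return(gamma=0.99, horizon=200):
    s = sample(rho)                      # s0 ~ rho
    G = 0.0
    for t in range(horizon):             # truncate the infinite sum at `horizon`
        a = sample(policy[s])            # a_t ~ pi(. | s_t)
        G += gamma**t * reward[(s, a)]   # accumulate gamma^t r(s_t, a_t)
        s = sample(P[s][a])              # s_{t+1} ~ P(. | s_t, a_t)
    return G

print(discounted_return())
```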
Reinforcement learning vs. optimal control
Stochastic optimal control

max_π 𝔼[ ∑_{t=0}^{∞} γ^t r(st, at) | s0 ∼ ρ, π ]  →  For known P: dynamic programming. Still very hard for large 𝒮 and 𝒜.
Reinforcement learning
▪ In RL we can only sample from the MDP (in simulation or in the real world), but we do not know P.
▪ We need to explore the environment.
Many different approaches
Taxonomy of reinforcement learning approaches [14].
Policy gradient method

max_θ J(πθ) := 𝔼[ ∑_{t=0}^{∞} γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]

▪ Policy optimization: parameterize the policy as πθ(a | s) and then find the best policy.
▪ Direct parameterization
  πθ(a | s) = θs,a, where θ ∈ ℝ^{|𝒮|×|𝒜|}, θs,a ≥ 0 and ∑_{a∈𝒜} θs,a = 1.
Policy parameterization
▪ Softmax parameterization
  πθ(a | s) = exp(θs,a) / ∑_{a′∈𝒜} exp(θs,a′), where θ ∈ ℝ^{|𝒮|×|𝒜|}.
▪ Neural softmax parameterization
  πθ(a | s) = exp(fθ(s, a)) / ∑_{a′∈𝒜} exp(fθ(s, a′)), where fθ(s, a) represents a neural network.
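A minimal NumPy sketch of the tabular softmax parameterization (state and action counts are illustrative); a neural softmax parameterization would simply replace the table row θ[s] by the network outputs fθ(s, ⋅).

```python
# Tabular softmax policy: pi_theta(a | s) = exp(theta[s, a]) / sum_a' exp(theta[s, a']).
import numpy as np

n_states, n_actions = 4, 3                   # illustrative sizes
theta = np.zeros((n_states, n_actions))      # logits theta_{s,a}

def softmax_policy(theta, s):
    logits = theta[s] - theta[s].max()       # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()               # a probability distribution over A

def sample_action(theta, s, rng=np.random.default_rng()):
    return rng.choice(n_actions, p=softmax_policy(theta, s))

print(softmax_policy(theta, 0))              # uniform at initialization: [1/3, 1/3, 1/3]
```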
Policy gradient method

max_θ J(πθ) := 𝔼[ ∑_{t=0}^{∞} γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ]

▪ Parameterize the policy as πθ(a | s).
▪ Use gradient ascent to find the best policy π*.
▪ Pseudo-code for the policy gradient method (a sketch follows below).
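As a rough sketch, the policy gradient method simply iterates gradient ascent on θ. Here estimate_gradient is a placeholder (a toy concave objective) standing in for the Monte Carlo estimator derived later in these slides.

```python
# Generic policy gradient loop (a sketch; `estimate_gradient` is a stand-in so
# the loop runs end to end, not the real policy gradient estimator).
import numpy as np

def estimate_gradient(theta):
    # Placeholder: gradient of a toy concave objective J(theta) = -||theta - 1||^2.
    return -2.0 * (theta - 1.0)

def policy_gradient(theta0, alpha=0.1, K=100):
    theta = theta0.copy()
    for k in range(K):                       # k = 0, 1, ..., K-1
        grad = estimate_gradient(theta)      # estimate grad_theta J(pi_theta)
        theta += alpha * grad                # gradient ascent step
    return theta

print(policy_gradient(np.zeros(3)))          # converges towards the maximizer [1, 1, 1]
```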
Results for policy gradient method
▪ Can we converge to the optimal policy when K is big enough? Non-convexity may lead to a sub-optimal policy.
▪ Convergence for direct parametrization [15]:
  • Let ρ(s) > 0, ∀s ∈ 𝒮 and α ≤ (1 − γ)^3 / (2γ|𝒜|). Then min_{t≤K} J(π*) − J(πt) ≤ c1/√K.
▪ Convergence for softmax parametrization [16]:
  • Let ρ(s) > 0, ∀s ∈ 𝒮 and α ≤ (1 − γ)^3 / 8. Then J(π*) − J(πK) ≤ c2/K.
▪ Here, c1 and c2 are constants that depend on the MDP ℳ = (𝒮, 𝒜, P, ρ, r, γ).
How to compute the gradient ∇θ J(πθ)?
▪ For every random trajectory τ = (s0, a0, s1, a1, …), the probability of generating this trajectory is
  pθ(τ) := ρ(s0) ∏_{t=0}^{∞} πθ(at | st) P(st+1 | st, at).
▪ The reward we collect along this trajectory τ is
  R(τ) := ∑_{t=0}^{∞} γ^t r(st, at).
▪ Then,
  J(πθ) := 𝔼[ ∑_{t=0}^{∞} γ^t r(st, at) | s0 ∼ ρ, at ∼ πθ( ⋅ | st) ] = 𝔼_{τ∼pθ}[R(τ)].
▪ Therefore,
  ∇θ J(πθ) = ∇θ 𝔼_{τ∼pθ}[R(τ)].
How to compute the gradient?
Policy gradient theorem:
  ∇θ J(πθ) = 𝔼_{τ∼pθ}[ ( ∑_{t=0}^{∞} γ^t r(st, at) ) × ( ∑_{t=0}^{∞} ∇θ log πθ(at | st) ) ]
Proof: Because
  ∇θ J(πθ) = ∇θ 𝔼_{τ∼pθ}[R(τ)] = 𝔼_{τ∼pθ}[ R(τ) ∇θ log pθ(τ) ], where R(τ) = ∑_{t=0}^{∞} γ^t r(st, at),
and since ρ and P do not depend on θ, we have ∇θ log pθ(τ) = ∑_{t=0}^{∞} ∇θ log πθ(at | st).
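For completeness, a short derivation of the log-derivative step in LaTeX, assuming the interchange of gradient and integral is justified (e.g. bounded rewards):

```latex
\begin{align*}
\nabla_\theta J(\pi_\theta)
  &= \nabla_\theta \int R(\tau)\, p_\theta(\tau)\, d\tau
   = \int R(\tau)\, p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, d\tau
   = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right],\\
\nabla_\theta \log p_\theta(\tau)
  &= \nabla_\theta \Big( \log \rho(s_0)
     + \sum_{t=0}^{\infty} \log \pi_\theta(a_t \mid s_t)
     + \sum_{t=0}^{\infty} \log P(s_{t+1} \mid s_t, a_t) \Big)
   = \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{align*}
```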
How to estimate the gradient?
Monte Carlo approximation:
▪ Consider a random variable X ∼ q.
▪ Given independent and identically distributed samples X1, …, XN ∼ q, we can estimate
  𝔼[ f(X) ] ≈ (1/N) ∑_{i=1}^{N} f(Xi).
In reinforcement learning: we do not know P, but we can approximate
  ∇θ J(πθ) = 𝔼_{τ∼pθ}[ ( ∑_{t=0}^{∞} γ^t r(st, at) ) × ( ∑_{t=0}^{∞} ∇θ log πθ(at | st) ) ]
           ≈ (1/N) ∑_{i=1}^{N} ( ∑_{t=0}^{∞} γ^t r(s_t^i, a_t^i) ) ( ∑_{t=0}^{∞} ∇θ log πθ(a_t^i | s_t^i) )
           ≈ (1/N) ∑_{i=1}^{N} ( ∑_{t=0}^{H} γ^t r(s_t^i, a_t^i) ) ( ∑_{t=0}^{H} ∇θ log πθ(a_t^i | s_t^i) ),
where τ^i = (s_0^i, a_0^i, s_1^i, a_1^i, …), i = 1, …, N, are trajectories sampled by running πθ and H is a finite truncation horizon.
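A minimal NumPy sketch of this estimator for a tabular softmax policy on a toy two-state MDP (the transition probabilities, rewards, horizon, and sample size are all illustrative):

```python
# Monte Carlo estimate of grad_theta J(pi_theta) for a tabular softmax policy
# on a toy 2-state / 2-action MDP (all numbers illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, H, N = 2, 2, 0.9, 50, 100
P = np.array([[[1.0, 0.0], [0.2, 0.8]],      # P[s, a, s'] transition probabilities
              [[0.7, 0.3], [0.0, 1.0]]])
r = np.array([[0.0, 0.0], [1.0, 1.0]])       # r[s, a], made up
theta = np.zeros((n_states, n_actions))

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_estimate(theta):
    grad = np.zeros_like(theta)
    for _ in range(N):                        # N independent trajectories
        s, G = 0, 0.0                         # initial state s0 = 0 (rho is a point mass)
        score = np.zeros_like(theta)          # sum_t grad_theta log pi_theta(a_t | s_t)
        for t in range(H):                    # truncated horizon H
            p = pi(theta, s)
            a = rng.choice(n_actions, p=p)
            G += gamma**t * r[s, a]
            score[s] -= p                     # grad of log softmax: one_hot(a) - pi(. | s)
            score[s, a] += 1.0
            s = rng.choice(n_states, p=P[s, a])
        grad += G * score
    return grad / N

print(grad_estimate(theta))
```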
Stochastic policy gradient method
Demonstration of policy gradient method
Using the policy gradient method to play CartPole.
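Below is a hedged sketch of what such a demonstration could look like: REINFORCE with a linear softmax policy on CartPole, assuming the gymnasium package is available; hyperparameters are untuned and only meant to illustrate the training loop.

```python
# REINFORCE on CartPole with a linear softmax policy (a sketch, not a tuned demo).
import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")
n_obs = env.observation_space.shape[0]       # 4 state features
n_act = env.action_space.n                   # 2 actions
W = np.zeros((n_act, n_obs))                 # policy parameters theta
rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.01

def policy(obs):
    logits = W @ obs
    z = np.exp(logits - logits.max())
    return z / z.sum()

for episode in range(300):
    obs, _ = env.reset()
    grads, rewards = [], []
    done = False
    while not done:
        p = policy(obs)
        a = int(rng.choice(n_act, p=p))
        # grad of log pi(a | s) w.r.t. W: (one_hot(a) - p) outer obs
        grads.append(np.outer(np.eye(n_act)[a] - p, obs))
        obs, reward, terminated, truncated, _ = env.step(a)
        rewards.append(reward)
        done = terminated or truncated
    # Discounted return of the whole episode times the summed score function.
    G = sum(gamma**t * rew for t, rew in enumerate(rewards))
    W += alpha * G * sum(grads)
    if episode % 50 == 0:
        print(f"episode {episode}: return {sum(rewards):.0f}")
env.close()
```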
Pros and cons of reinforcement learning
Pros
▪ General methods for complex tasks
▪ Adapt to changing environments
▪ Model-free: no need to know the dynamics model
Cons
▪ Sample inefficiency for model-free approaches
▪ Lack of safety and convergence guarantees
▪ Hard to assign meaningful rewards
Reinforcement learning projects in Sycamore lab
Bachelor projects
▪ Policy optimization for Pacman
Semester and master projects
▪ Safe reinforcement learning
▪ Inverse reinforcement learning
References
▪ [1] Baker, Bowen, et al. "Emergent tool use from multi-agent autocurricula." arXiv preprint arXiv:1909.07528 (2019).
▪ [2] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
▪ [3] DeepMind AI Reduces Google Data Centre Cooling Bill by 40%. https://www.deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-by-40. Accessed 15 Dec. 2022.
▪ [4] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
▪ [5] Silver, David, et al. "Mastering chess and shogi by self-play with a general reinforcement learning algorithm." arXiv preprint arXiv:1712.01815 (2017).
▪ [6] Berner, Christopher, et al. "Dota 2 with large scale deep reinforcement learning." arXiv preprint arXiv:1912.06680 (2019).
▪ [7] Arulkumaran, Kai, Antoine Cully, and Julian Togelius. "AlphaStar: An evolutionary computation perspective." Proceedings of the Genetic and Evolutionary Computation Conference Companion. 2019.
▪ [8] Akkaya, Ilge, et al. "Solving Rubik's Cube with a robot hand." arXiv preprint arXiv:1910.07113 (2019).
▪ [9] Fawzi, Alhussein, et al. "Discovering faster matrix multiplication algorithms with reinforcement learning." Nature 610.7930 (2022): 47-53.
▪ [10] "ChatGPT: Optimizing Language Models for Dialogue." OpenAI, 30 Nov. 2022, https://openai.com/blog/chatgpt/.
▪ [11] Miki, Takahiro, et al. "Learning robust perceptive locomotion for quadrupedal robots in the wild." Science Robotics 7.62 (2022): eabk2822.
▪ [12] Ibrahim, Muhammad Sohail, Wei Dong, and Qiang Yang. "Machine learning driven smart electric power systems: Current trends and new perspectives." Applied Energy 272 (2020): 115237.
▪ [13] Niao He, Lecture notes on "Introduction to Reinforcement Learning", ETH Zurich, 2021, https://odi.inf.ethz.ch/files/zinal/Lecture-1-RL-introduction.pdf
▪ [14] OpenAI, "Taxonomy of RL Algorithms", https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#citations-below
▪ [15] Agarwal, Alekh, et al. "Optimality and approximation with policy gradient methods in Markov decision processes." Conference on Learning Theory. PMLR, 2020.
▪ [16] Mei, Jincheng, et al. "On the global convergence rates of softmax policy gradient methods." International Conference on Machine Learning. PMLR, 2020.