Actor-Critic Algorithm
Jie-Han Chen
NetDB, National Cheng Kung University
5/29, 2018 @ National Cheng Kung University, Taiwan
1
Some content and images in these slides were borrowed from:
1. Sergey Levine’s Deep Reinforcement Learning class at UC Berkeley
2. David Silver’s Reinforcement Learning class at UCL
Disclaimer
2
Outline
● Recap policy gradient
● Actor-Critic algorithm
● Recommended Papers
3
Recap: policy gradient
REINFORCE algorithm (see the code sketch below):
1. Sample trajectories $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$
2. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \Big(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\Big)\Big(\sum_t r(s_t^i, a_t^i)\Big)$
3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
4
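To make the recap concrete, here is a minimal PyTorch-style sketch of the REINFORCE update above; the network shape, the `PolicyNet`/`reinforce_update` names, and the trajectory format are illustrative assumptions, not part of the original slides.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small policy network for a discrete action space (sizes are arbitrary)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, trajectory):
    """One REINFORCE step. trajectory: list of (obs, action, reward) for a full episode."""
    obs = torch.as_tensor([o for o, _, _ in trajectory], dtype=torch.float32)
    actions = torch.as_tensor([a for _, a, _ in trajectory])
    total_return = sum(r for _, _, r in trajectory)   # Monte Carlo return of the whole episode
    log_probs = policy(obs).log_prob(actions)         # log pi_theta(a_t | s_t)
    loss = -(log_probs.sum() * total_return)          # ascend on J  <=>  descend on -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # theta <- theta + alpha * grad J
```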
Recap: policy gradient
REINFORCE algorithm: the same three steps as above.
Issue: we must sample the whole trajectory to obtain the return term $\sum_t r(s_t^i, a_t^i)$ (Monte Carlo), which makes policy gradient learn slowly.
5
Recap: policy gradient
REINFORCE algorithm: the same three steps as above.
Can we learn step by step instead?
6
Actor-Critic algorithm
In vanilla policy gradient, we can only evaluate our policy after we finish the whole episode.
7
Why don’t you tell me how good the policy is at an earlier stage?
Actor-Critic algorithm
In vanilla policy gradient, we can only evaluate our policy after we finish the whole episode.
Hello, the critic is here. I’ll give you a score at every step.
Why don’t you tell me how good the policy is at an earlier stage?
8
Actor-Critic algorithm
Objective of vanilla policy gradient: $J(\theta) = E_{\tau \sim \pi_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$
With causality, the reward-to-go from step $t$ onward can be replaced by the expected action-value function $Q^\pi(s_t, a_t)$.
If we can estimate the action-value function at each step, we can improve learning efficiency with TD learning.
9
Actor-Critic algorithm
Policy gradient with causality: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, Q^\pi(s_t^i, a_t^i)$
Question: how do we get the action-value function $Q$?
10
Actor-Critic algorithm
Policy gradient with causality: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, Q^\pi(s_t^i, a_t^i)$
Question: how do we get the action-value function $Q$?
We use another neural network, called the critic, to approximate the value function; this combination is the so-called Actor-Critic. With a critic network we can update the policy step by step, but the bootstrapped estimate also introduces bias.
Policy Network
Critic Network
11
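As a sketch of what that second network might look like, here is a hypothetical critic module that maps a state to a scalar value estimate; the layer sizes and class name are assumptions for illustration only.

```python
import torch.nn as nn

class CriticNet(nn.Module):
    """Critic: maps a state to a scalar value estimate V_hat(s). Sizes are arbitrary."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)   # shape: (batch,) value estimates
```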
Actor-Critic algorithm
Which kind of value function should the critic network approximate?
12
Actor-Critic algorithm
Which kind of value function should the critic network approximate?
We usually fit the state-value function $V^\pi(s)$. I’ll show you the reason soon. (Other choices are also fine.)
13
Actor-Critic algorithm
The objective of Actor-Critic (policy gradient form): $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat Q^\pi(s_t^i, a_t^i)$
This version has lower variance but higher bias than REINFORCE when we learn it by TD.
Can we also subtract a baseline to reduce the variance?
14
Actor-Critic algorithm
The objective of Actor-Critic (policy gradient form): $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat Q^\pi(s_t^i, a_t^i)$
This version has lower variance but higher bias than REINFORCE when we learn it by TD.
Can we also subtract a baseline to reduce the variance?
Yes! We can subtract the state-value function $V^\pi(s_t)$.
15
Actor-Critic algorithm
The objective of Actor-Critic with a value-function baseline (see the sketch below):
$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, A^\pi(s_t^i, a_t^i)$
$V^\pi(s_t)$: the average return when an agent following the same policy faces the same state.
$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$: the advantage function, which reflects how good the action we’ve taken is compared with the other candidates.
$Q^\pi(s_t, a_t)$: how good the action we take from the current state is.
16
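A hedged sketch of how such an advantage estimate can be computed from a learned value function $\hat V^\pi_\phi$; the `v_hat` callable, tensor shapes, and terminal-state handling are assumptions for illustration.

```python
import torch

def advantage_estimate(v_hat, obs, reward, next_obs, done, gamma=0.99):
    """A_hat(s, a) = r + gamma * V_hat(s') - V_hat(s), treated as a constant for the actor."""
    with torch.no_grad():
        v_next = v_hat(next_obs) * (1.0 - done)   # no bootstrapping past terminal states
        return reward + gamma * v_next - v_hat(obs)
```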
Actor-Critic algorithm
(Slides 17-21: a step-by-step derivation, originally shown as figures, that the critic only needs the state-value function: $Q^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V^\pi(s_{t+1})$, so the advantage can be estimated as $A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$.)
Actor-Critic algorithm
We just fit the value function!
22
Actor-Critic algorithm: fit value function
Monte Carlo evaluation:
We sample multiple trajectories and use the observed return as the regression target: $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$.
Then compute the loss by supervised regression: $\mathcal{L}(\phi) = \frac{1}{2}\sum_i \big\| \hat V^\pi_\phi(s_i) - y_i \big\|^2$.
23
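A minimal sketch of these Monte Carlo targets and the regression loss, assuming rewards are stored as a plain list per trajectory; the function names and the default undiscounted return are assumptions matching the slide.

```python
import torch

def mc_value_targets(rewards, gamma=1.0):
    """y_t = sum_{t' >= t} gamma^(t'-t) * r_t', computed backwards over one trajectory."""
    targets, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return torch.tensor(list(reversed(targets)), dtype=torch.float32)

def mc_value_loss(v_hat, obs, rewards, gamma=1.0):
    """Supervised regression of V_hat(s_t) onto the Monte Carlo returns."""
    y = mc_value_targets(rewards, gamma)
    return 0.5 * ((v_hat(obs) - y) ** 2).sum()
```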
Actor-Critic algorithm: fit value function
TD evaluation:
Training sample: $y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \hat V^\pi_\phi(s_{i,t+1})$.
Then compute the loss by supervised regression: $\mathcal{L}(\phi) = \frac{1}{2}\sum_i \big\| \hat V^\pi_\phi(s_i) - y_i \big\|^2$.
24
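The corresponding TD regression, again as a hedged sketch (the `v_hat` callable and discount value are assumptions):

```python
import torch

def td_value_loss(v_hat, obs, reward, next_obs, done, gamma=0.99):
    """Regress V_hat(s) onto the bootstrapped target r + gamma * V_hat(s')."""
    with torch.no_grad():                                       # target treated as a constant
        y = reward + gamma * v_hat(next_obs) * (1.0 - done)
    return 0.5 * ((v_hat(obs) - y) ** 2).sum()
```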
Actor-Critic algorithm
Online actor-critic algorithm (a single-iteration code sketch follows below):
1. Take action $a \sim \pi_\theta(a \mid s)$, get one-step experience $(s, a, s', r)$
2. Fit the value function $\hat V^\pi_\phi(s)$ to the TD target $r + \gamma \hat V^\pi_\phi(s')$
3. Evaluate the advantage $\hat A^\pi(s, a) = r + \gamma \hat V^\pi_\phi(s') - \hat V^\pi_\phi(s)$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \mid s)\, \hat A^\pi(s, a)$
5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
25
Referenced from CS294 at UC Berkeley.
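Putting the five steps together, a hedged single-iteration sketch; the old-style Gym API, separate `policy`/`v_hat` networks, and optimizer names are assumptions, not the original authors' code.

```python
import torch

def online_actor_critic_step(env, obs, policy, v_hat, pi_opt, v_opt, gamma=0.99):
    """One iteration of the online loop; old-style Gym API assumed for env.step/reset."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    dist = policy(obs_t)
    action = dist.sample()                                       # 1. take action
    next_obs, reward, done, _ = env.step(action.item())
    next_obs_t = torch.as_tensor(next_obs, dtype=torch.float32)

    with torch.no_grad():
        target = reward + gamma * v_hat(next_obs_t) * (1.0 - float(done))
    critic_loss = 0.5 * (v_hat(obs_t) - target) ** 2             # 2. fit value function
    v_opt.zero_grad(); critic_loss.backward(); v_opt.step()

    with torch.no_grad():
        adv = target - v_hat(obs_t)                              # 3. advantage r + gamma*V(s') - V(s)
    actor_loss = -dist.log_prob(action) * adv                    # 4. grad log pi * A_hat
    pi_opt.zero_grad(); actor_loss.backward(); pi_opt.step()     # 5. gradient ascent on theta

    return env.reset() if done else next_obs
```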
Actor-Critic algorithm
26
Network Architecture
Neural architecture plays an important role in deep learning. In the Actor-Critic algorithm, there are two common kinds of network architecture:
● Separate policy network and critic network
● Two-head network
27
Network Architecture
Separate policy and critic networks
● More parameters
● More stable in training
(Figure: separate policy network and critic network)
28
Network Architecture
Two-head network (see the sketch below)
● Shared features, fewer parameters
● Harder to find a good coefficient to balance the actor loss and the critic loss
(Figure: shared trunk with a policy head and a value head)
29
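A minimal two-head architecture sketch along these lines; the hidden size and the 0.5 loss coefficient are illustrative assumptions, and the coefficient is exactly the quantity that is hard to balance in practice.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk with a policy head and a value head (hidden size is arbitrary)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # shared features
        self.policy_head = nn.Linear(hidden, n_actions)                    # actor logits
        self.value_head = nn.Linear(hidden, 1)                             # critic V_hat(s)

    def forward(self, obs):
        h = self.trunk(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        return dist, self.value_head(h).squeeze(-1)

def combined_loss(actor_loss, critic_loss, critic_coef=0.5):
    """critic_coef is the coefficient that must be tuned to balance the two losses."""
    return actor_loss + critic_coef * critic_loss
```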
AlphaGo
● MCTS, Actor-Critic algorithm
● Separate network architecture
30
AlphaGo Zero
● MCTS, Policy Iteration
● Shared ResNet
31
Correlation Issue
Online actor-critic algorithm:
1. Take action $a \sim \pi_\theta(a \mid s)$, get one-step experience $(s, a, s', r)$
2. Fit the value function $\hat V^\pi_\phi(s)$
3. Evaluate the advantage $\hat A^\pi(s, a)$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \mid s)\, \hat A^\pi(s, a)$
5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
In the online actor-critic algorithm there is still a correlation problem: consecutive samples are highly correlated.
Policy gradient is an on-policy algorithm, so we cannot simply use a replay buffer to break the correlation.
Think about how to solve the correlation problem in Actor-Critic (one possibility is sketched below).
32
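One common answer, previewed before the A3C slides: run several independent environments with the same policy so that a single update batch mixes transitions from uncorrelated states. A minimal sketch under that assumption (the environment list and old-style Gym API are assumptions):

```python
import torch

def collect_decorrelated_batch(envs, policy, obs_list):
    """Step several independent environments once each so one batch mixes different states."""
    batch = []
    for i, (env, obs) in enumerate(zip(envs, obs_list)):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        action = policy(obs_t).sample()
        next_obs, reward, done, _ = env.step(action.item())
        batch.append((obs, action.item(), reward, next_obs, done))
        obs_list[i] = env.reset() if done else next_obs   # keep each env rolling
    return batch, obs_list
```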
Advanced Actor-Critic algorithms
Currently, many state-of-the-art RL algorithms are built on top of the Actor-Critic algorithm:
● Asynchronous Advantage Actor-Critic (A3C)
● Synchronous Advantage Actor-Critic (A2C)
● Trust Region Policy Optimization (TRPO)
● Proximal Policy Optimization (PPO)
● Deep Deterministic Policy Gradient (DDPG)
33
Advanced Actor-Critic algorithms
Asynchronous Advantage Actor-Critic (A3C)
V. Mnih et al. (ICML 2016) proposed a parallel version of the Actor-Critic algorithm that not only mitigates the correlation problem but also speeds up the learning process: the so-called A3C (Asynchronous Advantage Actor-Critic). A3C became a state-of-the-art baseline in 2016-2017.
34
Asynchronous Advantage Actor-Critic
A3C uses multiple workers to sample n-step experience. Each worker has access to a shared global network and its own local network.
Upon collecting enough experience, each worker backpropagates through its local network to compute gradients, copies those gradients to the shared global network, and updates the global network asynchronously.
35
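A hedged sketch of this worker update pattern, in the style of common PyTorch A3C implementations; the `_grad` hand-off trick and function name are assumptions, not the authors' exact code.

```python
import torch

def a3c_worker_update(local_net, global_net, global_opt, loss):
    """Compute gradients on the local copy, hand them to the shared global net, update it."""
    global_opt.zero_grad()
    loss.backward()                                          # gradients end up in local_net
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad                                   # common PyTorch A3C hand-off trick
    global_opt.step()                                        # asynchronous, lock-free update
    local_net.load_state_dict(global_net.state_dict())      # re-sync the local copy
```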
Asynchronous Advantage Actor-Critic
● Asynchronous Advantage Actor-Critic
○ proposed by DeepMind
○ uses n-step bootstrapping
○ easier to implement (no locks needed)
○ some variability in exploration between workers
○ uses only CPUs; poor GPU utilization
36
Synchronous Advantage Actor-Critic
● Synchronous Advantage Actor-Critic (A2C) (see the sketch below)
○ proposed by OpenAI
○ synchronous workers
○ uses the same n-step trick
○ better GPU utilization
37
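A hedged sketch of a synchronous (A2C-style) update, where every worker's n-step rollout is stacked into one batch before a single gradient step; the batch keys and the two-head network interface are illustrative assumptions.

```python
import torch

def a2c_update(net, optimizer, batch, value_coef=0.5):
    """Synchronous update: all workers' rollouts are batched, then one gradient step is taken.

    batch: dict with stacked tensors 'obs', 'actions', 'returns' (names are illustrative).
    """
    dist, values = net(batch["obs"])                               # two-head net as sketched earlier
    adv = batch["returns"] - values.detach()                       # n-step advantage estimate
    actor_loss = -(dist.log_prob(batch["actions"]) * adv).mean()
    critic_loss = 0.5 * ((values - batch["returns"]) ** 2).mean()
    loss = actor_loss + value_coef * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```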
The difference between A3C and A2C
38
The usage of Actor-Critic algorithm
Gaming and model research; robotics and continuous control
39
The devil is in the details
When we do research on reinforcement learning, computational resources need to be taken into account.
A3C/A2C use a lot of memory to cache each worker's forward-pass state, especially with n-step bootstrapping.
40
From StarCraft II: A New Challenge for Reinforcement Learning
1. Volodymyr Mnih et al. (ICML 2016): A3C, Asynchronous Methods for Deep Reinforcement Learning
2. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. (ICML 2015): TRPO,
Trust Region Policy Optimization
3. John Schulman et al. (2017): PPO, Proximal Policy Optimization Algorithms
4. Timothy P. Lillicrap et al. (ICLR 2016): DDPG, Continuous control with deep reinforcement learning
5. David Silver, Aja Huang et al. (Nature 2016): AlphaGo, Mastering the Game of Go with Deep Neural
Networks and Tree Search
6. David Silver et al. (Nature 2017): AlphaGo Zero, Mastering the game of Go without human knowledge
7. Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov. (ICLR 2016): Actor-Mimic: Deep Multitask and
Transfer Reinforcement Learning
Related Papers
41