Actor-Critic Algorithm
Jie-Han Chen
NetDB, National Cheng Kung University
5/29, 2018 @ National Cheng Kung University, Taiwan
1
Some content and images in these slides were borrowed from:
1. Sergey Levine’s Deep Reinforcement Learning class at UC Berkeley
2. David Silver’s Reinforcement Learning class at UCL
Disclaimer
2
Outline
● Recap policy gradient
● Actor-Critic algorithm
● Recommended Papers
3
Recap: policy gradient
REINFORCE algorithm (see the code sketch below):
1. Sample trajectories $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$
2. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \Big(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\Big)\Big(\sum_t r(s_t^i, a_t^i)\Big)$
3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
4
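To make the recap concrete, here is a minimal PyTorch-style sketch of the REINFORCE update above; the network shape, the `PolicyNet`/`reinforce_update` names, and the trajectory format are illustrative assumptions, not part of the original slides.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small policy network for a discrete action space (sizes are arbitrary)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, trajectory):
    """One REINFORCE step. trajectory: list of (obs, action, reward) for a full episode."""
    obs = torch.as_tensor([o for o, _, _ in trajectory], dtype=torch.float32)
    actions = torch.as_tensor([a for _, a, _ in trajectory])
    total_return = sum(r for _, _, r in trajectory)   # Monte Carlo return of the whole episode
    log_probs = policy(obs).log_prob(actions)         # log pi_theta(a_t | s_t)
    loss = -(log_probs.sum() * total_return)          # ascend on J  <=>  descend on -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # theta <- theta + alpha * grad J
```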
Recap: policy gradient
REINFORCE algorithm: the same three steps as above.
Issue: we must sample the whole trajectory to obtain the return term $\sum_t r(s_t^i, a_t^i)$ (Monte Carlo), which makes policy gradient learn slowly.
5
Recap: policy gradient
REINFORCE algorithm: the same three steps as above.
Can we learn step by step instead?
6
Actor-Critic algorithm
In vanilla policy gradient, we can only evaluate our policy after we finish the whole episode.
7
Why don’t you tell me how good the policy is at an earlier stage?
Actor-Critic algorithm
In vanilla policy gradient, we can only evaluate our policy after we finish the whole episode.
Hello, the critic is here. I’ll give you a score at every step.
Why don’t you tell me how good the policy is at an earlier stage?
8
Actor-Critic algorithm
Objective of vanilla policy gradient: $J(\theta) = E_{\tau \sim \pi_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$
With causality, the reward-to-go from step $t$ onward can be replaced by the expected action-value function $Q^\pi(s_t, a_t)$.
If we can estimate the action-value function at each step, we can improve learning efficiency with TD learning.
9
Actor-Critic algorithm
Policy gradient with causality: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, Q^\pi(s_t^i, a_t^i)$
Question: how do we get the action-value function $Q$?
10
Actor-Critic algorithm
Policy gradient with causality: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, Q^\pi(s_t^i, a_t^i)$
Question: how do we get the action-value function $Q$?
We use another neural network, called the critic, to approximate the value function; this combination is the so-called Actor-Critic. With a critic network we can update the policy step by step, but the bootstrapped estimate also introduces bias.
Policy Network
Critic Network
11
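As a sketch of what that second network might look like, here is a hypothetical critic module that maps a state to a scalar value estimate; the layer sizes and class name are assumptions for illustration only.

```python
import torch.nn as nn

class CriticNet(nn.Module):
    """Critic: maps a state to a scalar value estimate V_hat(s). Sizes are arbitrary."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)   # shape: (batch,) value estimates
```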
Actor-Critic algorithm
Which kind of value function should the critic network approximate?
12
Actor-Critic algorithm
Which kind of value function should the critic network approximate?
We usually fit the state-value function $V^\pi(s)$. I’ll show you the reason soon. (Other choices are also fine.)
13
Actor-Critic algorithm
The objective of Actor-Critic (policy gradient form): $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat Q^\pi(s_t^i, a_t^i)$
This version has lower variance but higher bias than REINFORCE when we learn it by TD.
Can we also subtract a baseline to reduce the variance?
14
Actor-Critic algorithm
The objective of Actor-Critic (policy gradient form): $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat Q^\pi(s_t^i, a_t^i)$
This version has lower variance but higher bias than REINFORCE when we learn it by TD.
Can we also subtract a baseline to reduce the variance?
Yes! We can subtract the state-value function $V^\pi(s_t)$.
15
Actor-Critic algorithm
The objective of Actor-Critic with a value-function baseline (see the sketch below):
$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, A^\pi(s_t^i, a_t^i)$
$V^\pi(s_t)$: the average return when an agent following the same policy faces the same state.
$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$: the advantage function, which reflects how good the action we’ve taken is compared with the other candidates.
$Q^\pi(s_t, a_t)$: how good the action we take from the current state is.
16
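A hedged sketch of how such an advantage estimate can be computed from a learned value function $\hat V^\pi_\phi$; the `v_hat` callable, tensor shapes, and terminal-state handling are assumptions for illustration.

```python
import torch

def advantage_estimate(v_hat, obs, reward, next_obs, done, gamma=0.99):
    """A_hat(s, a) = r + gamma * V_hat(s') - V_hat(s), treated as a constant for the actor."""
    with torch.no_grad():
        v_next = v_hat(next_obs) * (1.0 - done)   # no bootstrapping past terminal states
        return reward + gamma * v_next - v_hat(obs)
```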
Actor-Critic algorithm
(Slides 17-21: a step-by-step derivation, originally shown as figures, that the critic only needs the state-value function: $Q^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V^\pi(s_{t+1})$, so the advantage can be estimated as $A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$.)
Actor-Critic algorithm
We just fit the value function!
22
Actor-Critic algorithm: fit value function
Monte Carlo evaluation:
We sample multiple trajectories and use the observed return as the regression target: $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$.
Then compute the loss by supervised regression: $\mathcal{L}(\phi) = \frac{1}{2}\sum_i \big\| \hat V^\pi_\phi(s_i) - y_i \big\|^2$.
23
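A minimal sketch of these Monte Carlo targets and the regression loss, assuming rewards are stored as a plain list per trajectory; the function names and the default undiscounted return are assumptions matching the slide.

```python
import torch

def mc_value_targets(rewards, gamma=1.0):
    """y_t = sum_{t' >= t} gamma^(t'-t) * r_t', computed backwards over one trajectory."""
    targets, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return torch.tensor(list(reversed(targets)), dtype=torch.float32)

def mc_value_loss(v_hat, obs, rewards, gamma=1.0):
    """Supervised regression of V_hat(s_t) onto the Monte Carlo returns."""
    y = mc_value_targets(rewards, gamma)
    return 0.5 * ((v_hat(obs) - y) ** 2).sum()
```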
Actor-Critic algorithm: fit value function
TD evaluation:
Training sample: $y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \hat V^\pi_\phi(s_{i,t+1})$.
Then compute the loss by supervised regression: $\mathcal{L}(\phi) = \frac{1}{2}\sum_i \big\| \hat V^\pi_\phi(s_i) - y_i \big\|^2$.
24
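The corresponding TD regression, again as a hedged sketch (the `v_hat` callable and discount value are assumptions):

```python
import torch

def td_value_loss(v_hat, obs, reward, next_obs, done, gamma=0.99):
    """Regress V_hat(s) onto the bootstrapped target r + gamma * V_hat(s')."""
    with torch.no_grad():                                       # target treated as a constant
        y = reward + gamma * v_hat(next_obs) * (1.0 - done)
    return 0.5 * ((v_hat(obs) - y) ** 2).sum()
```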
Actor-Critic algorithm
Online actor-critic algorithm (a single-iteration code sketch follows below):
1. Take action $a \sim \pi_\theta(a \mid s)$, get one-step experience $(s, a, s', r)$
2. Fit the value function $\hat V^\pi_\phi(s)$ to the TD target $r + \gamma \hat V^\pi_\phi(s')$
3. Evaluate the advantage $\hat A^\pi(s, a) = r + \gamma \hat V^\pi_\phi(s') - \hat V^\pi_\phi(s)$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \mid s)\, \hat A^\pi(s, a)$
5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
25
Referenced from CS294 at UC Berkeley.
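Putting the five steps together, a hedged single-iteration sketch; the old-style Gym API, separate `policy`/`v_hat` networks, and optimizer names are assumptions, not the original authors' code.

```python
import torch

def online_actor_critic_step(env, obs, policy, v_hat, pi_opt, v_opt, gamma=0.99):
    """One iteration of the online loop; old-style Gym API assumed for env.step/reset."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    dist = policy(obs_t)
    action = dist.sample()                                       # 1. take action
    next_obs, reward, done, _ = env.step(action.item())
    next_obs_t = torch.as_tensor(next_obs, dtype=torch.float32)

    with torch.no_grad():
        target = reward + gamma * v_hat(next_obs_t) * (1.0 - float(done))
    critic_loss = 0.5 * (v_hat(obs_t) - target) ** 2             # 2. fit value function
    v_opt.zero_grad(); critic_loss.backward(); v_opt.step()

    with torch.no_grad():
        adv = target - v_hat(obs_t)                              # 3. advantage r + gamma*V(s') - V(s)
    actor_loss = -dist.log_prob(action) * adv                    # 4. grad log pi * A_hat
    pi_opt.zero_grad(); actor_loss.backward(); pi_opt.step()     # 5. gradient ascent on theta

    return env.reset() if done else next_obs
```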
Actor-Critic algorithm
26
Network Architecture
Neural architecture plays an important role in deep learning. In the Actor-Critic algorithm, there are two common kinds of network architecture:
● Separate policy network and critic network
● Two-head network
27
Network Architecture
Separate policy and critic networks
● More parameters
● More stable in training
(Figure: separate policy network and critic network)
28
Network Architecture
Two-head network (see the sketch below)
● Shared features, fewer parameters
● Harder to find a good coefficient to balance the actor loss and the critic loss
(Figure: shared trunk with a policy head and a value head)
29
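A minimal two-head architecture sketch along these lines; the hidden size and the 0.5 loss coefficient are illustrative assumptions, and the coefficient is exactly the quantity that is hard to balance in practice.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk with a policy head and a value head (hidden size is arbitrary)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # shared features
        self.policy_head = nn.Linear(hidden, n_actions)                    # actor logits
        self.value_head = nn.Linear(hidden, 1)                             # critic V_hat(s)

    def forward(self, obs):
        h = self.trunk(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        return dist, self.value_head(h).squeeze(-1)

def combined_loss(actor_loss, critic_loss, critic_coef=0.5):
    """critic_coef is the coefficient that must be tuned to balance the two losses."""
    return actor_loss + critic_coef * critic_loss
```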
AlphaGo
● MCTS, Actor-Critic algorithm
● Separate network architecture
30
AlphaGo Zero
● MCTS, Policy Iteration
● Shared ResNet
31
Correlation Issue
Online actor-critic algorithm:
1. Take action $a \sim \pi_\theta(a \mid s)$, get one-step experience $(s, a, s', r)$
2. Fit the value function $\hat V^\pi_\phi(s)$
3. Evaluate the advantage $\hat A^\pi(s, a)$
4. $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \mid s)\, \hat A^\pi(s, a)$
5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
In the online actor-critic algorithm there is still a correlation problem: consecutive samples are highly correlated.
Policy gradient is an on-policy algorithm, so we cannot simply use a replay buffer to break the correlation.
Think about how to solve the correlation problem in Actor-Critic (one possibility is sketched below).
32
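One common answer, previewed before the A3C slides: run several independent environments with the same policy so that a single update batch mixes transitions from uncorrelated states. A minimal sketch under that assumption (the environment list and old-style Gym API are assumptions):

```python
import torch

def collect_decorrelated_batch(envs, policy, obs_list):
    """Step several independent environments once each so one batch mixes different states."""
    batch = []
    for i, (env, obs) in enumerate(zip(envs, obs_list)):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        action = policy(obs_t).sample()
        next_obs, reward, done, _ = env.step(action.item())
        batch.append((obs, action.item(), reward, next_obs, done))
        obs_list[i] = env.reset() if done else next_obs   # keep each env rolling
    return batch, obs_list
```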
Advanced Actor-Critic algorithms
Currently, many state-of-the-art RL algorithms are built on top of the Actor-Critic algorithm:
● Asynchronous Advantage Actor-Critic (A3C)
● Synchronous Advantage Actor-Critic (A2C)
● Trust Region Policy Optimization (TRPO)
● Proximal Policy Optimization (PPO)
● Deep Deterministic Policy Gradient (DDPG)
33
Advanced Actor-Critic algorithms
Asynchronous Advantage Actor-Critic (A3C)
V. Mnih et al. (ICML 2016) proposed a parallel version of the Actor-Critic algorithm that not only mitigates the correlation problem but also speeds up the learning process: the so-called A3C (Asynchronous Advantage Actor-Critic). A3C became a state-of-the-art baseline in 2016-2017.
34
Asynchronous Advantage Actor-Critic
A3C uses multiple workers to sample n-step experience. Each worker has access to a shared global network and its own local network.
Upon collecting enough experience, each worker backpropagates through its local network to compute gradients, copies those gradients to the shared global network, and updates the global network asynchronously.
35
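A hedged sketch of this worker update pattern, in the style of common PyTorch A3C implementations; the `_grad` hand-off trick and function name are assumptions, not the authors' exact code.

```python
import torch

def a3c_worker_update(local_net, global_net, global_opt, loss):
    """Compute gradients on the local copy, hand them to the shared global net, update it."""
    global_opt.zero_grad()
    loss.backward()                                          # gradients end up in local_net
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad                                   # common PyTorch A3C hand-off trick
    global_opt.step()                                        # asynchronous, lock-free update
    local_net.load_state_dict(global_net.state_dict())      # re-sync the local copy
```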
Asynchronous Advantage Actor-Critic
● Asynchronous Advantage Actor-Critic
○ proposed by DeepMind
○ uses n-step bootstrapping
○ easier to implement (no locks needed)
○ some variability in exploration between workers
○ uses only CPUs; poor GPU utilization
36
Synchronous Advantage Actor-Critic
● Synchronous Advantage Actor-Critic (A2C) (see the sketch below)
○ proposed by OpenAI
○ synchronous workers
○ uses the same n-step trick
○ better GPU utilization
37
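A hedged sketch of a synchronous (A2C-style) update, where every worker's n-step rollout is stacked into one batch before a single gradient step; the batch keys and the two-head network interface are illustrative assumptions.

```python
import torch

def a2c_update(net, optimizer, batch, value_coef=0.5):
    """Synchronous update: all workers' rollouts are batched, then one gradient step is taken.

    batch: dict with stacked tensors 'obs', 'actions', 'returns' (names are illustrative).
    """
    dist, values = net(batch["obs"])                               # two-head net as sketched earlier
    adv = batch["returns"] - values.detach()                       # n-step advantage estimate
    actor_loss = -(dist.log_prob(batch["actions"]) * adv).mean()
    critic_loss = 0.5 * ((values - batch["returns"]) ** 2).mean()
    loss = actor_loss + value_coef * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```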
The difference between A3C and A2C
38
The usage of Actor-Critic algorithm
Gaming and model research; robotics and continuous control
39
The devil is in the details
When we do research on reinforcement learning, computational resources need to be taken into account.
A3C/A2C use a lot of memory to cache each worker's forward-pass state, especially with n-step bootstrapping.
40
From StarCraft II: A New Challenge for Reinforcement Learning
1. Volodymyr Mnih et al. (ICML 2016): A3C, Asynchronous Methods for Deep Reinforcement Learning
2. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. (ICML 2015): TRPO,
Trust Region Policy Optimization
3. John Schulman et al. (2017): PPO, Proximal Policy Optimization Algorithms
4. Timothy P. Lillicrap et al. (ICLR 2016): DDPG, Continuous control with deep reinforcement learning
5. David Silver, Aja Huang et al. (Nature 2016): AlphaGo, Mastering the Game of Go with Deep Neural
Networks and Tree Search
6. David Silver et al. (Nature 2017): AlphaGo Zero, Mastering the game of Go without human knowledge
7. Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov. (ICLR 2016): Actor-Mimic: Deep Multitask and
Transfer Reinforcement Learning
Related Papers
41