Reinforcement learning
Episode 6
Policy gradient methods
Small experiment
The next slide contains a question
Please respond as fast as you can!
Small experiment
Left or right?
Small experiment
Right! Ready for the next one?
Small experiment
What's Q(s,right) under gamma=0.99?
Approximation error
DQN is trained to minimize
L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]
Simple 2-state world
            True   (A)   (B)
Q(s0,a0)       1     1     2
Q(s0,a1)       2     2     1
Q(s1,a0)       3     3     3
Q(s1,a1)     100    50   100
Trivia: which prediction is better, (A) or (B)?
Approximation error
DQN is trained to minimize
L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]
Simple 2-state world
            True   (A)   (B)
Q(s0,a0)       1     1     2
Q(s0,a1)       2     2     1
Q(s1,a0)       3     3     3
Q(s1,a1)     100    50   100
(A) gives the better greedy policy, while (B) has the lower MSE.
Q-learning will prefer the worse policy (B)!
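A quick numeric check of this example (a minimal sketch assuming NumPy; the arrays simply restate the table above):

import numpy as np

# Rows are states s0, s1; columns are actions a0, a1 (values from the table above).
q_true = np.array([[1, 2], [3, 100]])
q_a    = np.array([[1, 2], [3,  50]])
q_b    = np.array([[2, 1], [3, 100]])

for name, q in [("A", q_a), ("B", q_b)]:
    mse = ((q - q_true) ** 2).mean()
    same_greedy = np.array_equal(q.argmax(axis=1), q_true.argmax(axis=1))
    print(name, "MSE:", mse, "greedy policy correct:", same_greedy)
# A: MSE = 625.0 but the correct greedy policy; B: MSE = 0.5 but a wrong greedy action in s0.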
Conclusion
● Often computing Q-values is harder than picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)
Trivia: which algorithm (of those we studied) works that way?
e.g. the crossentropy method
NOT how humans survived
argmax [ Q(s, pet the tiger),
         Q(s, run from tiger),
         Q(s, provoke tiger),
         Q(s, ignore tiger) ]
How humans survived
π(run | s) = 1
Policies
In general, two kinds:
● Deterministic policy: a = π_θ(s)
  – same action each time
  – used by genetic algorithms (week 0) and deterministic policy gradient
● Stochastic policy: a ∼ π_θ(a|s)
  – sampling takes care of exploration
  – used by the crossentropy method and policy gradient
Trivia: any case where stochastic is better?
e.g. rock-paper-scissors
Trivia: how to represent a policy in a continuous action space?
Categorical for discrete actions; normal, mixture of normals, or whatever fits for continuous ones.
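A minimal sketch of the two stochastic parametrizations mentioned above, assuming PyTorch; the class names and layer sizes are illustrative, not part of the slides:

import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Stochastic policy for a discrete action space: a ~ pi_theta(a|s)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, states):
        return torch.distributions.Categorical(logits=self.net(states))

class GaussianPolicy(nn.Module):
    """Stochastic policy for a continuous action space: a ~ N(mu(s), sigma)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                nn.Linear(64, action_dim))
        self.log_sigma = nn.Parameter(torch.zeros(action_dim))
    def forward(self, states):
        return torch.distributions.Normal(self.mu(states), self.log_sigma.exp())

# Sampling takes care of exploration:
# dist = policy(states); actions = dist.sample(); log_p = dist.log_prob(actions)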
Two approaches
● Value-based:
  Learn a value function Q_θ(s, a) or V_θ(s)
  Infer the policy: π(a|s) = [a = argmax_{a'} Q_θ(s, a')]
● Policy-based:
  Explicitly learn the policy π_θ(a|s) or π_θ(s) → a
  Implicitly maximize reward over the policy
Recap: crossentropy method
● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions
  – elite = take the M best sessions and concatenate them
  – θ_{i+1} = θ_i + α·∇ ∑_k log π_θ(a_k|s_k) · [(s_k, a_k) ∈ Elite]
Trivia: can we adapt it to discounted rewards (with γ)?
TD version: mark as elite the (s, a) pairs with the highest R(s, a),
selecting elites independently for each state.
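A minimal sketch of this update, assuming PyTorch and a policy object that returns a Categorical distribution (as in the earlier sketch); the session format and percentile threshold are illustrative assumptions:

import torch

def cem_update(policy, optimizer, sessions, percentile=70):
    # sessions: list of (states, actions, total_reward); states/actions are tensors
    rewards = torch.tensor([float(r) for _, _, r in sessions])
    threshold = torch.quantile(rewards, percentile / 100.0)
    elite = [(s, a) for (s, a, r) in sessions if r >= threshold]
    states = torch.cat([s for s, _ in elite])
    actions = torch.cat([a for _, a in elite])
    # maximize log pi_theta(a|s) over elite (state, action) pairs
    loss = -policy(states).log_prob(actions).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()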
Policy gradient: main idea
Why so complicated?
We'd rather simply maximize R over π!
Objective
Expected reward (the R(z) setting):
J = E_{s∼p(s), a∼π_θ(a|s), ...} [ R(s, a, s', a', ...) ]
Expected discounted reward, with R(s, a) = r + γ·R(s', a'):
J = E_{s∼p(s), a∼π_θ(a|s)} [ Q(s, a) ]
where Q is the "true" Q-function.
Objective
J = E_{s∼p(s), a∼π_θ(a|s)} [ Q(s, a) ] = ∫_s p(s) ∫_a π_θ(a|s) Q(s, a) da ds
Here p(s) is the state visitation frequency (it may depend on the policy),
π_θ(a|s) is the agent's policy, and Q(s, a) is the true action value, a.k.a. E[R(s, a)].
Trivia: how do we compute that?
Objective
J = E_{s∼p(s), a∼π_θ(a|s)} [ Q(s, a) ] = ∫_s p(s) ∫_a π_θ(a|s) Q(s, a) da ds
Approximate by sampling N sessions z_i:
J ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} Q(s, a)
Can we optimize the policy now?
Not yet: the parameters θ "sit" inside π_θ(a|s) under the integral,
and we don't know how to compute dJ/dθ from this sampled estimate.
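A minimal sketch of this sampled estimate, assuming a hypothetical helper sample_session(policy) that rolls out one episode and returns its list of per-step rewards (Q(s,a) is approximated by the discounted return-to-go actually collected):

import numpy as np

def estimate_J(policy, sample_session, n_sessions=100, gamma=0.99):
    """J ~ (1/N) * sum_i sum_{s,a in z_i} Q(s,a), with Q(s,a) replaced by the
    discounted return-to-go from that step onward."""
    totals = []
    for _ in range(n_sessions):
        rewards = sample_session(policy)
        g, returns_to_go = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g
            returns_to_go.append(g)
        totals.append(sum(returns_to_go))
    return np.mean(totals)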
Optimization
Finite differences
– Change the policy a little and evaluate: ∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization
– Good old crossentropy method: maximize the probability of "elite" actions
Trivia: any problems with those two?
Finite differences are VERY noisy, especially when both J's are estimated from samples;
the crossentropy method suffers from "quantile convergence" problems in stochastic MDPs.
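A minimal sketch of the finite-difference estimator, assuming a hypothetical evaluate_J(theta) that returns a (sampled, hence noisy) estimate of J:

import numpy as np

def finite_difference_grad(theta, evaluate_J, eps=1e-2):
    grad = np.zeros_like(theta, dtype=float)
    J_base = evaluate_J(theta)
    for k in range(len(theta)):
        theta_shift = theta.astype(float).copy()
        theta_shift[k] += eps
        # each coordinate needs its own evaluation; very noisy when J is sampled
        grad[k] = (evaluate_J(theta_shift) - J_base) / eps
    return grad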
Objective
J = E_{s∼p(s), a∼π_θ(a|s)} [ Q(s, a) ] = ∫_s p(s) ∫_a π_θ(a|s) Q(s, a) da ds
Wish list:
– An analytical gradient
– Easy, stable approximations
Log-derivative trick
Simple math (chain rule):
∇ log π(z) = (1/π(z)) · ∇π(z)
hence
π(z) · ∇ log π(z) = ∇π(z)
Policy gradient
Analytical derivation:
∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) Q(s, a) da ds
Apply the log-derivative trick, π·∇ log π = ∇π:
∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇ log π_θ(a|s) Q(s, a) da ds
Trivia: anything curious about that formula?
It is itself an expectation, so it can be estimated by sampling:
∇J = E_{s∼p(s), a∼π_θ(a|s)} [ ∇ log π_θ(a|s) · Q(s, a) ]
Policy gradient (REINFORCE)
● Policy gradient:
  ∇J = E_{s∼p(s), a∼π_θ(a|s)} [ ∇ log π_θ(a|s) · Q(s, a) ]
● Approximate with sampling:
  ∇J ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)
REINFORCE algorithm
● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current π_θ(a|s)
  – Evaluate the policy gradient:
    ∇J ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)
  – Ascend: θ_{i+1} ← θ_i + α·∇J
Trivia: is it off-policy or on-policy?
Actions are sampled under the current policy, so it is on-policy.
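A minimal sketch of one REINFORCE update, assuming PyTorch, a policy that returns a Categorical distribution, and a hypothetical sample_session(policy) helper returning the states, actions and rewards of one episode:

import torch

def reinforce_step(policy, optimizer, sample_session, n_sessions=10, gamma=0.99):
    losses = []
    for _ in range(n_sessions):
        states, actions, rewards = sample_session(policy)
        # estimate Q(s_t, a_t) by the discounted return-to-go G_t
        g, returns = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
        log_probs = policy(torch.as_tensor(states, dtype=torch.float32)).log_prob(
            torch.as_tensor(actions))
        # gradient ascent on J == gradient descent on -J
        losses.append(-(log_probs * returns).sum())
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()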
Value-based vs policy-based
Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only
We'll learn much more soon!
REINFORCE algorithm
● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current π_θ(a|s)
  – Evaluate the policy gradient:
    ∇J ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)
  – Ascend: θ_{i+1} ← θ_i + α·∇J
What is better for learning: a random action in a good state,
or a great action in a bad state?
REINFORCE baseline
● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current π_θ(a|s)
  – Evaluate the policy gradient:
    ∇J ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} ∇ log π_θ(a|s) · (Q(s, a) − b(s))
Since Q(s, a) = V(s) + A(s, a) and actions only influence A(s, a), V(s) is irrelevant to the gradient.
The baseline b(s) can be anything that doesn't depend on the action; ideally, b(s) = V(s).
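A minimal tweak to the REINFORCE sketch above (an assumption, not from the slides): use the batch-mean return as a constant baseline, the simplest b that does not depend on the action:

import torch

def reinforce_loss_with_baseline(log_probs, returns):
    """REINFORCE surrogate loss with a constant (batch-mean) baseline b."""
    baseline = returns.mean().detach()
    return -(log_probs * (returns - baseline)).sum()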
Actor-critic
● Learn both V(s) and π_θ(a|s)
● Hope for the best of both worlds :)
Advantage actor-critic
Idea: learn both π_θ(a|s) and V_θ(s), and use V_θ(s) to learn π_θ(a|s) faster!
How can we estimate A(s, a) from (s, a, r, s') and the V-function?
A(s, a) = Q(s, a) − V(s)
Q(s, a) = r + γ·V(s')
A(s, a) = r + γ·V(s') − V(s)
(There is also an n-step version.)
Advantage actor-critic
Idea: learn both π_θ(a|s) and V_θ(s), and use V_θ(s) to learn π_θ(a|s) faster!
A(s, a) = r + γ·V(s') − V(s)
∇J_actor ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)
(treat A(s, a) as a constant when differentiating w.r.t. θ)
Trivia: how do we train V then?
Advantage actor-critic
[Diagram: state s → model (params W) → two heads: π_θ(a|s) and V_θ(s)]
Improve the policy:
∇J_actor ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)
Improve the value estimate:
L_critic ≈ (1/N) ∑_{i=1}^{N} ∑_{s,a ∈ z_i} (V_θ(s) − [r + γ·V(s')])²
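A minimal advantage actor-critic sketch, assuming PyTorch; the two-headed network and the equal loss weighting are illustrative choices, and terminal-state handling is omitted for brevity:

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body (params W) with a policy head pi_theta(a|s) and a value head V_theta(s)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, n_actions)
        self.value_head = nn.Linear(128, 1)
    def forward(self, states):
        h = self.body(states)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_loss(model, states, actions, rewards, next_states, gamma=0.99):
    logits, values = model(states)
    with torch.no_grad():
        _, next_values = model(next_states)
        td_target = rewards + gamma * next_values          # r + gamma * V(s')
    advantage = (td_target - values).detach()              # A(s,a), constant for the actor
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    actor_loss = -(log_probs * advantage).mean()
    critic_loss = (values - td_target).pow(2).mean()       # (V(s) - [r + gamma*V(s')])^2
    return actor_loss + critic_loss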
Continuous action spaces
What if there are continuously many actions?
● Robot control: set motor voltages
● Trading: decide how much money to assign to each equity
How does the algorithm change?
It doesn't :) Just plug in a different parametrization for π(a|s), e.g. a normal distribution.
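A minimal sketch of that swap, assuming PyTorch: only log π(a|s) changes, the rest of the REINFORCE / actor-critic machinery stays the same (mu and log_sigma are assumed to come from the policy network):

import torch

def gaussian_log_prob(mu, log_sigma, actions):
    """log pi(a|s) for a diagonal-Gaussian policy; sum over action dimensions."""
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    return dist.log_prob(actions).sum(dim=-1)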
Duct tape zone
● V(s) errors are less important than in Q-learning
  – the actor still learns even if the critic is random, just slower
● Regularize with entropy
  – to prevent premature convergence
● Learn on parallel sessions
  – or a super-small experience replay
● Use log-softmax for numerical stability
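A minimal sketch of the last two tricks, assuming PyTorch; logits are the policy head outputs for a batch of states, and the entropy coefficient is an illustrative value:

import torch
import torch.nn.functional as F

def policy_terms(logits, actions, entropy_coef=0.01):
    log_probs_all = F.log_softmax(logits, dim=-1)            # log-softmax for numerical stability
    log_probs = log_probs_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    probs = log_probs_all.exp()
    entropy = -(probs * log_probs_all).sum(dim=-1).mean()    # entropy bonus vs premature convergence
    # add entropy_coef * entropy to the objective (i.e. subtract it from the loss)
    return log_probs, entropy_coef * entropy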
Let's code!