
Reinforcement learning

Episode 6

Policy gradient methods

1
Small experiment

The next slide contains a question

Please respond as fast as you can!

2
Small experiment

left or right?

Small experiment

Right! Ready for the next one?

Small experiment

What's Q(s, right) under gamma = 0.99?
Approximation error

DQN is trained to minimize

L ≈ E[(Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))²]

Simple 2-state world

            True   (A)   (B)
Q(s0,a0)       1     1     2
Q(s0,a1)       2     2     1
Q(s1,a0)       3     3     3
Q(s1,a1)     100    50   100

Trivia: Which prediction is better (A/B)?
Approximation error

DQN is trained to minimize

L ≈ E[(Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))²]

Simple 2-state world

            True   (A)   (B)
Q(s0,a0)       1     1     2
Q(s0,a1)       2     2     1
Q(s1,a0)       3     3     3
Q(s1,a1)     100    50   100

Answer: (A) gives the better policy, while (B) has the lower MSE.
Q-learning will prefer the worse policy (B)!
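A quick numeric check of the table above (a minimal numpy sketch, not from the original slides): it confirms that prediction (B) has the much smaller squared error yet picks the wrong action in s0.

```python
import numpy as np

q_true = np.array([[1, 2],     # Q(s0,a0), Q(s0,a1)
                   [3, 100]])  # Q(s1,a0), Q(s1,a1)
q_a = np.array([[1, 2], [3, 50]])
q_b = np.array([[2, 1], [3, 100]])

for name, q in [("A", q_a), ("B", q_b)]:
    mse = np.mean((q - q_true) ** 2)
    greedy = q.argmax(axis=1)                          # action picked in each state
    correct = bool(np.all(greedy == q_true.argmax(axis=1)))
    print(f"{name}: MSE={mse:.1f}, greedy policy correct: {correct}")
# A: MSE=625.0, greedy policy correct: True
# B: MSE=0.5,   greedy policy correct: False
```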
Conclusion

● Often computing Q-values is harder than picking optimal actions!

● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)

Trivia: what algorithm works that way? (of those we studied)

Answer: e.g. the crossentropy method
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]

12
how humans survived

π(run|s) = 1

13
Policies

In general, two kinds

● Deterministic policy: a = π_θ(s)
  – same action each time
  – e.g. genetic algos (week 0), deterministic policy gradient

● Stochastic policy: a ∼ π_θ(a|s)
  – sampling takes care of exploration
  – e.g. crossentropy method, policy gradient

Trivia: Any case where stochastic is better?
Answer: e.g. rock-paper-scissors

Trivia: how to represent the policy in a continuous action space?
Answer: categorical, normal, mixture of normals, whatever
Two approaches

● Value based:
  – Learn a value function Q_θ(s, a) or V_θ(s)
  – Infer the policy: π(a|s) = [a = argmax_a Q_θ(s, a)]

● Policy based:
  – Explicitly learn the policy π_θ(a|s) or π_θ(s) → a
  – Implicitly maximize reward over the policy
Recap: crossentropy method

● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate

  θ_{i+1} = θ_i + α·∇ Σ_i log π_θ(a_i|s_i)·[s_i, a_i ∈ Elite]

Trivia: Can we adapt it to discounted rewards (with γ)?

Answer (TD version): take as elite the (s, a) pairs with the highest R(s, a),
selecting elites independently for each state.
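A minimal sketch of this update for a linear softmax policy (plain numpy): `sessions` is assumed to be a list of (states, actions, total_reward) rollouts collected with the current policy, and picking elites by a return percentile is just one common choice.

```python
import numpy as np

def softmax_policy(theta, s):
    logits = s @ theta                       # theta: [n_features, n_actions]
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def cem_update(theta, sessions, percentile=70, alpha=0.1):
    returns = np.array([r for _, _, r in sessions])
    threshold = np.percentile(returns, percentile)
    grad = np.zeros_like(theta)
    for states, actions, r in sessions:
        if r < threshold:                    # keep elite sessions only
            continue
        for s, a in zip(states, actions):
            pi = softmax_policy(theta, s)
            onehot = np.eye(len(pi))[a]
            # d/d_theta log pi(a|s) for a linear softmax policy
            grad += np.outer(s, onehot - pi)
    return theta + alpha * grad              # ascend sum of log-probs of elite actions
```

In the lectures this is usually implemented by fitting a neural-network classifier to the elite (s, a) pairs; the linear version above just makes the gradient explicit.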
Policy gradient main idea

Why so complicated?
We'd rather simply maximize R over pi!

21
Objective

Expected reward (the R(z) setting):

J = E_{s∼p(s), a∼π_θ(a|s), ...} R(s, a, s', a', ...)

Expected discounted reward, where R(s, a) = r + γ·R(s', a'):

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s, a)

Here Q is the “true” Q-function of the current policy.
Objective

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s, a) = ∫_s p(s) ∫_a π_θ(a|s) Q(s, a) da ds

– p(s): state visitation frequency (may depend on the policy)
– π_θ(a|s): agent's policy
– Q(s, a): true action value, a.k.a. E[ R(s,a) ]

Trivia: how do we compute that?

Answer: sample N sessions and average:

J ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} Q(s, a)
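A minimal sketch of that Monte-Carlo estimate (the `sample_session` helper is hypothetical and stands in for one rollout under the current policy):

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Q(s_t, a_t) estimated as the discounted return from step t onwards."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def estimate_J(sample_session, n_sessions=100, gamma=0.99):
    total = 0.0
    for _ in range(n_sessions):
        rewards = sample_session()               # per-step rewards of one rollout
        total += sum(returns_to_go(rewards, gamma))
    return total / n_sessions                    # J ≈ 1/N * sum_i sum_t Q(s_t, a_t)
```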


Objective

Can we optimize the policy now? Not directly: in

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s, a) = ∫_s p(s) ∫_a π_θ(a|s) Q(s, a) da ds

the parameters θ “sit” inside the expectation (in π_θ, and possibly in p(s)),
so we don't yet know how to compute dJ/dθ.

Optimization

Finite differences
– Change the policy a little, evaluate:

  ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions

Trivia: any problems with those two?

Answer: finite differences are VERY noisy, especially if both J's are estimated
by sampling; the crossentropy method suffers from “quantile convergence”
problems on stochastic MDPs.
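For illustration, a minimal finite-difference sketch (the `evaluate_J` argument is a hypothetical sampled evaluator such as the estimator above); its sampling noise is exactly the problem mentioned:

```python
import numpy as np

def finite_difference_grad(theta, evaluate_J, eps=1e-2):
    """evaluate_J(theta) -> sampled estimate of expected return J."""
    grad = np.zeros_like(theta)
    base = evaluate_J(theta)
    for i in np.ndindex(*theta.shape):           # perturb one parameter at a time
        shifted = theta.copy()
        shifted[i] += eps
        grad[i] = (evaluate_J(shifted) - base) / eps
    return grad
```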
Objective

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s, a) = ∫_s p(s) ∫_a π_θ(a|s) Q(s, a) da ds

Wish list:
– Analytical gradient
– Easy/stable approximations
Log-derivative trick

Simple math (chain rule):

∇ log π(z) = (1 / π(z)) · ∇π(z)

π(z) · ∇ log π(z) = ∇π(z)
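A small numeric sanity check of this identity for a 3-action softmax distribution (not from the slides):

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])          # logits
z = 0                                       # outcome whose probability we differentiate

def pi(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = pi(theta)
grad_log_pi = np.eye(3)[z] - p              # d/d_theta log pi(z) for a softmax
lhs = p[z] * grad_log_pi                    # pi(z) * grad log pi(z)

eps = 1e-6                                  # finite-difference estimate of grad pi(z)
rhs = np.array([(pi(theta + eps * np.eye(3)[j])[z] - p[z]) / eps for j in range(3)])
print(np.allclose(lhs, rhs, atol=1e-4))     # True
```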
Policy gradient

Analytical derivation:

∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) Q(s, a) da ds

Apply the log-derivative trick, π·∇log π(z) = ∇π(z):

∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇log π_θ(a|s) Q(s, a) da ds

Trivia: anything curious about that formula?

Answer: that's an expectation :)

∇J = E_{s∼p(s), a∼π_θ(a|s)} ∇log π_θ(a|s) · Q(s, a)
Policy gradient (REINFORCE)

● Policy gradient:

∇J = E_{s∼p(s), a∼π_θ(a|s)} ∇log π_θ(a|s) · Q(s, a)

● Approximate with sampling:

∇J ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇log π_θ(a|s) · Q(s, a)
REINFORCE algorithm

● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇log π_θ(a|s) · Q(s, a)

  – Ascend: θ_{i+1} ← θ_i + α·∇J

Trivia: is it off- or on-policy?

Answer: actions are taken under the current policy, so it's on-policy.
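A minimal PyTorch sketch of one REINFORCE update, under these assumptions: a hypothetical `policy_net` that maps states to action logits, and batches of states, actions and discounted returns-to-go (`q_values`) gathered from the sampled sessions.

```python
import torch
import torch.nn.functional as F

def reinforce_step(policy_net, optimizer, states, actions, q_values):
    # states: [T, state_dim], actions: [T] (int64), q_values: [T] (float)
    logits = policy_net(states)
    log_probs = F.log_softmax(logits, dim=-1)                    # log pi(.|s)
    log_pi_a = log_probs[torch.arange(len(actions)), actions]    # log pi(a|s)

    # Maximize J  <=>  minimize -E[ log pi(a|s) * Q(s,a) ].
    # q_values is treated as a constant, so no gradient flows through it.
    loss = -(log_pi_a * q_values).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```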
Value-based vs policy-based

Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves the harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)

Policy-based:
● REINFORCE, CEM
● Solves the easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only

We'll learn much more soon!
REINFORCE baseline

Recall the policy-gradient estimate:

∇J ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇log π_θ(a|s) · Q(s, a)

What is better for learning: a random action in a good state,
or a great action in a bad state?

Q(s,a) = V(s) + A(s,a): actions influence A(s,a) only, so V(s) is irrelevant.
REINFORCE baseline

● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current π_θ(a|s)
  – Evaluate the policy gradient with a baseline b(s):

    ∇J ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇log π_θ(a|s) · (Q(s, a) − b(s))

b(s) can be anything that doesn't depend on the action; ideally b(s) = V(s).
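A minimal sketch of the same step with a baseline subtracted; here the baseline is just the batch-mean return, a crude stand-in for a learned V(s), and `policy_net` is the same hypothetical network as in the REINFORCE sketch.

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_step(policy_net, optimizer, states, actions, q_values):
    logits = policy_net(states)
    log_probs = F.log_softmax(logits, dim=-1)
    log_pi_a = log_probs[torch.arange(len(actions)), actions]

    baseline = q_values.mean()                    # crude b(s); ideally a learned V(s)
    advantages = q_values - baseline              # Q(s,a) - b(s)

    loss = -(log_pi_a * advantages.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```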
Actor-critic

● Learn both V(s) and π_θ(a|s)
● Hope for the best of both worlds :)
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

How can we estimate A(s,a) from (s, a, r, s') and the V-function?

A(s, a) = Q(s, a) − V(s)
Q(s, a) = r + γ·V(s')
A(s, a) = r + γ·V(s') − V(s)

(There is also an n-step version.)
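As a tiny sketch (the critic `V` is a hypothetical callable; terminal-state handling is added here even though the slide omits it):

```python
def advantage(V, s, r, s_next, gamma=0.99, done=False):
    # One-step advantage estimate: A(s,a) = r + gamma*V(s') - V(s).
    # When the episode ends there is no bootstrap term.
    target = r if done else r + gamma * V(s_next)
    return target - V(s)
```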
Advantage actor-critic

Use the advantage in the policy gradient, treating A(s, a) as a constant w.r.t. θ:

∇J_actor ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇log π_θ(a|s) · A(s, a)

Trivia: how do we train V then?
Advantage actor-critic

One network (parameters θ) takes state s and outputs both π_θ(a|s) and V_θ(s).

Improve the policy:

∇J_actor ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇log π_θ(a|s) · A(s, a)

Improve the value estimate:

L_critic ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} (V_θ(s) − [r + γ·V(s')])²
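A minimal PyTorch sketch of both objectives, assuming a hypothetical shared `model` that returns (action logits, state values) and one-step transition batches collected on-policy (`done` is a 0/1 float tensor):

```python
import torch
import torch.nn.functional as F

def a2c_losses(model, states, actions, rewards, next_states, done, gamma=0.99):
    logits, values = model(states)                 # values: [T]
    with torch.no_grad():
        _, next_values = model(next_states)
        targets = rewards + gamma * next_values * (1.0 - done)   # r + gamma*V(s')

    advantages = targets - values                  # A(s,a) = r + gamma*V(s') - V(s)

    log_probs = F.log_softmax(logits, dim=-1)
    log_pi_a = log_probs[torch.arange(len(actions)), actions]

    actor_loss = -(log_pi_a * advantages.detach()).mean()   # advantage as a constant
    critic_loss = F.mse_loss(values, targets)                # (V(s) - target)^2
    return actor_loss, critic_loss
```

In practice the two losses are summed (often with an entropy bonus, see the duct-tape slide below) and minimized in a single optimizer step.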
Continuous action spaces

What if there are continuously many actions?

● Robot control: set motor voltages
● Trading: assign money to equities

How does the algorithm change?

Answer: it doesn't :) Just plug in a different formula for π(a|s),
e.g. a normal distribution.
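A minimal sketch of such a Gaussian policy head (PyTorch): the network is assumed to output a mean and log-std per action dimension, and log π(a|s) plugs into the same policy-gradient formulas.

```python
import torch

def gaussian_log_prob(mean, log_std, actions):
    # mean, log_std, actions: [T, action_dim]
    dist = torch.distributions.Normal(mean, log_std.exp())
    return dist.log_prob(actions).sum(dim=-1)      # sum over action dimensions

def sample_action(mean, log_std):
    dist = torch.distributions.Normal(mean, log_std.exp())
    return dist.sample()                           # exploration comes from sampling
```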
Duct tape zone

● V(s) errors are less important than in Q-learning
  – the actor still learns even if the critic is random, just slower

● Regularize with entropy
  – to prevent premature convergence

● Learn on parallel sessions
  – or a super-small experience replay

● Use log-softmax for numerical stability
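A minimal sketch of the entropy and log-softmax tips, reusing the (hypothetical) logits / log-prob / advantage tensors from the actor-critic sketch above:

```python
import torch
import torch.nn.functional as F

def actor_loss_with_entropy(logits, log_pi_a, advantages, entropy_coef=0.01):
    log_probs = F.log_softmax(logits, dim=-1)          # numerically stable log pi
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # H[pi(.|s)], averaged over batch
    pg_loss = -(log_pi_a * advantages.detach()).mean()
    return pg_loss - entropy_coef * entropy            # higher entropy => lower loss
```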


Let's code!

57
