CSE 411: Machine Learning
Reinforcement Learning
Reinforcement Learning
[Diagram: agent-environment loop: the agent receives a state s and a reward r from the environment and sends back an action a]
▪ Basic idea:
▪ Receive feedback in the form of rewards
▪ Agent’s utility is defined by the reward function
▪ Must (learn to) act so as to maximize expected rewards
▪ All learning is based on observed samples of outcomes!
Example: Learning to Walk
[Videos: Initial | A Learning Trial | After Learning (1K Trials)]
[Kohl and Stone, ICRA 2004]
The Crawler!
Reinforcement Learning
▪ Still assume a Markov decision process (MDP):
▪ A set of states s ∈ S
▪ A set of actions (per state) A
▪ A model T(s,a,s’)
▪ A reward function R(s,a,s’)
▪ Still looking for a policy π(s)
▪ New twist: don’t know T or R
▪ I.e. we don’t know which states are good or what the actions do
▪ Must actually try actions and states out to learn
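A minimal Python sketch of this setting; the toy environment below, including its hidden transition probabilities and rewards, is an assumption made up for illustration, and the learner only ever sees the sampled (s, a, s’, r) tuples it returns:

```python
import random

# Hypothetical two-state environment: the agent cannot read T or R directly;
# it only observes the (s, a, s', r) samples returned by step().
class ToyEnv:
    def __init__(self):
        self._T = {('A', 'go'): [('B', 0.8), ('A', 0.2)],    # hidden transition model T
                   ('B', 'go'): [('A', 1.0)]}
        self._R = {('A', 'go', 'B'): +1.0, ('A', 'go', 'A'): 0.0,
                   ('B', 'go', 'A'): -1.0}                    # hidden reward function R
        self.state = 'A'

    def step(self, action):
        outcomes, weights = zip(*self._T[(self.state, action)])
        s_next = random.choices(outcomes, weights=weights)[0]
        r = self._R[(self.state, action, s_next)]
        s, self.state = self.state, s_next
        return s, action, s_next, r           # all the learner ever sees

env = ToyEnv()
experience = [env.step('go') for _ in range(5)]   # observed samples of outcomes
print(experience)
```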
Offline (MDPs) vs. Online (RL)
[Images: Offline Solution vs. Online Learning]
Model-Based Learning
▪ Model-Based Idea:
▪ Learn an approximate model based on experiences
▪ Solve for values as if the learned model were correct
▪ Step 1: Learn empirical MDP model
▪ Count outcomes s’ for each s, a
▪ Normalize to give an estimate of the transition model T(s,a,s’)
▪ Discover each reward R(s,a,s’) when we experience (s, a, s’)
▪ Step 2: Solve the learned MDP
▪ For example, use value iteration, as before
Example: Model-Based Learning
Input Policy π: gridworld with states A (top), B, C, D (middle row), E (bottom); assume γ = 1

Observed Episodes (Training):
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1;  A, exit, x, -10

Learned Model:
  T(s,a,s’): T(B, east, C) = 1.00, T(C, east, D) = 0.75, T(C, east, A) = 0.25, …
  R(s,a,s’): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +10, …
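Step 1 can be reproduced directly from the four training episodes above. A minimal Python sketch, where T_hat and R_hat are my own names for the estimated model:

```python
from collections import Counter, defaultdict

# Observed episodes from the example, as (s, a, s', r) transitions.
episodes = [
    [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('E', 'north', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('E', 'north', 'C', -1), ('C', 'east', 'A', -1), ('A', 'exit', 'x', -10)],
]

counts = defaultdict(Counter)   # counts[(s, a)][s'] = number of times s' was observed
R_hat = {}                      # R_hat[(s, a, s')] = observed reward
for episode in episodes:
    for s, a, s_next, r in episode:
        counts[(s, a)][s_next] += 1
        R_hat[(s, a, s_next)] = r

# Normalize the counts to get the estimated transition model T_hat(s, a, s').
T_hat = {(s, a, s_next): n / sum(c.values())
         for (s, a), c in counts.items() for s_next, n in c.items()}

print(T_hat[('C', 'east', 'D')])   # 0.75, as on the slide
print(T_hat[('C', 'east', 'A')])   # 0.25
print(R_hat[('D', 'exit', 'x')])   # 10
```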
Example: Expected Age
Goal: Compute expected age of cs188 students
Known P(A): E[A] = Σ_a P(a) · a
Without P(A), instead collect samples [a1, a2, … aN]
Unknown P(A), “Model Based”: estimate P̂(a) = num(a)/N, then E[A] ≈ Σ_a P̂(a) · a
  Why does this work? Because eventually you learn the right model.
Unknown P(A), “Model Free”: E[A] ≈ (1/N) Σ_i a_i
  Why does this work? Because samples appear with the right frequencies.
Model-Free Learning
Passive Reinforcement Learning
▪ Simplified task: policy evaluation
▪ Input: a fixed policy π(s)
▪ You don’t know the transitions T(s,a,s’)
▪ You don’t know the rewards R(s,a,s’)
▪ Goal: learn the state values
▪ In this case:
▪ Learner is “along for the ride”
▪ No choice about what actions to take
▪ Just execute the policy and learn from experience
▪ This is NOT offline planning! You actually take actions in the world.
Direct Evaluation
▪ Goal: Compute values for each state under π
▪ Idea: Average together observed sample values
▪ Act according to π
▪ Every time you visit a state, write down what the
sum of discounted rewards turned out to be
▪ Average those samples
▪ This is called direct evaluation
Example: Direct Evaluation
Input Policy π: gridworld with states A (top), B, C, D (middle row), E (bottom); assume γ = 1

Observed Episodes (Training):
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1;  A, exit, x, -10

Output Values: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
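A minimal Python sketch of direct evaluation on these four episodes (γ = 1), which reproduces the output values above:

```python
from collections import defaultdict

gamma = 1.0
episodes = [
    [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('E', 'north', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('E', 'north', 'C', -1), ('C', 'east', 'A', -1), ('A', 'exit', 'x', -10)],
]

returns = defaultdict(list)     # returns[s] = observed discounted returns starting at s
for episode in episodes:
    G = 0.0
    # Walk the episode backwards so G accumulates the return from each visited state.
    for s, a, s_next, r in reversed(episode):
        G = r + gamma * G
        returns[s].append(G)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)   # V(A) = -10, V(B) = 8, V(C) = 4, V(D) = 10, V(E) = -2
```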
Problems with Direct Evaluation
▪ What’s good about direct evaluation?
▪ It’s easy to understand
▪ It doesn’t require any knowledge of T, R
▪ It eventually computes the correct average values, using just sample transitions
▪ What’s bad about it?
▪ It wastes information about state connections
▪ Each state must be learned separately
▪ So, it takes a long time to learn
(Output values from the example: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2.
If B and E both go to C under this policy, how can their values be different?)
Why Not Use Policy Evaluation?
▪ Simplified Bellman updates calculate V for a fixed policy:
▪ Each round, replace V with a one-step-look-ahead layer over V:
  V^π_0(s) = 0
  V^π_{k+1}(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_k(s’) ]
▪ This approach fully exploited the connections between the states
▪ Unfortunately, we need T and R to do it!
▪ Key question: how can we do this update to V without knowing T and R?
▪ In other words, how do we take a weighted average without knowing the weights?
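For concreteness, a sketch of this exact policy-evaluation update, which only works because a model is available; here it uses the learned T and R estimates from the earlier model-based example (the dictionary layout and the policy encoding are my own):

```python
gamma = 1.0
policy = {'A': 'exit', 'B': 'east', 'C': 'east', 'D': 'exit', 'E': 'north'}
# Learned model from the model-based example, written out as dictionaries.
T_hat = {('B', 'east'): {'C': 1.0}, ('C', 'east'): {'D': 0.75, 'A': 0.25},
         ('E', 'north'): {'C': 1.0},
         ('A', 'exit'): {'x': 1.0}, ('D', 'exit'): {'x': 1.0}}
R_hat = {('B', 'east', 'C'): -1, ('C', 'east', 'D'): -1, ('C', 'east', 'A'): -1,
         ('E', 'north', 'C'): -1, ('A', 'exit', 'x'): -10, ('D', 'exit', 'x'): +10}

V = {s: 0.0 for s in ['A', 'B', 'C', 'D', 'E', 'x']}     # V_0 = 0; 'x' is terminal
for _ in range(50):                                       # repeated one-step lookahead
    V_next = dict(V)
    for s, a in policy.items():
        V_next[s] = sum(p * (R_hat[(s, a, s2)] + gamma * V[s2])
                        for s2, p in T_hat[(s, a)].items())
    V = V_next

print(V)   # e.g. V(C) = 0.75*(-1 + 10) + 0.25*(-1 - 10) = 4.0 under the learned model
```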
Sample-Based Policy Evaluation?
▪ We want to improve our estimate of V by computing these averages:
  V^π_{k+1}(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_k(s’) ]
▪ Idea: Take samples of outcomes s’ (by doing the action!) and average:
  sample_i = R(s, π(s), s’_i) + γ V^π_k(s’_i)
  V^π_{k+1}(s) ← (1/n) Σ_i sample_i
▪ Almost! But we can’t rewind time to get sample after sample from state s.
Temporal Difference Learning
▪ Big idea: learn from every experience!
▪ Update V(s) each time we experience a transition (s, a, s’, r)
▪ Likely outcomes s’ will contribute updates more often
▪ Temporal difference learning of values
▪ Policy still fixed, still doing evaluation!
▪ Move values toward value of whatever successor occurs: running average
  Sample of V(s):  sample = R(s, π(s), s’) + γ V^π(s’)
  Update to V(s):  V^π(s) ← (1 - α) V^π(s) + α · sample
  Same update:     V^π(s) ← V^π(s) + α (sample - V^π(s))
Exponential Moving Average
▪ Exponential moving average
▪ The running interpolation update:  x̄_n = (1 - α) · x̄_{n-1} + α · x_n
▪ Makes recent samples more important:
  x̄_n = [ x_n + (1 - α) x_{n-1} + (1 - α)² x_{n-2} + … ] / [ 1 + (1 - α) + (1 - α)² + … ]
▪ Forgets about the past (distant past values were wrong anyway)
▪ Decreasing learning rate (alpha) can give converging averages
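A tiny numeric check of the running interpolation update (the sample stream is arbitrary); it also shows the unrolled form, in which each older sample is discounted by another factor of (1 - α):

```python
alpha = 0.5
samples = [10.0, 4.0, 6.0, 8.0]               # arbitrary sample stream

x_bar = 0.0
for x in samples:
    x_bar = (1 - alpha) * x_bar + alpha * x   # running interpolation update
print(x_bar)                                   # 6.625: recent samples dominate

# Same average, unrolled (starting from 0): newest sample first, geometric weights.
unrolled = alpha * sum((1 - alpha) ** k * x for k, x in enumerate(reversed(samples)))
print(unrolled)                                # 6.625, matches x_bar
```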
Example: Temporal Difference Learning
States: A (top), B, C, D (middle row), E (bottom); assume γ = 1, α = 1/2
Observed transitions: (B, east, C, -2), then (C, east, D, -2)
Values after each update (A and E stay at 0):
  Initially:               V(B) = 0,  V(C) = 0, V(D) = 8
  After (B, east, C, -2):  V(B) = -1, V(C) = 0, V(D) = 8
  After (C, east, D, -2):  V(B) = -1, V(C) = 3, V(D) = 8
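A minimal Python sketch of the TD update applied to these two transitions (γ = 1, α = 1/2), reproducing the table above:

```python
gamma, alpha = 1.0, 0.5
V = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 8.0, 'E': 0.0}   # initial values from the slide

transitions = [('B', 'east', 'C', -2), ('C', 'east', 'D', -2)]
for s, a, s_next, r in transitions:
    sample = r + gamma * V[s_next]                 # sample of V(s)
    V[s] = (1 - alpha) * V[s] + alpha * sample     # move V(s) toward the sample
    print(s, V[s])                                 # B -1.0, then C 3.0
```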
Problems with TD Value Learning
▪ TD value learning is a model-free way to do policy evaluation, mimicking
Bellman updates with running sample averages
▪ However, if we want to turn values into a (new) policy, we’re sunk:
  π(s) = argmax_a Q(s,a)
  Q(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
▪ Idea: learn Q-values, not values
▪ Makes action selection model-free too!
Active Reinforcement Learning
▪ Full reinforcement learning: optimal policies (like value iteration)
▪ You don’t know the transitions T(s,a,s’)
▪ You don’t know the rewards R(s,a,s’)
▪ You choose the actions now
▪ Goal: learn the optimal policy / values
▪ In this case:
▪ Learner makes choices!
▪ Fundamental tradeoff: exploration vs. exploitation
▪ This is NOT offline planning! You actually take actions in the world and find
out what happens…
Detour: Q-Value Iteration
▪ Value iteration: find successive (depth-limited) values
▪ Start with V_0(s) = 0, which we know is right
▪ Given V_k, calculate the depth k+1 values for all states:
  V_{k+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_k(s’) ]
▪ But Q-values are more useful, so compute them instead
▪ Start with Q_0(s,a) = 0, which we know is right
▪ Given Q_k, calculate the depth k+1 q-values for all q-states:
  Q_{k+1}(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_a’ Q_k(s’,a’) ]
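A sketch of repeated Q-value-iteration sweeps on a tiny made-up MDP (the two-state dynamics and rewards below are assumptions for illustration, not from the slides):

```python
gamma = 0.9
states, actions = ['s1', 's2'], ['stay', 'go']
# Hypothetical toy MDP with known T and R, given as explicit dictionaries.
T = {('s1', 'stay'): {'s1': 1.0}, ('s1', 'go'): {'s2': 1.0},
     ('s2', 'stay'): {'s2': 1.0}, ('s2', 'go'): {'s1': 1.0}}
R = {('s1', 'stay', 's1'): 0.0, ('s1', 'go', 's2'): 1.0,
     ('s2', 'stay', 's2'): 2.0, ('s2', 'go', 's1'): 0.0}

Q = {(s, a): 0.0 for s in states for a in actions}       # Q_0(s,a) = 0
for _ in range(100):                                      # repeated depth k+1 sweeps
    Q_next = {}
    for s in states:
        for a in actions:
            Q_next[(s, a)] = sum(
                p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T[(s, a)].items())
    Q = Q_next

print(round(max(Q[('s1', a)] for a in actions), 2))       # ~19.0 for this toy MDP
```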
Q-Learning
▪ Q-Learning: sample-based Q-value iteration
▪ Learn Q(s,a) values as you go
▪ Receive a sample (s,a,s’,r)
▪ Consider your old estimate: Q(s,a)
▪ Consider your new sample estimate:  sample = R(s,a,s’) + γ max_a’ Q(s’,a’)
▪ Incorporate the new estimate into a running average:
  Q(s,a) ← (1 - α) Q(s,a) + α · sample
[Demo: Q-learning – gridworld (L10D2)]
[Demo: Q-learning – crawler (L10D3)]
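A minimal Q-learning sketch on a small hypothetical environment (the environment dynamics, the ε-greedy exploration rule, and the constant learning rate are assumptions for illustration; the dynamics match the toy MDP used in the Q-value-iteration sketch above):

```python
import random
from collections import defaultdict

random.seed(0)
gamma, alpha, epsilon = 0.9, 0.1, 0.2
actions = ['stay', 'go']

# Hypothetical environment dynamics, hidden from the learner: it only sees samples.
def step(s, a):
    s_next = ('s2' if s == 's1' else 's1') if a == 'go' else s
    reward = {('s1', 'go', 's2'): 1.0, ('s2', 'stay', 's2'): 2.0}.get((s, a, s_next), 0.0)
    return s_next, reward

Q = defaultdict(float)                            # Q(s,a), initialized to 0
s = 's1'
for _ in range(20_000):
    # epsilon-greedy action selection (any sufficiently exploratory choice works)
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q[(s, act)])
    s_next, r = step(s, a)
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)   # new sample estimate
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample          # running average
    s = s_next

# Should approach Q(s2,'stay') ~ 20 and Q(s1,'go') ~ 19, matching Q-value iteration.
print({k: round(v, 1) for k, v in Q.items()})
```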
Q-Learning Properties
▪ Amazing result: Q-learning converges to the optimal policy, even if you’re acting suboptimally!
▪ This is called off-policy learning
▪ Caveats:
▪ You have to explore enough
▪ You have to eventually make the learning rate
small enough
▪ … but not decrease it too quickly
▪ Basically, in the limit, it doesn’t matter how you select actions (!)
References
▪ http://ai.berkeley.edu.