
Reinforcement learning

Lecture 1: Introduction

Hado van Hasselt


hado@google.com

Advanced deep learning and reinforcement learning, UCL


January 18, 2018
Admin

I RL lectures: mostly Thursday 9-11am, some exceptions


I Check Moodle for updates
I Use Moodle for questions
I Grading: assignments
I Background material:
Reinforcement Learning: An Introduction, Sutton & Barto 2018
http://incompleteideas.net/book/the-book-2nd.html
Background for this lecture: chapters 1 and 3
Outline

What is reinforcement learning?

Core concepts

Agent components

Challenges in reinforcement learning


What is reinforcement learning?
Motivation

I First, automation of repeated physical solutions


I Industrial revolution (1750 - 1850) and Machine Age (1870 - 1940)
I Second, automation of repeated mental solutions
I Digital revolution (1960 - now) and Information Age
I Next step: allow machines to find solutions themselves
I AI revolution (now - ????)
I This requires learning autonomously how to make decisions
What is Reinforcement Learning?

I We, and other intelligent beings, learn by interacting with our environment
I This differs from certain other types of learning
I It is active rather than passive
I Interactions are often sequential — future interactions can depend on earlier ones
I We are goal-directed
I We can learn without examples of optimal behaviour
The Interaction Loop
What is Reinforcement Learning?

There are (at least) two distinct reasons to learn:


1. Find previously unknown solutions
E.g., a program that can play Go better than any human, ever
2. Find solutions online, for unforeseen circumstances
E.g., a robot that can navigate terrains that differ greatly from any expected terrain

I Reinforcement learning seeks to provide algorithms for both cases


I Note that the second point is not (just) about generalization — it is about
learning efficiently online, during operation
What is Reinforcement Learning?

I Science of learning to make decisions from interaction


I This requires us to think about
I ...time
I ...(long-term) consequences of actions
I ...actively gathering experience
I ...predicting the future
I ...dealing with uncertainty
I Huge potential scope

RL = AI?
Related Disciplines
[Figure: a Venn diagram placing reinforcement learning at the intersection of several fields: machine learning (computer science), optimal control (engineering), reward systems (neuroscience), operations research (mathematics), classical/operant conditioning (psychology), and bounded rationality (economics).]
Branches of Machine Learning

[Figure: the three main branches of machine learning: supervised learning, unsupervised learning, and reinforcement learning.]
Characteristics of Reinforcement Learning

How does reinforcement learning differ from other machine learning paradigms?
I No supervision, only a reward signal
I Feedback can be delayed, not instantaneous
I Time matters
I Earlier decisions affect later interactions
Examples of decision problems

I Examples:
I Fly a helicopter
I Manage an investment portfolio
I Control a power station
I Make a robot walk
I Play video or board games
I These are all reinforcement learning problems
(no matter which solution method you use)
Video

Atari
Core concepts

Core concepts of a reinforcement learning system are:


I Environment
I Reward signal
I Agent, containing:
I Agent state
I Policy
I Value function (probably)
I Model (optionally)
Agent and Environment

I At each step t the agent:


I Receives observation Ot (and reward Rt )
I Executes action At
I The environment:
I Receives action At
I Emits observation Ot+1 (and reward Rt+1 )
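
As a minimal illustration of this loop, the sketch below assumes hypothetical `environment` and `agent` objects with gym-style `reset`, `step`, and `act` methods; these names are illustrative, not part of the lecture.

```python
# A minimal sketch of the agent-environment interaction loop (illustrative).
# Assumed interfaces (hypothetical): environment.reset() -> first observation,
# environment.step(a) -> (observation, reward, done), agent.act(obs, reward) -> action.

def run_episode(environment, agent, max_steps=1000):
    observation = environment.reset()        # O_0
    reward = 0.0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation, reward)                # agent executes A_t
        observation, reward, done = environment.step(action)   # env emits O_{t+1}, R_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward
```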
Rewards

I A reward Rt is a scalar feedback signal


I Indicates how well agent is doing at step t — defines the goal
I The agent’s job is to maximize cumulative reward

Gt = Rt+1 + Rt+2 + Rt+3 + ...

I We call this the return


Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
Any goal can be formalized as the outcome of maximizing a cumulative reward
Do you agree?
Values

I We call the expected cumulative reward, from a state s, the value

v (s) = E [Gt | St = s]
= E [Rt+1 + Rt+2 + Rt+3 + ... | St = s]

I Goal is then to maximize value, by picking suitable actions


I Rewards and values define desirability of a state or action (no supervised feedback)
I Note that returns and values can be defined recursively

Gt = Rt+1 + Gt+1
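
As a small worked example, the recursion Gt = Rt+1 + γGt+1 gives a one-pass way to compute returns from a list of rewards (the discount γ is only introduced later in the lecture; γ = 1 recovers the undiscounted sum above):

```python
def compute_return(rewards, discount=1.0):
    """G_t = R_{t+1} + discount * G_{t+1}, computed by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + discount * g
    return g

print(compute_return([1.0, 0.0, 2.0]))       # 3.0   (undiscounted)
print(compute_return([1.0, 0.0, 2.0], 0.9))  # 2.62  (= 1 + 0.9*0 + 0.81*2)
```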
Actions in sequential problems

I Goal: select actions to maximise value


I Actions may have long term consequences
I Reward may be delayed
I It may be better to sacrifice immediate reward to gain more long-term reward
I Examples:
I A financial investment (may take months to mature)
I Refueling a helicopter (might prevent a crash in several hours)
I Blocking opponent moves (might help winning chances many moves from now)
I A mapping from states to actions is called a policy
Action values

I It is possible to condition the value on actions:

q(s, a) = E [Gt | St = s, At = a]
= E [Rt+1 + Rt+2 + Rt+3 + ... | St = s, At = a]

I We will talk in depth about state and action values later


Agent components

Agent components
I Agent state
I Policy
I Value function
I Model
State

I Actions depend on the state of the agent


I Both agent and environment may have an internal state
I In the simplest case, there is only one state (next lecture)
I Often, there are many different states — sometimes infinitely many
I The state of the agent generally differs from the state of the environment
I The agent may not even know the full state of the environment
Environment State

I The environment state is the environment’s


internal state
I It is not usually visible to the agent
I Even if it is visible, it may contain lots of
irrelevant information
Agent State

I A history is a sequence of observations, actions, rewards

Ht = O0 , A0 , R1 , O1 , ..., Ot−1 , At−1 , Rt , Ot

I For instance, the sensorimotor stream of a robot


I This history can be used to construct an agent state St
I Actions depend on this state
Fully Observable Environments

Full observability:
Suppose the agent sees the full environment state
I observation = environment state
I The agent state could just be this observation:

St = Ot = environment state

I Then the agent is in a Markov decision process


Markov decision processes
Markov decision processes (MDPs) provide a useful mathematical framework
Definition
A decision process is Markov if

p(r, s | St, At) = p(r, s | Ht, At)

I “The future is independent of the past given the present”

Ht → St → Ht+1

I Once the state is known, the history may be thrown away


I The environment state is typically Markov
I The history Ht is Markov
Partially Observable Environments

I Partial observability: The agent gets partial information


I A robot with camera vision isn’t told its absolute location
I A poker playing agent only observes public cards
I Now the observation is not Markov
I Formally this is a partially observable Markov decision process (POMDP)
I The environment state can still be Markov, but the agent does not know it
Agent State

I The agent state is a function of the history


I The agent’s action depends on its state
I For instance, St = Ot
I More generally:

St+1 = f (St , At , Rt+1 , Ot+1 )

where f is a 'state update function'


I The agent state is typically much smaller than
the environment state
Agent State
The full environment state of a maze
Agent State
A potential observation
Agent State
An observation in a different location
Agent State
The two observations are indistinguishable
Agent State
These two states are not Markov

How could you construct a Markov agent state in this maze (for any reward signal)?
Partially Observable Environments

I To deal with partial observability, agent can construct suitable state


representations
I Examples of agent states:
I Last observation: St = Ot (might not be enough)
I Complete history: St = Ht (might be too large)
I Some incrementally updated state: St = f(St−1, Ot)
(E.g., implemented with a recurrent neural network; sometimes called 'memory'.)
I Constructing a Markov agent state may not be feasible; this is common!
I More importantly, the state should contain enough information for good policies and/or good value predictions
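
One simple, concrete choice for the state-update function f from the previous slides is to keep a sliding window of the last k observations; this is only an illustrative sketch, not the lecture's prescription.

```python
from collections import deque

class WindowState:
    """Agent state = the last k observations, updated incrementally:
    S_{t+1} = f(S_t, A_t, R_{t+1}, O_{t+1}), where f drops the oldest observation."""

    def __init__(self, k):
        self.window = deque(maxlen=k)

    def update(self, action, reward, observation):
        # Actions and rewards could be stored too; here only observations are kept.
        self.window.append(observation)
        return tuple(self.window)   # hashable, so it can index a table of values
```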
Agent components

Agent components
I Agent state
I Policy
I Value function
I Model
Policy

I A policy defines the agent’s behaviour


I It is a map from agent state to action
I Deterministic policy: A = π(S)
I Stochastic policy: π(A|S) = p (A|S)
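
In the tabular case both kinds of policy can be represented directly; a minimal sketch (the states and actions below are made up for illustration):

```python
import random

# Deterministic policy: A = pi(S), here as a simple lookup table.
deterministic_policy = {"s0": "left", "s1": "right"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: pi(A|S) = p(A|S), a distribution over actions per state.
stochastic_policy = {"s0": {"left": 0.8, "right": 0.2},
                     "s1": {"left": 0.5, "right": 0.5}}

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs)[0]
```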
Agent components

Agent components
I Agent state
I Policy
I Value function
I Model
Value Function

I The actual value function is the expected return

vπ(s) = E[Gt | St = s, π]
      = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s, π]

I We introduced a discount factor γ ∈ [0, 1]


I Trades off importance of immediate vs long-term rewards
I The value depends on a policy
I Can be used to evaluate the desirability of states
I Can be used to select between actions
Value Functions
I The return has a recursive form Gt = Rt+1 + γGt+1
I Therefore, the value has as well

vπ(s) = E[Rt+1 + γGt+1 | St = s, At ∼ π(s)]
      = E[Rt+1 + γvπ(St+1) | St = s, At ∼ π(s)]

Here a ∼ π(s) means a is chosen by policy π in state s (even if π is deterministic)


I This is known as a Bellman equation (Bellman 1957)
I A similar equation holds for the optimal (= highest possible) value:

v∗(s) = max_a E[Rt+1 + γv∗(St+1) | St = s, At = a]

This does not depend on a policy


I We heavily exploit such equalities, and use them to create algorithms
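
To give a flavour of how such equalities become algorithms (treated properly in later lectures), the sketch below repeatedly applies the Bellman equation for vπ on a tiny, made-up two-state MDP until the values stop changing:

```python
# Iterative policy evaluation on a tiny, made-up MDP (illustrative sketch only).
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
policy = {s: {"stay": 0.5, "go": 0.5} for s in transitions}   # uniform random policy
gamma = 0.9

v = {s: 0.0 for s in transitions}
for _ in range(1000):
    # v_pi(s) <- sum_a pi(a|s) sum_{s'} p(s'|s,a) [r + gamma * v_pi(s')]
    v = {s: sum(pi_a * sum(p * (r + gamma * v[s2]) for p, s2, r in transitions[s][a])
                for a, pi_a in policy[s].items())
         for s in transitions}
print(v)
```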
Value Function approximations

I Agents often approximate value functions


I We will discuss algorithms to learn these efficiently
I With an accurate value function, we can behave optimally
I With suitable approximations, we can behave well, even in intractably big domains
Agent components

Agent components
I Agent state
I Policy
I Value function
I Model
Model

I A model predicts what the environment will do next


I E.g., P predicts the next state

P(s, a, s') ≈ p(St+1 = s' | St = s, At = a)


I E.g., R predicts the next (immediate) reward

R(s, a) ≈ E [Rt+1 | St = s, At = a]

I A model does not immediately give us a good policy - we would still need to plan
I We could also consider stochastic (generative) models
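
A very simple way to obtain such a model is to estimate P and R from counts of experienced transitions; the sketch below is illustrative and assumes each (s, a) pair is only queried after it has been visited at least once.

```python
from collections import defaultdict

class TabularModel:
    """Estimate P(s, a, s') and R(s, a) from observed transitions (illustrative)."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s, a) -> total reward
        self.visits = defaultdict(int)                            # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_prob(self, s, a, s_next):   # approximates P(s, a, s')
        return self.next_counts[(s, a)][s_next] / self.visits[(s, a)]

    def expected_reward(self, s, a):           # approximates R(s, a)
        return self.reward_sum[(s, a)] / self.visits[(s, a)]
```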
Maze Example

[Figure: a maze with a Start location and a Goal location.]

I Rewards: -1 per time-step
I Actions: N, E, S, W
I States: Agent's location
Maze Example: Policy

[Figure: the maze with an arrow in each cell showing the action the policy takes there.]

I Arrows represent policy π(s) for each state s


Maze Example: Value Function

[Figure: the maze annotated with the value vπ(s) in each cell, ranging from -24 in the cell furthest from the Goal down to -1 in the cell next to the Goal.]

I Numbers represent value vπ(s) of each state s


Maze Example: Model

[Figure: the agent's model of the maze, shown as a partial grid layout with -1 in each cell the agent has learned about.]

I Grid layout represents the (partial) transition model P^a_ss'
I Numbers represent the immediate reward R^a_ss' from each state s (same for all a and s' in this case)
Categorizing agents

I Value Based
I No Policy (Implicit)
I Value Function
I Policy Based
I Policy
I No Value Function
I Actor Critic
I Policy
I Value Function
Categorizing agents

I Model Free
I Policy and/or Value Function
I No Model
I Model Based
I Optionally Policy and/or Value Function
I Model
Agent Taxonomy

[Figure: agent taxonomy, a Venn diagram with value-based, actor-critic, and policy-based agents along one dimension, and model-free versus model-based agents along the other.]
Challenges in reinforcement learning
Learning and Planning

Two fundamental problems in reinforcement learning


I Learning:
I The environment is initially unknown
I The agent interacts with the environment
I Planning:
I A model of the environment is given
I The agent plans in this model (without external interaction)
I a.k.a. reasoning, pondering, thought, search, planning
Prediction and Control

I Prediction: evaluate the future (for a given policy)


I Control: optimize the future (find the best policy)
I These are strongly related:

π∗(s) = argmax_π vπ(s)

I If we could predict everything, do we need anything else?


Learning the components of an agent

I All components are functions


I Policies map states to actions
I Value functions map states to values
I Models map states to states and/or rewards
I State updates map states and observations to new states
I We could represent these functions as neural networks, then use deep learning
methods to optimize these
I Take care: we often violate assumptions from supervised learning (iid, stationarity)
I Deep reinforcement learning is a rich and active research field
I (Current) neural networks are not always the best tool (but they often work well)
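
As one example, a value function could be represented by a small neural network mapping a state vector to a scalar value; a rough PyTorch sketch, assuming (hypothetically) a 4-dimensional state:

```python
import torch
from torch import nn

# A small value network v(s): 4-dimensional state vector -> scalar value (illustrative).
value_net = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

state = torch.randn(4)        # a made-up state vector
value = value_net(state)      # predicted value v(s)
# The parameters would be trained with a standard deep-learning optimizer
# (e.g. torch.optim.Adam) on targets derived from rewards, as covered in later lectures.
```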
Atari Example: Reinforcement Learning

[Figure: the agent selects joystick actions at and receives screen observations ot and rewards rt (the game score).]

I Rules of the game are unknown
I Learn directly from interactive game-play
I Pick actions on joystick, see pixels and scores
Atari Example: Planning

I Rules of the game are known


I Can query emulator: perfect model
[Figure: a look-ahead search tree branching over the actions 'right' and 'left' from the current state.]

I If I take action a from state s:
I what would the next state be?
I what would the score be?
I Plan ahead to find optimal policy
I Later versions add noise, to break algorithms that rely on determinism
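
With a perfect and deterministic model, planning can be as simple as a depth-limited search over action sequences; a rough sketch, assuming a hypothetical `simulate(state, action) -> (next_state, reward)` function provided by the emulator:

```python
def plan(state, simulate, actions, depth=3, gamma=0.99):
    """Depth-limited look-ahead with a deterministic model (illustrative sketch).
    Returns the best first action and its estimated discounted return."""
    if depth == 0:
        return None, 0.0
    best_action, best_return = None, float("-inf")
    for a in actions:
        next_state, reward = simulate(state, a)            # query the perfect model
        _, future = plan(next_state, simulate, actions, depth - 1, gamma)
        total = reward + gamma * future
        if total > best_return:
            best_action, best_return = a, total
    return best_action, best_return
```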
Exploration and Exploitation

I We learn by trial and error


I The agent should discover a good policy
I ...from new experiences
I ...without sacrificing too much reward along the way
Exploration and Exploitation

I Exploration finds more information


I Exploitation exploits known information to maximise reward
I It is important to explore as well as exploit
I This is a fundamental problem that does not occur in supervised learning
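
A simple and common way to balance the two is ε-greedy action selection: exploit the action with the highest estimated value most of the time, and explore a random action with small probability ε. A minimal sketch (bandit algorithms are covered properly in a later lecture):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """action_values: dict mapping each action to its current value estimate."""
    if random.random() < epsilon:
        return random.choice(list(action_values))       # explore: random action
    return max(action_values, key=action_values.get)    # exploit: greedy action

# Example with made-up value estimates for a restaurant-selection problem
print(epsilon_greedy({"favourite": 4.2, "new_place": 0.0, "other": 3.1}))
```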
Examples

I Restaurant Selection
Exploitation Go to your favourite restaurant
Exploration Try a new restaurant
I Oil Drilling
Exploitation Drill at the best known location
Exploration Drill at a new location
I Game Playing
Exploitation Play the move you currently believe is best
Exploration Try a new strategy
Gridworld Example: Prediction

[Figure: (a) a 5x5 gridworld with special states A (reward +10, transition to A') and B (reward +5, transition to B'), and actions N, E, S, W; (b) the value of each cell under the uniform random policy:

  3.3   8.8   4.4   5.3   1.5
  1.5   3.0   2.3   1.9   0.5
  0.1   0.7   0.7   0.4  -0.4
 -1.0  -0.4  -0.4  -0.6  -1.2
 -1.9  -1.3  -1.2  -1.4  -2.0 ]
Reward is −1 when bumping into a wall, γ = 0.9

What is the value function for the uniform random policy?
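
The values in panel (b) can be reproduced (approximately) with iterative policy evaluation, assuming the standard dynamics of this gridworld (Sutton & Barto, Example 3.5): every action from A gives +10 and moves to A', every action from B gives +5 and moves to B', stepping off the grid gives −1 and leaves the state unchanged, and all other transitions give 0. A sketch under those assumptions:

```python
import numpy as np

# 5x5 gridworld, uniform random policy, gamma = 0.9 (assumed Sutton & Barto dynamics).
gamma = 0.9
A, A_prime = (0, 1), (4, 1)      # +10, A -> A'
B, B_prime = (0, 3), (2, 3)      # +5,  B -> B'
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # N, S, W, E

def step(state, move):
    if state == A:
        return A_prime, 10.0
    if state == B:
        return B_prime, 5.0
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < 5 and 0 <= c < 5:
        return (r, c), 0.0
    return state, -1.0           # bumped into a wall

v = np.zeros((5, 5))
for _ in range(1000):            # Bellman expectation backup for the random policy
    new_v = np.zeros((5, 5))
    for r in range(5):
        for c in range(5):
            for move in moves:
                (nr, nc), reward = step((r, c), move)
                new_v[r, c] += 0.25 * (reward + gamma * v[nr, nc])
    v = new_v
print(np.round(v, 1))            # e.g. the entry for state A comes out at about 8.8
```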


Gridworld Example: Control

[Figure: (a) the same gridworld; (b) the optimal value function v∗; (c) an optimal policy π∗. The optimal values are:

 22.0  24.4  22.0  19.4  17.5
 19.8  22.0  19.8  17.8  16.0
 17.8  19.8  17.8  16.0  14.4
 16.0  17.8  16.0  14.4  13.0
 14.4  16.0  14.4  13.0  11.7 ]
What is the optimal value function over all possible policies?
What is the optimal policy?
Course

I In this course, we discuss how to learn by interaction


I The focus is on understanding core principles and learning algorithms
Topics include
I Exploration, in bandits and in sequential problems
I Markov decision processes, and planning by dynamic programming
I Model-free prediction and control (e.g., Q-learning)
I Policy-gradient methods
I Challenges in deep reinforcement learning
I Integrating learning and planning
I Guest lectures by Vlad Mnih and David Silver
Video

Locomotion
