Birla Institute of Technology and Science Pilani, Hyderabad Campus
29.11.2024
BITS F464: Machine Learning (1st Sem 2024-25)
Introduction to Reinforcement Learning (RL)
Chittaranjan Hota, Sr. Professor
Dept. of Computer Sc. and Information Systems
hota@hyderabad.bits-pilani.ac.in
What is Reinforcement Learning?
• Agent tries to maximize the cumulative reward from the
environment by performing a set of actions.
[Image sources: https://highlandcanine.com/; UltraTech Cement Stock, 27th Nov 2024, from www.nseindia.com]
Applications: Gaming, Robotics, Autonomous vehicles, Personalized treatment, etc.
Formal Modelling: Markov Decision Process
Markov: The future state depends only on the present state, which encapsulates all the necessary information from the past.
What should the player ‘O’ do here to avoid a loss?
MDP Continued…
[Figure: agent-environment interaction loop. The Agent sends an action to the Environment and receives back a state/observation and a reward, over successive states St, St+1, St+2, …]
• Discounted Cumulative Reward: Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … = Σk γ^k Rt+k+1
• γ: Discount factor controlling future rewards.
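• A minimal sketch of the discounted cumulative reward in Python (the reward sequence and γ below are illustrative values, not from the slides):

# Discounted cumulative reward: G_t = sum_k gamma^k * R_{t+k+1}
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):      # r plays the role of R_{t+k+1}
        g += (gamma ** k) * r            # gamma^k shrinks the weight of later rewards
    return g

print(discounted_return([1, 0, 2, 3]))   # 1 + 0 + 0.81*2 + 0.729*3 = 4.807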
Q-Learning Algorithm
• Q-learning is a model-free reinforcement learning (RL) algorithm used to
learn the optimal policy for a Markov Decision Process (MDP)
[Q-Learning loop, rendered as steps]
1. Initialize Q-Table
2. Select an Action
3. Perform Action
4. Measure Reward
5. Update Q-Table: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
Repeat steps 2-5; after multiple Episodes, a good Q-Table is ready.
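• A minimal sketch of steps 1, 2 and 5 in Python (the ε-greedy action selection and the table sizes are illustrative assumptions; the environment step itself is problem-specific):

import numpy as np

n_states, n_actions = 5, 2                  # illustrative sizes
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = np.zeros((n_states, n_actions))         # 1. Initialize Q-Table

def select_action(s):
    # 2. Select an Action: explore randomly with probability epsilon, else exploit
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def update_q(s, a, r, s_next):
    # 5. Update Q-Table: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])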
An Example of Q-Learning
• Initializing the environment: States: {s0, s1, s2}, Actions: {a0, a1}, Rewards:
R(s0, a0) = -1, R(s0, a1) = +2, R(s1, a0) = +3, R(s1, a1) = +1, R(s2, any action) = 0
(terminal state).
• Transitions: T(s0, a0) → s1, T(s0, a1) → s2 (goal), T(s1, a0) → s2, T(s1, a1) → s0
• Parameters: α = 0.5, γ = 0.9, Initial Q-values (Q(s, a) = 0 for all s, a).
• Episode 1:
• current state: s0, action chosen: a0 (randomly using exploration), reward: R(s0,
a0) = -1, next state: s1.
• Update Q(s0, a0) using Bellman’s equation:
• Q(s0, a0) ← 0 + 0.5 * [-1 + 0.9 * max_a' Q(s1, a') - 0]
• Q(s0, a0) ← 0.5 * [-1 + 0] = -0.5 (since Q(s1, a') = 0 initially; no knowledge of s1 yet).
Ex. Continued…
Updated Q-values after 3 Episodes:
State    Action (a0)    Action (a1)
s0       -0.5           1.0
s1       1.5            0.0
s2       0.0            0.0
• Episode 2: From s1
• current state: s1, action chosen: a0, reward: R(s1, a0) = +3, next state: s2.
• Update Q(s1, a0) using Bellman’s equation:
  Q(s1, a0) ← Q(s1, a0) + α [R + γ max_a' Q(s2, a') - Q(s1, a0)]
• Q(s1, a0) ← 0 + 0.5 * [3 + 0.9 * 0 - 0] = 1.5
• Episode 3: Back to s0 (different action)
• current state: s0, action chosen: a1, reward: R(s0, a1) = +2, next state: s2.
• Update Q(s0, a1) using Bellman’s equation:
  Q(s0, a1) ← Q(s0, a1) + α [R + γ max_a' Q(s2, a') - Q(s0, a1)]
• Q(s0, a1) ← 0 + 0.5 * [2 + 0.9 * 0 - 0] = 1.0
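• The three hand-computed updates above can be verified with a few lines of Python (a minimal sketch; it encodes only the transitions actually visited in Episodes 1-3):

import numpy as np

alpha, gamma = 0.5, 0.9
Q = np.zeros((3, 2))                       # rows: s0, s1, s2; columns: a0, a1

# (state, action, reward, next_state) exactly as in Episodes 1-3 above
episodes = [(0, 0, -1, 1),                 # Episode 1: s0, a0, R = -1, next s1
            (1, 0, +3, 2),                 # Episode 2: s1, a0, R = +3, next s2
            (0, 1, +2, 2)]                 # Episode 3: s0, a1, R = +2, next s2

for s, a, r, s_next in episodes:
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

print(Q)   # [[-0.5  1. ]
           #  [ 1.5  0. ]
           #  [ 0.   0. ]]  -- matches the table above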
• Alternatively, you may use an ANN to learn Q-values: Deep Q-Learning (DQN)
Optimal Solution using Q-Learning: Maze
import numpy as np
import matplotlib.pyplot as plt

# Maze parameters: '2' is the diamond (goal state)
maze = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 2],
]
maze = np.array(maze)
Python Code Continued…
…
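• Since the rest of the maze code is elided above, here is a minimal sketch of how the Q-Table for this maze could be trained; the state encoding (cell index = row*5 + col), the four-move action set, the reward scheme, the start cell, and the hyper-parameters are illustrative assumptions, not the original code:

n_rows, n_cols = maze.shape
n_states, n_actions = n_rows * n_cols, 4              # actions: up, down, left, right
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
alpha, gamma, epsilon, n_episodes = 0.5, 0.9, 0.2, 500
Q = np.zeros((n_states, n_actions))

def step(r, c, a):
    # Apply action a; bumping into a wall (1) or the border leaves the agent in place.
    nr, nc = r + moves[a][0], c + moves[a][1]
    if not (0 <= nr < n_rows and 0 <= nc < n_cols) or maze[nr, nc] == 1:
        return r, c, -1.0, False                      # penalty for an invalid move
    if maze[nr, nc] == 2:
        return nr, nc, 10.0, True                     # reached the diamond (goal state)
    return nr, nc, -0.1, False                        # small step cost otherwise

for _ in range(n_episodes):
    r, c = 0, 0                                       # assumed start: top-left cell
    for _ in range(200):                              # cap on steps per episode
        s = r * n_cols + c
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        nr, nc, reward, done = step(r, c, a)
        Q[s, a] += alpha * (reward + gamma * np.max(Q[nr * n_cols + nc]) - Q[s, a])
        r, c = nr, nc
        if done:
            break

print(np.argmax(Q, axis=1).reshape(n_rows, n_cols))   # greedy action per cell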
Deep Q-Learning (DQN) for RL
• When the number of states and actions becomes very large, how do you scale?
• Solution: Combine Q-Learning and Deep Learning → Deep Q-Networks (DQN)
• Goal: Approximate a function: Q(s,a; θ), where θ represents the
trainable weights of the network
• Q(s, a) = r(s, a) + γ max_a' Q(s', a')   (Bellman’s equation)
• Cost = {Q(s, a; θ) - [r(s, a) + γ max_a' Q(s', a'; θ)]}²
[Figure: two ANN designs for approximating Q. Design 1: inputs (s, a) → a single output Q; inefficient, as we need more iterations (one forward pass per action). Design 2 (improved): input s → one output per action, Q(a1), Q(a2), Q(a3).]
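• A minimal sketch of the improved design and its cost in PyTorch (the layer sizes, optimizer, and single-transition update without a replay buffer or target network are illustrative assumptions, not from the slides):

import torch
import torch.nn as nn

state_dim, n_actions = 4, 3                  # illustrative sizes
gamma = 0.9

# Improved design: the network maps a state s to one Q-value per action.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    # Target: r + gamma * max_a' Q(s', a'; theta); no gradient flows through the target.
    with torch.no_grad():
        target = r + gamma * (0.0 if done else q_net(s_next).max().item())
    q_sa = q_net(s)[a]                       # predicted Q(s, a; theta)
    loss = (q_sa - target) ** 2              # Cost = {Q(s,a;θ) - [r(s,a) + γ max Q(s',a';θ)]}²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with dummy state vectors
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
print(dqn_update(s, a=1, r=1.0, s_next=s_next, done=False))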
Thank You!
Good luck for Comprehensive Exams!