Lecture Notes on Reinforcement Learning Basics
By
Dr. Adetokunbo MacGregor JOHN-OTUMU
1. Definition of Reinforcement Learning (RL)
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make
decisions by performing actions in an environment to maximize cumulative rewards. Unlike
supervised learning, RL does not require labeled input/output pairs and instead learns from the
consequences of actions.
2. Key Concepts in Reinforcement Learning
Agent:
• The learner or decision maker.
Environment:
• Everything the agent interacts with.
State (s):
• A representation of the current situation of the agent in the environment.
Action (a):
• Any decision or move the agent can make.
Reward (r):
• Feedback from the environment based on the action taken by the agent.
Policy (π):
• A strategy that defines the action the agent should take in a given state.
Value Function (V):
• A function that estimates the expected cumulative (discounted) reward obtainable from a given state when following a policy.
Q-Function (Q):
• A function that estimates the expected cumulative (discounted) reward obtainable after taking a given action in a given state.
Episode:
• A sequence of states, actions, and rewards, ending in a terminal state.
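To make these terms concrete, the following is a minimal sketch of one episode of agent-environment interaction. The environment (CoinFlipEnv), the random policy, and the discount factor of 0.9 are hypothetical choices made only for illustration; they are not part of the notes above.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # state: the current time step

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # feedback for the action taken
        self.t += 1
        done = self.t >= self.horizon  # terminal state reached
        return self.t, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action (here: uniformly at random)."""
    return rng.randint(0, 1)


def run_episode(env, policy, gamma=0.9, seed=1):
    """Run one episode; return the trajectory and its discounted return."""
    rng = random.Random(seed)
    state = env.reset()
    trajectory, ret, discount, done = [], 0.0, 1.0, False
    while not done:
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        ret += discount * reward  # accumulate the discounted reward
        discount *= gamma
        state = next_state
    return trajectory, ret


trajectory, ret = run_episode(CoinFlipEnv(), random_policy)
print(len(trajectory), ret)
```

The value function V(s) can be read as the average of such discounted returns over many episodes started from state s.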
3. Types of Reinforcement Learning
Model-Free vs. Model-Based:
• Model-Free: The agent learns directly from interactions without a model of the
environment.
• Model-Based: The agent builds a model of the environment and plans by simulating
actions.
Value-Based vs. Policy-Based:
• Value-Based: The agent learns the value function to determine the best action.
• Policy-Based: The agent directly learns the policy without using value functions.
On-Policy vs. Off-Policy:
• On-Policy: The agent learns the value of the policy it is currently following.
• Off-Policy: The agent learns the value of a target policy (often the optimal one) while following a different behavior policy to gather experience.
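The on-policy/off-policy distinction shows up in the learning target. The sketch below contrasts the target used by SARSA (a standard on-policy algorithm, not covered in these notes) with the Q-Learning target. The Q-table and the numbers are hypothetical.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: uses the greedy action, whatever is taken next."""
    return reward + gamma * max(Q[next_state])


Q = {"s1": [1.0, 3.0]}  # hypothetical Q-values for two actions in state s1

# Suppose the behavior policy explores and picks action 0 in s1.
on_policy = sarsa_target(Q, reward=0.5, next_state="s1", next_action=0, gamma=0.9)
off_policy = q_learning_target(Q, reward=0.5, next_state="s1", gamma=0.9)
print(on_policy, off_policy)
```

The two targets differ whenever the action actually taken in the next state is not the greedy one.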
4. Q-Learning Algorithm
Definition: Q-Learning is a model-free, off-policy reinforcement learning algorithm that seeks to
find the best action to take in each state. It iteratively updates Q-values (the action-value function)
using an update rule derived from the Bellman optimality equation.
Q-Learning Formula:
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
where:
• Q(s, a) is the current Q-value of state s and action a.
• α is the learning rate (0 < α ≤ 1).
• r is the reward received after taking action a in state s.
• γ is the discount factor (0 ≤ γ ≤ 1).
• max_{a′} Q(s′, a′) is the maximum predicted Q-value over all actions a′ in the next state s′.
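As a worked example, the sketch below applies the formula once. The numbers (current Q-value 2.0, reward 1.0, best next Q-value 4.0, α = 0.5, γ = 0.9) are made up for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-Learning update and return the new Q(s, a)."""
    td_target = reward + gamma * max_q_next  # r + γ max Q(s', a')
    td_error = td_target - q_sa              # how far the current estimate is off
    return q_sa + alpha * td_error           # move a fraction α toward the target


new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=4.0, alpha=0.5, gamma=0.9)
print(new_q)
```

Here the target is 1.0 + 0.9 × 4.0 = 4.6, the error is 4.6 − 2.0 = 2.6, and the updated value is 2.0 + 0.5 × 2.6 = 3.3 (up to floating-point rounding).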
Q-Learning Steps:
1. Initialize Q-values: Set Q(s, a) arbitrarily for all state-action pairs, often to zero.
2. For each episode:
   o Initialize the starting state s.
   o Repeat until the terminal state is reached:
      1. Choose an action a based on the current state s (using an exploration strategy
         like ε-greedy).
      2. Perform action a and observe the reward r and the next state s′.
      3. Update the Q-value Q(s, a) using the Q-Learning formula.
      4. Set s to s′.
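The steps above can be sketched as a short tabular implementation. The environment here is a hypothetical five-state chain in which the agent starts at the left end and receives a reward of 1 only on reaching the right end. The hyperparameters are illustrative assumptions, not recommended values.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal (hypothetical chain)
ACTIONS = [0, 1]  # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: reward 1 only on reaching the right end."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize all Q-values to zero.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):   # Step 2: for each episode
        state, done = 0, False  # initialize the starting state
        while not done:         # repeat until the terminal state is reached
            # 2.1: epsilon-greedy action choice (ties broken at random)
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(Q[state])
                action = rng.choice([a for a in ACTIONS if Q[state][a] == best])
            # 2.2: perform the action, observe reward and next state
            next_state, reward, done = step(state, action)
            # 2.3: Q-Learning update; a terminal state contributes no future value
            max_next = 0.0 if done else max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * max_next - Q[state][action])
            # 2.4: move to the next state
            state = next_state
    return Q


Q = train()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy should prefer moving right in every non-terminal state, and the learned Q-values for moving right should decay by roughly a factor of γ per step away from the goal.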
Exploration vs. Exploitation:
• Exploration: Trying new actions to discover their effects.
• Exploitation: Choosing actions that are known to yield high rewards.
ε-Greedy Strategy:
• With probability ε, choose a random action (exploration).
• With probability 1−ε, choose the action with the highest Q-value (exploitation).
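A minimal sketch of the ε-greedy rule follows. The function name and signature are illustrative choices; q_values is assumed to be the list of Q-values for the current state, one entry per action.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index given the Q-values of the current state."""
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
greedy = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
explored = {epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0, rng=rng) for _ in range(200)}
print(greedy, explored)
```

With ε = 0 the rule is purely greedy, and with ε = 1 it is purely random; practical values lie in between.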
5. Detailed Understanding of Deep Q-Learning (DQN)
Motivation: Traditional Q-Learning struggles with high-dimensional state spaces. Deep Q-
Learning addresses this by approximating the Q-value function using a neural network.
DQN Architecture:
• Input Layer: Represents the state.
• Hidden Layers: Multiple layers of neurons to capture complex patterns.
• Output Layer: Represents Q-values for each action.
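The architecture above can be sketched as a small fully connected network. The version below uses plain Python lists so that it runs without extra dependencies; in practice a deep learning library would be used. The layer sizes, the ReLU activation, and the weight initialization are assumptions made for illustration, and no training is shown.

```python
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine map: one output per row of the weight matrix."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def q_network(state, layers):
    """Forward pass: ReLU on hidden layers, linear output (one Q-value per action)."""
    h = state
    for layer in layers[:-1]:
        h = relu(dense(h, layer))
    return dense(h, layers[-1])


rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 2  # hypothetical sizes
layers = [
    init_layer(STATE_DIM, HIDDEN, rng),  # input -> hidden
    init_layer(HIDDEN, HIDDEN, rng),     # hidden -> hidden
    init_layer(HIDDEN, N_ACTIONS, rng),  # hidden -> Q-values
]
q_values = q_network([0.1, -0.2, 0.0, 0.3], layers)
greedy_action = max(range(N_ACTIONS), key=q_values.__getitem__)
print(q_values, greedy_action)
```

A single forward pass yields the Q-values of all actions at once, so choosing the greedy action needs only one evaluation of the network per state.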
DQN Algorithm Steps:
1. Experience Replay:
   o Store experiences (state, action, reward, next state) in a replay buffer.
   o Sample random mini-batches from the replay buffer to train the network, breaking the
     correlation between consecutive experiences.
2. Target Network:
   o Use a separate target network to stabilize training.
   o The target network's weights are periodically updated to match the main network's
     weights.
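Both mechanisms can be sketched in a few lines. The class and function names below are illustrative, the parameters are represented by a plain dictionary rather than a real network, and the extra done flag stored with each experience is an assumption added so that terminal transitions can be recognized later.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past experiences with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the ordering of consecutive experiences.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target(online_params):
    """Copy the main network's parameters into a separate target copy."""
    return copy.deepcopy(online_params)


buffer = ReplayBuffer(capacity=100)
for t in range(150):  # more pushes than capacity: only the latest 100 remain
    buffer.push(t, t % 2, 0.0, t + 1, False)
batch = buffer.sample(8)

online_params = {"w": [0.1, 0.2]}  # stand-in for the main network's weights
target_params = sync_target(online_params)
online_params["w"][0] = 0.5        # further training leaves the target copy unchanged
print(len(buffer), len(batch), target_params)
```

In a training loop, sync_target would be called every fixed number of steps, so that the targets change only occasionally while the main network is updated at every step.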
DQN Update Formula:
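Q(s, a; θ) ← Q(s, a; θ) + α [r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ)]
where:
• θ are the parameters of the Q-network.
• θ⁻ are the parameters of the target network.
In practice this update is not applied to individual table entries. Instead, θ is adjusted by gradient descent to reduce the squared difference between Q(s, a; θ) and the target r + γ max_{a′} Q(s′, a′; θ⁻).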
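The sketch below computes the targets and a mean squared error for a mini-batch. The two functions online_q and target_q are hypothetical stand-ins for the Q-network and the target network (each returns one Q-value per action), and the done flag in each experience is an assumption used to stop bootstrapping at terminal states. The gradient step itself is left to a deep learning library and is not shown.

```python
def dqn_targets(batch, target_q, gamma):
    """Targets r + gamma * max Q_target(s', a'); just r for terminal transitions."""
    targets = []
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(target_q(next_state))
        targets.append(reward + gamma * bootstrap)
    return targets


def mse_loss(batch, targets, online_q):
    """Mean squared difference between the targets and the predicted Q(s, a)."""
    errors = [y - online_q(s)[a] for (s, a, _, _, _), y in zip(batch, targets)]
    return sum(e * e for e in errors) / len(errors)


def online_q(state):
    """Hypothetical stand-in for the Q-network."""
    return [0.0, 1.0]


def target_q(state):
    """Hypothetical stand-in for the target network."""
    return [2.0, 0.5]


batch = [
    (0, 1, 1.0, 1, False),  # (state, action, reward, next_state, done)
    (1, 0, 0.0, 2, True),   # terminal transition
]
targets = dqn_targets(batch, target_q, gamma=0.5)
loss = mse_loss(batch, targets, online_q)
print(targets, loss)
```

For this batch the targets are 1.0 + 0.5 × 2.0 = 2.0 and 0.0, the prediction errors are 1.0 and 0.0, and the loss is therefore 0.5.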
Key Innovations in DQN:
• Experience Replay: Improves data efficiency and stabilizes training.
• Target Network: Holds the learning targets fixed between periodic updates, reducing the correlation between the Q-values being updated and their targets.
6. Applications of Q-Learning and DQN
Games:
• Atari games, Go, Chess, etc.
Robotics:
• Path planning, control tasks, etc.
Finance:
• Portfolio management, trading strategies, etc.
Healthcare:
• Personalized treatment strategies, medical diagnosis, etc.
Autonomous Systems:
• Self-driving cars, drones, etc.
7. Challenges in Reinforcement Learning
Exploration-Exploitation Trade-off:
• Balancing the need to explore new actions and exploit known rewards.
Credit Assignment Problem:
• Determining which actions are responsible for received rewards.
Sparse Rewards:
• Environments where rewards are infrequent and delayed.
Sample Efficiency:
• RL algorithms often require large amounts of data and interactions.
Stability and Convergence:
• Ensuring stable and convergent learning, especially in complex environments.
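As one illustration of the exploration-exploitation trade-off listed above, a common heuristic is to start with a high ε and reduce it as training progresses. The sketch below shows a linear schedule; the function name, the start and end values, and the number of decay steps are assumptions chosen for illustration.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    fraction = min(1.0, step / decay_steps)
    return eps_start + fraction * (eps_end - eps_start)


schedule = [linear_epsilon(s) for s in (0, 5_000, 10_000, 20_000)]
print(schedule)
```

Early in training the agent explores almost all the time; later it mostly exploits what it has learned while keeping a small amount of exploration.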
Conclusion
Reinforcement Learning is a powerful paradigm for training agents to make decisions in complex
environments. Q-Learning and its deep learning variant, DQN, have demonstrated remarkable
success in various domains. Understanding these algorithms' principles, workflows, and
applications is crucial for leveraging their full potential in solving real-world problems.