UNIT 5
What is Reinforcement Learning?
Reinforcement Learning is a branch of machine learning in which agents train
themselves through reward and punishment mechanisms. It is about taking the
best possible action, or path, to gain maximum reward and minimum punishment
through observations in a specific situation. The reward acts as a signal for
positive and negative behaviour. Essentially, an agent (or several agents) is
built that can perceive and interpret the environment in which it is placed,
take actions, and interact with it.
Basic Diagram of Reinforcement Learning – KDNuggets
Terminologies used in Reinforcement Learning
Agent – the sole decision-maker and learner
Environment – the world in which the agent learns and decides which actions
to perform
Action – the set of actions an agent can perform
State – the current situation of the agent in the environment
Reward – for each action selected by the agent, the environment returns a
reward; it is usually a scalar value and is nothing but feedback from the
environment
Policy – the strategy the agent uses to map situations (states) to actions
Value Function – the value of a state is the total reward the agent can
expect to accumulate, starting from that state and following the policy
Model – the agent's view of the environment, which maps state-action pairs
to probability distributions over next states; not every RL agent uses a
model of its environment
Reinforcement Learning Workflow
– Create the Environment
– Define the reward
– Create the agent
– Train and validate the agent
– Deploy the policy
Characteristics of Reinforcement Learning
– No supervision, only a real-valued reward signal
– Decision-making is sequential
– Time plays a major role in reinforcement problems
– Feedback is not immediate but delayed
– The data the agent receives next is determined by its own actions
We can break down reinforcement learning into five simple steps:
1. The agent is at state zero in an environment.
2. It will take an action based on a specific strategy.
3. It will receive a reward or punishment based on that action.
4. It learns from previous moves and optimizes its strategy.
5. The process will repeat until an optimal strategy is found.
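This loop can be sketched in a few lines of Python. The toy corridor
environment below is not from the text; it is only an illustrative stand-in so
the five steps have something concrete to act on.

    import random

    # A minimal sketch of the five-step loop, using a toy corridor environment
    # defined inline (illustrative only, not part of the original example).
    class Corridor:
        """The agent starts at cell 0 and must reach the last cell."""
        def __init__(self, length=5):
            self.length = length
            self.state = 0

        def reset(self):
            self.state = 0                          # step 1: the agent is at state zero
            return self.state

        def step(self, action):                     # action: 0 = left, 1 = right
            move = 1 if action == 1 else -1
            self.state = max(0, min(self.length - 1, self.state + move))
            done = self.state == self.length - 1
            reward = 1.0 if done else 0.0           # step 3: reward (1 at the goal, 0 otherwise)
            return self.state, reward, done

    env = Corridor()
    state = env.reset()
    done = False
    while not done:                                 # step 5: repeat until the episode ends
        action = random.choice([0, 1])              # step 2: act according to a strategy (random here)
        state, reward, done = env.step(action)      # steps 3-4: observe the reward; a learning agent
                                                    # would now update its strategy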
What is Q-Learning?
Q-learning is a model-free, value-based, off-policy algorithm that will find
the best series of actions based on the agent's current state. The “Q”
stands for quality. Quality represents how valuable the action is in
maximizing future rewards.
Model-based algorithms use transition and reward functions to estimate the
optimal policy and create a model of the environment. In contrast, model-free
algorithms learn the consequences of their actions through experience,
without using a transition or reward function.
Value-based methods train the value function to learn which states are more
valuable and act accordingly. Policy-based methods, on the other hand, train
the policy directly to learn which action to take in a given state.
An off-policy algorithm evaluates and updates a policy that differs from the
policy used to take actions. Conversely, an on-policy algorithm evaluates and
improves the same policy that is used to take actions.
Key Terminologies in Q-learning
Before we jump into how Q-learning works, we need to learn a few useful
terminologies to understand Q-learning's fundamentals.
• State (s): the current position of the agent in the environment.
• Action (a): a step taken by the agent in a particular state.
• Rewards: for every action, the agent receives a reward or a penalty.
• Episodes: an episode ends when the agent can take no further actions,
either because it has reached the goal or because it has failed.
• Q(St+1, a): the expected optimal Q-value of taking an action in the next
state.
• Q(St, At): the current Q-value estimate for the state-action pair, which
is updated toward Q(St+1, a).
• Q-Table: a table of states and actions maintained by the agent, holding
the Q-value for each pair.
• Temporal Difference (TD): used to estimate the expected value of
Q(St+1, a) from the current state and action and the previous state and
action.
How Does Q-Learning Work?
We will learn in detail how Q-learning works by using the example of a
frozen lake. In this environment, the agent must cross the frozen lake from
the start to the goal, without falling into the holes. The best strategy is to
reach the goal by taking the shortest path.
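The text does not name a specific software environment; if you want to
experiment with this setting yourself, one option (an assumption on our part)
is the FrozenLake-v1 environment from the Gymnasium library:

    import gymnasium as gym

    # Assumes the Gymnasium package is installed; the library choice is ours,
    # not the text's. is_slippery=False gives a deterministic 4x4 lake.
    env = gym.make("FrozenLake-v1", is_slippery=False)

    obs, info = env.reset(seed=0)                   # start state
    action = env.action_space.sample()              # a random move
    obs, reward, terminated, truncated, info = env.step(action)
    print(obs, reward, terminated)                  # next state, reward (1 only at the goal), done flag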
Q-Table
The agent will use a Q-table to take the best possible action based on the
expected reward for each state in the environment. In simple words, a Q-
table is a data structure of sets of actions and states, and we use the Q-
learning algorithm to update the values in the table.
Q-Function
The Q-function uses the Bellman equation and takes a state (s) and an
action (a) as input. The Bellman equation simplifies the calculation of
state values and state-action values.
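Written out with the notation defined earlier, the standard one-step
Q-learning update based on the Bellman equation is:

    Q(St, At) <- Q(St, At) + α [ Rt+1 + γ max_a Q(St+1, a) - Q(St, At) ]

where α is the learning rate, γ is the discount rate, Rt+1 is the immediate
reward, and the term in brackets is the Temporal Difference error. This is
the textbook form of the update; the symbols follow the terminology above.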
Q-learning algorithm
Initialize Q-Table
We will first initialize the Q-table. We will build the table with columns
based on the number of actions and rows based on the number of states. In
our example, the character can move up, down, left, and right, so we have
four possible actions and four states (start, idle, wrong path, and end).
You can also treat the wrong path as falling into the hole. We will
initialize the Q-table with all values set to 0.
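As a sketch, the same initialization in Python with NumPy might look like
this (the state and action counts follow the example above; the variable
names are ours):

    import numpy as np

    # One row per state, one column per action, every value set to 0.
    n_states, n_actions = 4, 4      # start, idle, wrong path, end; up, down, left, right
    q_table = np.zeros((n_states, n_actions))
    print(q_table)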
Choose an Action
The second step is quite simple. At the start, the agent will choose a
random action (down or right), and on subsequent runs it will use the
updated Q-table to select the action.
Perform an Action
Choosing an action and performing the action will repeat multiple times
until the training loop stops. The first action and state are selected using
the Q-Table. In our case, all values of the Q-Table are zero.
Then, the agent will move down and update the Q-table using the Bellman
equation. With every move, we update the values in the Q-table and also use
it to determine the best course of action.
Initially, the agent is in exploration mode and chooses a random action to
explore the environment. The Epsilon Greedy Strategy is a simple method to
balance exploration and exploitation: epsilon is the probability of choosing
to explore, and the agent exploits when the chance of exploring is small.
At the start, the epsilon rate is higher, meaning the agent is in
exploration mode. As it explores the environment, epsilon decreases and the
agent starts to exploit it. During exploration, with every iteration, the
agent becomes more confident in its estimates of the Q-values.
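A minimal epsilon-greedy sketch is shown below; the function and the
parameter values are illustrative assumptions, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    def choose_action(q_table, state, epsilon):
        """Explore with probability epsilon, otherwise exploit the best known action."""
        if rng.random() < epsilon:                  # explore: pick a random action
            return int(rng.integers(q_table.shape[1]))
        return int(np.argmax(q_table[state]))       # exploit: pick the highest Q-value

    epsilon, epsilon_min, decay = 1.0, 0.05, 0.995  # example values only
    # after every episode: epsilon = max(epsilon_min, epsilon * decay)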
In the frozen lake example, the agent is unaware of the environment, so it
takes a random action (move down) to start. The Q-table is then updated
using the Bellman equation.
Measuring the Rewards
After taking the action, we will measure the outcome and the reward.
• The reward for reaching the goal is +1
• The reward for taking the wrong path (falling into the hole) is 0
• The reward for Idle or moving on the frozen lake is also 0.
Update Q-Table
We will update Q(St, At) using the update equation. It combines the previous
episode's estimated Q-values, the learning rate, and the Temporal Difference
error. The Temporal Difference error is calculated from the immediate
reward, the discounted maximum expected future reward, and the former
Q-value estimate.
The process is repeated multiple times until the Q-table converges and the
Q-value function is maximized.
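As a sketch, the update described above can be written as follows
(hyperparameter values are illustrative, not from the text):

    import numpy as np

    alpha, gamma = 0.1, 0.9         # learning rate and discount rate (example values)

    def update_q(q_table, state, action, reward, next_state):
        # TD target: immediate reward + discounted maximum expected future reward
        td_target = reward + gamma * np.max(q_table[next_state])
        td_error = td_target - q_table[state, action]   # Temporal Difference error
        q_table[state, action] += alpha * td_error      # move the former estimate toward the target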
At the start, the agent explores the environment to update the Q-table.
Once the Q-table is ready, the agent starts exploiting it and making better
decisions.
What is temporal difference learning?
Temporal Difference Learning is an unsupervised learning technique that is
very commonly used in reinforcement learning for the purpose of
predicting the total reward expected in the future. It can, however, be
used to predict other quantities as well. It is essentially a way to learn how
to predict a quantity that is dependent on the future values of a given
signal. It is a method that is used to compute the long-term utility of a
pattern of behaviour from a series of intermediate rewards.
The temporal difference algorithm aims to bring the current prediction and
the new prediction closer together, thus matching expectations with reality
and gradually increasing the accuracy of the entire chain of predictions.
Temporal Difference Learning aims to predict a combination of the
immediate reward and its own reward prediction at the next moment in
time.
• Alpha (α): learning rate
It shows how much our estimates should be adjusted, based on the
error. This rate varies between 0 and 1.
• Gamma (γ): the discount rate
This indicates how much future rewards are valued. A larger discount
rate signifies that future rewards are valued to a greater extent. The
discount rate also varies between 0 and 1.
• Epsilon (e): the exploration vs. exploitation trade-off.
This involves exploring new options with probability e and staying at the
current maximum with probability 1-e. A larger e signifies that more
exploration is carried out during training.
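Using the alpha and gamma just defined, a minimal TD(0) prediction update
might look like the sketch below (the state count and parameter values are
illustrative assumptions, not from the text):

    import numpy as np

    alpha, gamma = 0.1, 0.9
    values = np.zeros(16)           # one value estimate per state

    def td0_update(values, state, reward, next_state):
        # nudge the current prediction toward the immediate reward plus the
        # discounted prediction at the next moment in time
        td_target = reward + gamma * values[next_state]
        values[state] += alpha * (td_target - values[state])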
What is the benefit of temporal difference learning?
The advantages of temporal difference learning in machine learning are:
• TD learning methods are able to learn in each step, online or offline.
• These methods are capable of learning from incomplete sequences,
which means that they can also be used in continuous problems.
• Temporal difference learning can function in non-terminating
environments.
• TD Learning has less variance than the Monte Carlo method, because each
update depends on only one random action, transition, and reward.
• It tends to be more efficient than the Monte Carlo method.
• Temporal Difference Learning exploits the Markov property, which
makes it more effective in Markov environments.
What are the disadvantages of temporal difference learning?
There are two main disadvantages:
• It has greater sensitivity towards the initial value.
• It is a biased estimation.