Reinforcement Learning
Unit 1
• Introduction: Elements of Reinforcement Learning, Limitations and
Scope, An Extended Example: Tic-Tac-Toe.
• Tabular Solution Methods: An n-Armed Bandit Problem, Action-Value
Methods, Incremental Implementation, Tracking a Nonstationary
Problem, Optimistic Initial Values, Upper-Confidence Bound Action
Selection, Gradient Bandits.
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Reinforcement learning is a machine learning technique in
which a computer agent learns to perform a task through
repeated trial-and-error interactions with a dynamic
environment.
This learning approach enables the agent to make a
series of decisions that maximize a reward metric for the
task without human intervention and without being
explicitly programmed to achieve the task.
Key Idea:
•Trial & error learning
•Rewards & penalties guide learning
1. Robotics & Automation 🤖 : Self-learning robots (Boston Dynamics, warehouse automation)
2. Game AI 🎮 : AlphaGo, AlphaZero (beating human champions in Go, Chess)
3. Self-Driving Cars 🚗: Tesla Autopilot, Waymo self-driving systems
4. Finance & Stock Trading 📈 : Algorithmic trading bots (predicting stock trends)
5. Healthcare & Drug Discovery 🏥: AI-based personalized treatment plans (e.g., chemotherapy
optimization)
6. Smart Systems & IoT ⚡ : Reinforcement learning for smart cities and automation
7. Recommendation Systems 🎵📺: Personalized content recommendations (Netflix, YouTube, Spotify)
8. Natural Language Processing (NLP) & Chatbots 🤖💬 : AI assistants (ChatGPT, Alexa, Google
Assistant)
9. Cybersecurity 🔒: Detecting cyber threats using RL-based anomaly detection
10. Industrial Process Optimization 🏭 : Predictive maintenance for manufacturing equipment
Elements of Reinforcement Learning
Policy
• A policy defines the learning agent’s way of behaving at a given time.
Roughly speaking, a policy is a mapping from perceived states of the
environment to actions to be taken when in those states.
• In some cases the policy may be a simple function or lookup table,
whereas in others it may involve extensive computation such as a
search process.
Rewards
• A reward signal defines the goal in a reinforcement learning problem.
• On each time step, the environment sends to the reinforcement
learning agent a single number, a reward.
• The agent’s sole objective is to maximize the total reward it receives
over the long run.
• The reward signal thus defines what are the good and bad events for
the agent.
Value Functions
• Whereas the reward signal indicates what is good in an immediate
sense, a value function specifies what is good in the long run.
• The value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting from that state.
• Rewards are in a sense primary, whereas values, as predictions of
rewards, are secondary.
• Without rewards there could be no values, and the only purpose of
estimating values is to achieve more reward.
Models
• A model is something that mimics the behavior of the environment or,
more generally, allows inferences to be made about how the
environment will behave.
• Models are used for planning, by which we mean any way of deciding
on a course of action by considering possible future situations before
they are actually experienced.
• Methods for solving reinforcement learning problems that use models
and planning are called model-based methods.
Limitations & Scope
•RL methods are primarily focused on estimating value functions.
•Alternative methods like evolutionary algorithms can also solve RL problems without value
functions.
Evolutionary Methods
•Methods such as genetic algorithms and simulated annealing evolve policies over time.
•These methods don’t learn from individual actions but instead evaluate overall behavior.
•Effective in some cases, especially when the agent cannot accurately perceive the state of its environment.
Limitations of Evolutionary Methods
•Evolutionary methods do not use the structure of the RL problem (e.g., state-action
relationships).
•Less efficient compared to RL methods that use detailed behavioral interactions.
•Do not learn during interaction with the environment unless evolving learning algorithms.
Policy Gradient Methods
•Unlike evolutionary methods, policy gradient methods adjust policies
during interaction with the environment.
•These methods estimate how to improve a policy by adjusting
numerical parameters.
•More efficient because they use information from individual
interactions to guide learning.
Reinforcement Learning and Optimization
•RL agents aim to maximize the reward, but achieving the maximum
reward is not always possible.
•The focus is on improving the agent's reward over time, not necessarily
achieving the maximum.
AN EXTENDED EXAMPLE: TIC-TAC-TOE
Consider Tic-Tac-Toe as an example to illustrate reinforcement learning (RL) and contrast
it with classical approaches.
In the game, an RL-based player can learn to exploit an imperfect opponent’s mistakes to
maximize its chances of winning.
Classical methods like minimax assume an optimal opponent and may not adapt to real
gameplay, while dynamic programming requires a complete model of the opponent’s
behavior, which is often unavailable.
RL, however, learns from experience by playing multiple games, gradually estimating the
opponent's behavior and improving strategy over time.
This makes RL a practical approach for problems where opponent behavior is unknown.
Reinforcement learning (RL) using a value function can help a player improve at Tic-Tac-Toe by learning
from experience. Here’s a breakdown in simple terms:
1. Using a Value Function
▪ We create a table where each possible game state gets a
number (value) representing our estimated chance of winning.
▪ If a state has three X’s in a row, its value is 1 (since we won). If
three O’s are in a row or the board is full, its value is 0 (since we
lost or tied).
▪ All other states start with a guess of 0.5 (a 50% chance of
winning).
2. Playing and Learning
• Each time we play, we look at possible moves and check their
values.
• Most of the time, we choose the move with the highest value
(best chance of winning).
• Sometimes, we pick a random move to explore new
situations—this is called exploration.
3. Updating the Value Function (Learning from Mistakes and Successes)
• After every move, we update the values in our table.
• If a move leads to a good outcome, we increase the value of the previous state. If it leads to a bad
outcome, we lower the value.
• This update happens step by step, gradually improving our estimates over many games.
4. Why This Works Well
• Over time, our values get more accurate, helping us play optimally against a given opponent.
• If our opponent changes their strategy, we can still adapt by keeping a small learning rate.
5. Comparison with Evolutionary Methods
• Evolutionary methods test entire strategies over many games before making any changes, without considering
individual moves.
• Value function methods learn from each move in real-time, making them faster and more efficient at improving
gameplay.
Analyse how learning occurs in the Tic-Tac-Toe example using value functions.
Numerical example of Tic-Tac-Toe using Reinforcement Learning (RL) with a Value Function.
Step 1: Initial Setup – Assigning Values to Game States
We create a table where each possible Tic-Tac-Toe board state has a value representing our estimated chance of
winning.
•Winning states (e.g., three Xs in a row) get a value of 1.
•Losing states (e.g., three Os in a row) get a value of 0.
•All other states start at 0.5 (since we don’t know yet if they are good or bad).
Step 2: Playing a Game with Learning
We now play a game against an opponent and update our values based on the result.
Game Progression Example
Move 1 (Starting State, Random Selection)
•RL chooses a move randomly (exploration).
•Let's say it picks the middle square (1,1).
The current estimated value of this state is 0.5
Move 2 (Opponent’s Turn)
Opponent (O) moves to position (0,0):
No learning happens yet because it’s not our turn.
Move 3 (Greedy Move Selection)
•We check all possible moves and pick the one with the highest
value.
•Assume our table values suggest moving to (2,2) is best.
•We move X to (2,2).
Move 4 (Opponent Moves)
Opponent (O) moves to (0,1):
Again, no learning happens yet.
Move 5 (Greedy Move, Near Win)
We move X to (1,2):
The estimated value of this state increases since it looks closer
to a win.
Move 6 (Opponent Moves)
Opponent (O) moves to (0,2):
The learning rate (denoted α) is a parameter that controls how much
the agent updates its knowledge after receiving new information.
After a greedy move takes us from state s to state s', the value of s is nudged toward the value of s':
V(s) ← V(s) + α [ V(s') − V(s) ]
The change is thus based on the difference between the estimates at
two successive times. (A short code sketch of this update follows the notes on α below.)
Role of Learning Rate (α):
● High Learning Rate (e.g., α = 0.9):
○ Faster learning, but can be unstable.
○ The agent adapts quickly but may overreact to new information.
● Low Learning Rate (e.g., α = 0.01):
○ Slower learning, but more stable.
○ The agent updates values slowly and does not forget past
experiences quickly.
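Below is a minimal sketch of this value-table update in Python. It assumes a dictionary keyed by (hypothetical) board-state encodings, a fixed learning rate α, and an ε fraction of exploratory moves; the state encoding and move-generation helpers are placeholders, not part of the original example.

```python
import random

ALPHA = 0.1      # learning rate: how far V(s) moves toward V(s')
EPSILON = 0.1    # fraction of moves chosen at random (exploration)

V = {}           # value table: state -> estimated probability of winning

def value(state):
    # States we have never seen start at the neutral guess of 0.5.
    return V.get(state, 0.5)

def choose_move(state, legal_moves, result_of):
    """Mostly pick the move leading to the highest-valued state; sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(legal_moves)                               # exploration
    return max(legal_moves, key=lambda m: value(result_of(state, m)))   # exploitation

def update(prev_state, next_state):
    """Back up the earlier state's value toward the later one:
    V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[prev_state] = value(prev_state) + ALPHA * (value(next_state) - value(prev_state))
```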
Tabular Solution Methods
Multi-arm Bandits
• Reinforcement learning (RL) is different from other types of
learning because it learns from feedback that evaluates
actions rather than directly telling which action is correct.
This means RL relies on trial and error to find good actions.
• There are two types of feedback:
1. Evaluative feedback – tells how good an action was but not
if it was the best or worst option. This is common in
optimization methods like evolutionary algorithms.
2. Instructive feedback – directly tells the correct action to
take, regardless of what was chosen. This is used in
supervised learning, such as neural networks.
These two types of feedback are usually separate but can sometimes be mixed.
To understand evaluative feedback better, we first study a simple version of reinforcement learning called the n-armed
bandit problem. This problem helps introduce key learning methods before moving on to more complex RL scenarios where
actions must be chosen in different situations.
A One-Armed Bandit Problem
A one-armed bandit is a nickname for a slot machine, the
kind you see in casinos. It gets this name because:
1. "One-armed" – Traditional slot machines have a single
lever (or "arm") on the side that you pull to spin the
reels. Modern machines often use buttons instead, but
the name remains.
2. "Bandit" – The machine takes your money and rarely
pays out big winnings, just like a thief or "bandit."
How Does a One-Armed Bandit Work?
● You insert money (coins, bills, or a card with credits).
● You pull the lever (or press a button).
● The reels spin and then stop at random symbols.
● If the symbols match a winning combination, you
receive a payout. Otherwise, you lose your bet.
n-Armed Bandit Problem
Imagine you walk into a casino filled with slot machines (also called
"one-armed bandits" because of their lever). However, instead of just one
machine, you have n different slot machines to choose from. Each machine
gives out different amounts of money when you pull its lever, but you don’t
know in advance which one pays the most. Your goal is to maximize your
total winnings over a fixed number of turns (let’s say 1000).
Now, think about how you should play to win as much as possible.
Assume the player is fully aware of the payout
probabilities of each machine
and plays the first machine 1000 times
The non-triviality of the multi-armed bandit problem lies in the fact that we (the agent) cannot access the true bandit probability
distributions; all learning is carried out via trial and error and value estimation.
So we now have two methods:
1. Play the machine that has given you the highest rewards so far (this is called exploitation – taking advantage of what you
know).
2. Try a different machine to see if it gives you a higher reward (this is called exploration – gathering more information).
Exploitation: the agent tries only Machine 2 and
concludes that its 30% success rate is the best available.
Exploration: the agent tries all machines 250 times each.
This creates a conflict between exploration and exploitation:
● If you always exploit (greedy selection), you might stick to a machine that seems good but isn’t actually the best.
● If you explore too much, you might waste time on bad machines and miss out on higher rewards from the best machine.
● The key challenge is finding the right balance between exploration and exploitation.
Balancing exploration (trying new actions) and exploitation (choosing the
best-known action) is a key challenge in reinforcement learning.
● Short-term vs. Long-term Tradeoff: Exploring may lead to lower rewards
initially, but it helps discover better actions that can be exploited for higher
rewards in the long run.
● Complexity: The best choice between exploration and exploitation
depends on factors like estimated values, uncertainties, and remaining time
steps.
● Limitations of Advanced Methods: Many sophisticated techniques
assume stable conditions and prior knowledge, which may not hold in
real-world applications.
● Focus : Instead of perfect optimization, the goal is to find simple strategies
that effectively balance exploration and exploitation, as always exploiting is
suboptimal.
Real Life Bandit Problems
The multi-armed bandit (MAB) problem models decision-making under uncertainty, making it useful in
various real-world scenarios where we must balance exploration (trying new things) and exploitation (using
the best-known option).
• Design of Experiments (Clinical Trials): A doctor is testing multiple treatments for a disease but doesn’t
know which one works best.The doctor tries different treatments (exploration) while favoring the most
effective one (exploitation).
• Online Ad placement :An advertiser has multiple ad versions but doesn’t know which one gets the most
clicks.
• Website page persona: A website has multiple page designs but doesn’t know which one converts best
(e.g., more sign-ups, purchases).
• Games: A game AI needs to choose the best strategy to win or adapt to a player's skill level.
Action-Value Methods – Greedy Method / Sample-Average Method
Let's explore simple balancing methods for the n-armed bandit problem and show that they work much better than methods that
always exploit.
Action-value methods are simple techniques used in reinforcement learning to estimate the value of different actions and make
decisions based on these estimates.
A gambler is playing a 5-armed bandit machine, where each arm provides a reward based on an unknown probability
distribution. The gambler plays for 15 rounds, receiving the following sequence of actions and rewards:
Compute the estimated action values Q(a) for each action after 15 rounds.
Which action has the highest estimated value, and should the gambler always select it moving forward?
Step 1: Compute the Estimated Action Values Q(a)
The estimated action value Qt(a) is calculated as the sample average:
Qt(a) = ∑R / Nt(a)
where:
● ∑R is the sum of all rewards received for that action.
● Nt(a) is the number of times the action has been selected.
Let's compute the values step by step for each action.
Action A:
● Rewards received: 6.0, 5.5, 4.8, 5.2
● Number of times selected: 4
● Q(A) = 21.5 / 4 = 5.375
Action B:
● Rewards received: 7.5, 5.8, 6.2, 6.1
● Number of times selected: 4
● Q(B) = 25.6 / 4 = 6.40
Action C:
● Rewards received: 3.8, 4.5, 4.1
● Number of times selected: 3
● Q(C) = 12.4 / 3 ≈ 4.13
Action D:
● Rewards received: 6.9, 7.4, 6.8
● Number of times selected: 3
● Q(D) = 21.1 / 3 ≈ 7.03
Action E:
● Rewards received: 2.5
● Number of times selected: 1
● Q(E) = 2.5 / 1 = 2.5
Step 2: Identify the Best Action
The final estimated values are: Q(A) = 5.375, Q(B) = 6.40, Q(C) ≈ 4.13, Q(D) ≈ 7.03, Q(E) = 2.5.
The action with the highest estimated value is D (≈ 7.03).
If we were following a greedy strategy, we would now always select D.
Limitation : If we always select the action with the highest estimated value (D), we risk missing better opportunities
because:
● The estimates might still be inaccurate due to insufficient exploration.
● Actions B and A have relatively high values, and their true values might be higher than D.
Instead, the gambler should balance exploration and exploitation.
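The sample-average calculation above can be reproduced with a few lines of Python; the reward lists are copied from the worked example, and the greedy pick falls out of a simple argmax.

```python
# Rewards observed for each arm over the 15 rounds (from the example above).
rewards = {
    "A": [6.0, 5.5, 4.8, 5.2],
    "B": [7.5, 5.8, 6.2, 6.1],
    "C": [3.8, 4.5, 4.1],
    "D": [6.9, 7.4, 6.8],
    "E": [2.5],
}

# Sample-average estimate: Q(a) = (sum of rewards for a) / (times a was selected)
Q = {a: sum(r) / len(r) for a, r in rewards.items()}

for a, q in sorted(Q.items()):
    print(f"Q({a}) = {q:.3f}")

print("Greedy choice:", max(Q, key=Q.get))   # D, with Q(D) ≈ 7.03
```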
A digital marketing team is testing four different ad campaigns (A, B, C, D) for a product. Each ad generates revenue
based on click-through rates (CTR), but the true effectiveness of each ad is unknown. The team follows a greedy
method to maximize immediate revenue.
Given Data: Each ad has an unknown true average revenue per click, but the team only sees the revenue from each
trial. They start testing with random ad selections and then switch to a greedy strategy (always picking the highest
estimated ad).
They run 12 trials and record the following selections and observed revenues (in dollars):
1. Compute the estimated revenue for each ad after 12 trials
using the greedy method.
2. Identify which ad will be selected in the next round under
the greedy approach.
3. Explain the risk of using the greedy method in this case.
Compute the Estimated Revenue for Each Ad
Since Ad D has the highest estimated revenue
(7.25), the greedy method will always select Ad
D in the next round.
Risks of the Greedy Method
Although the greedy method maximizes short-term revenue, it has two major risks:
1. Lack of Exploration:
○ Ad C was tested only once. It might actually be a high-revenue option, but because it was not tested
enough, it gets ignored.
○ The same applies to Ad A, which was tested only twice.
2. Potential for a Suboptimal Choice:
○ The greedy method assumes Ad D is the best, but the estimates may still be inaccurate.
○ A better strategy would be ε-greedy, where some fraction of selections (e.g., 10%) is used for exploring
other ads.
Greedy Play : Suppose the reinforcement learning player was greedy, that is, it always played the move
that brought it to the position that it rated the best. Would it learn to play better, or worse, than a
nongreedy player? What problems might occur?
Lack of Exploration
Greedy player focuses only on immediate rewards.
Prevents discovering new, potentially better strategies.
Overfitting to Current Knowledge
Relies on early value estimates that may be inaccurate (mis-estimates).
Reinforces suboptimal actions based on misjudged rewards.
Local Optima
Sticks to strategies that seem optimal in the short term.
May miss better long-term strategies that require suboptimal moves initially.
Slow Learning and Convergence
Inefficient learning due to limited exploration.
May take longer to converge to an optimal strategy.
Non-greedy Players Perform Better
Non-greedy strategies (e.g., epsilon-greedy) balance exploration and exploitation.
More likely to discover better overall strategies over time.
Epsilon Greedy Method
The ε-greedy algorithm is a simple and effective action-selection strategy used in reinforcement learning (RL) and
multi-armed bandit problems. It is designed to balance two conflicting objectives:
1. Exploration – Trying new actions to discover potentially better rewards.
2. Exploitation – Choosing the best-known action to maximize immediate rewards.
How the ε-Greedy Algorithm Works
First choose each arm once
Then at each step, the agent decides between:
● Greedy action (exploitation): With probability (1−ϵ), the agent selects the action with the highest estimated value.
● Random action (exploration): With probability ϵ, the agent selects a random action.
Consider a 2-arm bandit.
Step 1: Choose each arm once – Arm 1
Step 2: Choose each arm once – Arm 2
Step 3: Assume ε = 1
Choose the empirically best arm with probability 0%
Choose an arm uniformly at random with probability 100%
Step 4: Assume ε = 0.9
Choose the empirically best arm with probability 10%
Choose an arm uniformly at random with probability 90%
How to choose between the best arm and a random arm:
Assume a random number r = np.random.rand() = 0.48
Choose the empirically best arm if r >= ε
Choose an arm uniformly at random if r < ε; here 0.48 < 0.9, so choose randomly
Step 5: Assume ε = 0.5
Choose the empirically best arm with probability 50%
Choose an arm uniformly at random with probability 50%
How to choose between the best arm and a random arm:
Assume we get a random number r = np.random.rand() = 0.8
Choose the empirically best arm if r >= ε; here 0.8 > 0.5, so choose the best arm
Choose an arm uniformly at random if r < ε
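A minimal sketch of ε-greedy selection for a small bandit, using the same r = np.random.rand() comparison as the walkthrough above; the reward model for each arm is an assumption made only for illustration.

```python
import numpy as np

def eps_greedy(Q, eps):
    """Return an arm index: random with probability eps, greedy otherwise."""
    r = np.random.rand()
    if r < eps:
        return np.random.randint(len(Q))   # explore: uniform random arm
    return int(np.argmax(Q))               # exploit: empirically best arm

def pull(arm):
    # Stand-in reward model (assumed for illustration only).
    return np.random.normal([1.0, 1.5][arm], 1.0)

Q = np.zeros(2)   # estimated value of each arm
N = np.zeros(2)   # number of times each arm was pulled

for arm in range(2):                 # steps 1-2: try each arm once
    reward = pull(arm)
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]

for t in range(1000):                # afterwards: epsilon-greedy selection
    arm = eps_greedy(Q, eps=0.1)
    reward = pull(arm)
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]

print(np.round(Q, 2), N)             # Q approaches [1.0, 1.5]; arm 2 is pulled far more often
```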
Decay of ϵ:
At the beginning, the agent explores more to gather information.
Over time, exploration reduces, and the agent exploits the best-known option.
ε is adaptively reduced, leading to efficient learning.
This approach balances exploration and exploitation, ensuring the algorithm does not get stuck in
suboptimal decisions.
Decay of ϵ:
● The term CK/(Δ²t) represents how ϵ decays over time. As t (the number of actions taken so far) increases, the term becomes smaller, meaning ϵ gets
smaller as time goes on. This means the agent will gradually focus more on exploitation (i.e., choosing the best arm) over time, rather than exploring
random arms.
● The factor CK scales the decay: C can be adjusted to control the rate of decay, and K is the number of arms, so that having more arms raises the
exploration rate.
The Role of Δ²:
● The term Δ² is the squared difference between the optimal arm's reward (Q∗) and the estimated reward of a given arm (Q(a)).
● A larger Δ (a larger gap between the optimal and suboptimal arms) means the agent has more confidence in the optimal arm and is more likely to exploit it.
This encourages the agent to focus more on exploitation as Δ grows.
● If Δ is small (the rewards of the arms are similar), the agent will explore more to better distinguish between the arms.
Optimal Arm: This is the arm (action) that provides the highest expected reward over time. It is the best arm among all available arms in terms of
long-term return.
Suboptimal Arms: These are the arms that provide lower expected rewards compared to the optimal arm.
Delta (Δ) - The Gap Between Optimal and Suboptimal Arms: Δ represents the difference between the expected reward of the optimal arm (Q∗) and the
expected reward of a suboptimal arm (Q(a)).
Δ=Q∗−Q(a)
A larger Δ means that the optimal arm has a much higher expected reward than a suboptimal arm, giving the agent more confidence that choosing the
optimal arm will consistently yield higher returns.
A smaller Δ means that the rewards of the optimal arm and the suboptimal arms are closer, which makes it harder for the agent to distinguish between them,
requiring more exploration to learn which arm is truly the best.
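A small sketch of this decay schedule; the values of C, K, and Δ below are assumptions chosen only to show the shape of the decay, and ε is capped at 1 so it remains a valid probability.

```python
def epsilon_schedule(t, C=0.2, K=10, delta=0.3):
    """Decaying exploration rate of the form C*K / (delta^2 * t), capped at 1."""
    return min(1.0, (C * K) / (delta ** 2 * t))

for t in (1, 10, 100, 1000, 10000):
    print(t, round(epsilon_schedule(t), 4))
# epsilon starts at 1 (pure exploration) and shrinks toward 0 (mostly exploitation).
```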
Limitations of the Action-Value Method:
1. Memory Usage:
○ The method requires maintaining a record of the rewards for each action, which
could become memory-intensive if the number of actions is large, especially in
real-world problems.
2. Computational Cost:
○ In its straightforward form, the action-value method requires recalculating the
average value each time an action is selected, which can be computationally
expensive as the number of rewards increases for each action.
Incremental implementation
The solution to this is the use of incremental updating to compute the action-value estimate without having to
store all the past rewards.
For some action, let Qk denote the estimate at the time of its kth reward, that is, the average of its first k − 1 rewards. Given
this average and the kth reward for the action, Rk, the average of all k rewards can be computed incrementally as
Qk+1 = Qk + (1/k) [ Rk − Qk ]
Advantages of incremental Approach
• The key benefit of this incremental formula is that it only requires storing the
current average Qk and the new reward Rk, avoiding the need to store and sum all
previous rewards.
• The memory usage remains constant regardless of how many times the action is
selected, and the computation per step is constant (O(1)).
• The action value is updated efficiently without needing to revisit all previous
rewards.
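A minimal sketch of the incremental update: only the running count and the current estimate are stored, and each new reward costs O(1) work. The reward stream is simulated here purely for illustration.

```python
import random

class IncrementalAverage:
    """Keeps Qk+1 = Qk + (1/k) * (Rk - Qk) without storing past rewards."""
    def __init__(self):
        self.q = 0.0   # current estimate
        self.k = 0     # number of rewards seen

    def update(self, reward):
        self.k += 1
        self.q += (reward - self.q) / self.k
        return self.q

est = IncrementalAverage()
for _ in range(1000):
    est.update(random.gauss(2.0, 1.0))   # assumed reward distribution with mean 2.0
print(round(est.q, 3))                   # converges toward the true mean (about 2.0)
```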
Tracking a Nonstationary Problem
Consider example
Imagine you run an online store and track which products customers like the most. If you simply average all past sales, your
recommendations may become outdated.
For example, if winter coats were popular months ago but now it's summer, those past sales shouldn't influence today's
recommendations as much. Instead, using a constant step-size update, you can give more weight to recent purchases (like
sunglasses or swimsuits) and adapt quickly to changing customer preferences.
This way, your store always suggests the most relevant products, keeping customers happy!
Tracking a Nonstationary Problem
In reinforcement learning, the non-stationary n-armed bandit problem is a variation of the classical
multi-armed bandit (MAB) problem, where the reward distribution of each arm changes over time. This
poses an additional challenge since optimal strategies must adapt to these changes instead of converging to
a fixed optimal arm.
Changing Reward Distributions
● The expected reward of each arm may drift over time due to external factors.
● Changes could be abrupt (e.g., sudden shifts) or gradual (e.g., slow drift).
● Classical stationary bandit algorithms (like ε-greedy) assume fixed reward distributions.
● In non-stationary environments, older observations may become irrelevant, requiring
algorithms to be more adaptive.
In such cases it makes sense to weight recent rewards more heavily than long-past ones. One of the most
popular ways of doing this is to use a constant step-size parameter. Instead of treating all past rewards
equally, we use a weighted average, where recent rewards are given more importance.
Consider the incremental implementation equation for estimating the action value.
The incremental update rule for updating an average Qk of the k − 1 past rewards is modified to use a constant step-size parameter α ∈ (0, 1]:
Qk+1 = Qk + α [ Rk − Qk ]
Expanding this recursion shows that Qk+1 is a weighted average of the past rewards and the initial estimate Q1:
Qk+1 = (1−α)^k Q1 + Σ (i = 1 to k) α (1−α)^(k−i) Ri
How Are the Weights Assigned?
● The most recent reward Rₖ gets the highest weight.
● The older rewards Rᵢ get progressively lower weights.
● The weights decay exponentially (shrink quickly as time passes).
where:
● α (alpha) = Step size (a small number like 0.1 or 0.01). It controls how much importance the new reward gets.
● (1−α) = A number less than 1, meaning that as time passes, the weight of past rewards gets smaller and smaller.
● k−i = The difference between the current step k and the past step i, meaning how long ago the reward was received.
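A sketch contrasting the sample average with the constant step-size update on a reward stream whose mean jumps halfway through; the drift itself is an assumption made to illustrate nonstationarity.

```python
import random

ALPHA = 0.1                 # constant step size: recent rewards carry more weight
q_sample, n = 0.0, 0        # sample-average estimate
q_const = 0.0               # constant step-size estimate

for t in range(2000):
    true_mean = 1.0 if t < 1000 else 4.0        # nonstationary: mean shifts at t = 1000
    r = random.gauss(true_mean, 1.0)

    n += 1
    q_sample += (r - q_sample) / n              # weights all past rewards equally
    q_const += ALPHA * (r - q_const)            # exponentially decaying weights

print(round(q_sample, 2), round(q_const, 2))    # the constant-alpha estimate tracks the new mean
```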
Optimistic Initial Values in Reinforcement
Learning
When training a reinforcement learning (RL) agent, the initial values of action estimates
(Q(a)) can significantly impact learning efficiency. Optimistic Initialization is a technique
where we start with high initial values to encourage exploration early on.
● Previous methods use 0 as the default initial value for the action-value estimates
● This means they depend to some extent on the initial action-value estimates Q1(a)
● You can also say they are biased by their initial estimates
Optimistic Initial Values
Encourage Exploration
● Initial action values can also be used as a way to encourage exploration
● Suppose that we set the action values for all actions to +5 in 10-armed testbed problem
instead of 0
■ As discussed in Part 1, the true action values q*(a) are selected from a normal
distribution with mean 0 and variance 1.
■ Initial estimate of +5 is optimistic but it encourages exploration because whichever
actions are initially selected the reward is less than starting estimates
■ This means the agent switches to other actions after being disappointed with the
rewards it is receiving
■ This means that all actions are tried several times before the value estimates
converge
■ The system does a fair amount of exploration even if greedy actions are selected all
the time
Upper Confidence Bound
● There is always uncertainty about the accuracy in the action-value estimates
● Hence exploration is needed
● Epsilon-greedy action selection forces non-greedy action to be tried but it does so indiscriminately,
with no preference for those that are nearly greedy or particularly uncertain
● It would be better to select non-greedy action according to their potential for actually being optimal
● This means taking into account
■ how close their estimates are to maximum value
■ and uncertainties in those estimates
One effective way of doing this is to select actions according to the equation below:
At = argmax over a of [ Qt(a) + c √( ln t / Nt(a) ) ]
● ln t denotes the natural logarithm of t
● Nt(a) denotes the number of times that action a has been selected prior to time t
● the number c>0 controls the degree of exploration
● The square root term is a measure of the uncertainty or variance in the estimate of a's action value
● If Nt(a) = 0, then a is considered to be a maximizing action.(means a needs to be tried more)
● The quantity inside the square bracket is thus the upper bound on the possible true value of action a, with c
determining the confidence level
● Each time a is selected the uncertainty is presumably reduced; Nt(a) is incremented and, as it appears in
the denominator of the uncertainty term, the term is decreased.
● However, each time an action other than a is selected, t increases but Nt(a) does not; this increases
a's uncertainty term
● The use of natural logarithm means that the increases get smaller over time but are unbounded
● All actions will eventually be selected but actions with lower value estimates, or that have already been
selected frequently, will be selected with decreasing frequency over time
So, in simple terms, UCB helps you explore uncertain options but gradually focuses on the best choice as you gather more
data
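A minimal sketch of UCB action selection following the formula above: untried arms are treated as maximizing actions, and the bonus term shrinks as an arm accumulates pulls. The 5-arm reward model is assumed for illustration.

```python
import numpy as np

np.random.seed(1)
K = 5
q_true = np.random.normal(0.0, 1.0, K)    # assumed true action values
c = 2.0                                   # degree of exploration

Q = np.zeros(K)                           # action-value estimates
N = np.zeros(K)                           # pull counts

for t in range(1, 1001):
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:                  # Nt(a) = 0 -> a is a maximizing action
        a = int(untried[0])
    else:
        ucb = Q + c * np.sqrt(np.log(t) / N)   # upper confidence bound per arm
        a = int(np.argmax(ucb))

    r = np.random.normal(q_true[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print("true best arm:", int(np.argmax(q_true)), "| most-pulled arm:", int(np.argmax(N)))
```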
Gradient Bandits
Gradient Bandit algorithms are a reinforcement learning approach where the agent doesn’t estimate
action values directly. Instead, it learns preferences for actions using a gradient-based update rule.
These methods are useful when rewards are non-stationary or unbounded.
Here we consider learning a numerical preference Ht(a) for each action a.
The larger the preference, the more often that action is taken, but the preference has no interpretation
in terms of reward.
Instead of maintaining action-value estimates Q(a), gradient bandits use preferences H(a) for each
action. These preferences determine a softmax (i.e., Gibbs or Boltzmann) probability distribution for
selecting actions:
Pr{At = a} = πt(a) = e^(Ht(a)) / Σb e^(Ht(b))
(The softmax function is a mathematical function that converts a set of real numbers into a
probability distribution.)
Higher preferences correspond to higher selection probabilities.
Why Use Softmax?
● Ensures all probabilities sum to 1.
● Assigns higher probability to actions with higher preferences.
● Keeps a balance between exploration and exploitation.
● Unlike ε-greedy, it assigns some probability to all actions rather than selecting a random one.
Simple Example: Softmax in Action
Imagine you are choosing between 3 restaurants, and you have assigned them the following preference scores:
Burger Place has the highest preference, so it
gets the highest probability (47%).
Pizza Spot has a lower preference but still has
a significant chance (32%).
Sushi Bar has the lowest probability (21%),
but it is still chosen sometimes.
Softmax transforms numerical scores into probabilities.
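A small sketch of the softmax conversion. The preference scores below are assumptions chosen only so that the resulting probabilities roughly reproduce the 47% / 32% / 21% split quoted above.

```python
import numpy as np

# Assumed preference scores H(a) for illustration.
preferences = {"Burger Place": 2.0, "Pizza Spot": 1.6, "Sushi Bar": 1.2}

H = np.array(list(preferences.values()))
probs = np.exp(H) / np.exp(H).sum()        # softmax: non-negative and sums to 1

for name, p in zip(preferences, probs):
    print(f"{name}: {p:.0%}")              # roughly 47%, 32%, 21%
```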
Updating Preferences in Softmax-Based Gradient Bandit Algorithms
In gradient bandit algorithms, we use preferences instead of estimated values (Q-values) to decide which action to take. These
preferences are updated over time based on the rewards received, which helps the algorithm learn which actions are better.
After selecting action At at time t and receiving reward Rt, the preferences are updated as
Ht+1(At) = Ht(At) + α (Rt − R̄t) (1 − πt(At))   for the selected action, and
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a)   for every other action a ≠ At,
where R̄t is the average of all rewards received so far (the baseline).
1. If an action gets a higher reward than expected (Rt > R̄t):
○ The preference for that action increases (more likely to
be chosen).
○ Other actions' preferences decrease slightly.
2. If an action gets a lower reward than expected (Rt < R̄t):
○ The preference for that action decreases (less likely to be
chosen).
○ Other actions' preferences increase slightly.
○ This helps maintain a probability distribution over
actions and ensures exploration.
✅ When an action gets a high reward, its preference increases, making it more likely to be chosen.
✅ Other actions' preferences decrease slightly to balance exploration.
✅ Over time, the best actions are chosen more often while still exploring others.
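A minimal sketch of the full gradient bandit loop: softmax selection, an incrementally updated reward baseline R̄, and the preference update above. The 3-arm reward model and step size are assumptions for illustration.

```python
import numpy as np

np.random.seed(2)
q_true = np.array([0.2, 0.5, 1.0])    # assumed true mean rewards of the 3 arms
K = len(q_true)
alpha = 0.1                           # step size

H = np.zeros(K)                       # action preferences
baseline, t = 0.0, 0                  # running average of all rewards (R-bar)

for step in range(2000):
    pi = np.exp(H) / np.exp(H).sum()              # softmax action probabilities
    a = int(np.random.choice(K, p=pi))
    r = np.random.normal(q_true[a], 1.0)

    t += 1
    baseline += (r - baseline) / t                # incremental average reward

    # Chosen action moves up if r beat the baseline (down otherwise);
    # all other actions move the opposite way, scaled by their probabilities.
    one_hot = np.zeros(K)
    one_hot[a] = 1.0
    H += alpha * (r - baseline) * (one_hot - pi)

print("final selection probabilities:", np.round(pi, 2))   # mass concentrates on the best arm
```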
Assignment