Reinforcement Learning
Unit 1
• Introduction: Elements of Reinforcement Learning, Limitations and
Scope, An Extended Example: Tic-Tac-Toe.
• Tabular Solution Methods: An n-Armed Bandit Problem, Action-Value
Methods, Incremental Implementation, Tracking a Nonstationary
Problem, Optimistic Initial Values, Upper-Confidence Bound Action
Selection, Gradient Bandits.
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Reinforcement learning is a machine learning technique in
which a computer agent learns to perform a task through
repeated trial-and-error interactions with a dynamic
environment.
This learning approach enables the agent to make a
series of decisions that maximize a reward metric for the
task without human intervention and without being
explicitly programmed to achieve the task.
Key Idea:
•Trial & error learning
•Rewards & penalties guide learning
1. Robotics & Automation 🤖 : Self-learning robots (Boston Dynamics, warehouse automation)
2. Game AI 🎮 : AlphaGo, AlphaZero (beating human champions in Go, Chess)
3. Self-Driving Cars 🚗: Tesla Autopilot, Waymo self-driving systems
4. Finance & Stock Trading 📈 : Algorithmic trading bots (predicting stock trends)
5. Healthcare & Drug Discovery 🏥: AI-based personalized treatment plans (e.g., chemotherapy
optimization)
6. Smart Systems & IoT ⚡ : Reinforcement learning for smart cities and automation
7. Recommendation Systems 🎵📺: Personalized content recommendations (Netflix, YouTube, Spotify)
8. Natural Language Processing (NLP) & Chatbots 🤖💬 : AI assistants (ChatGPT, Alexa, Google
Assistant)
9. Cybersecurity 🔒: Detecting cyber threats using RL-based anomaly detection
10. Industrial Process Optimization 🏭 : Predictive maintenance for manufacturing equipment
Elements of Reinforcement Learning
Policy
• A policy defines the learning agent’s way of behaving at a given time.
Roughly speaking, a policy is a mapping from perceived states of the
environment to actions to be taken when in those states.
• In some cases the policy may be a simple function or lookup table,
whereas in others it may involve extensive computation such as a
search process.
Rewards
• A reward signal defines the goal in a reinforcement learning problem.
• On each time step, the environment sends to the reinforcement
learning agent a single number, a reward.
• The agent’s sole objective is to maximize the total reward it receives
over the long run.
• The reward signal thus defines what are the good and bad events for
the agent.
Value Functions
• Whereas the reward signal indicates what is good in an immediate
sense, a value function specifies what is good in the long run.
• The value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting from that state.
• Rewards are in a sense primary, whereas values, as predictions of
rewards, are secondary.
• Without rewards there could be no values, and the only purpose of
estimating values is to achieve more reward.
Models
• A model is something that mimics the behavior of the environment or,
more generally, allows inferences to be made about how the
environment will behave.
• Models are used for planning, by which we mean any way of deciding
on a course of action by considering possible future situations before
they are actually experienced.
• Methods for solving reinforcement learning problems that use models
and planning are called model-based methods.
Limitations & Scope
•RL methods are primarily focused on estimating value functions.
•Alternative methods like evolutionary algorithms can also solve RL problems without value
functions.
Evolutionary Methods
•Methods such as genetic algorithms and simulated annealing evolve policies over time.
•These methods don’t learn from individual actions but instead evaluate overall behavior.
•Effective in some cases, especially when the agent cannot accurately perceive the state of its environment.
Limitations of Evolutionary Methods
•Evolutionary methods do not use the structure of the RL problem (e.g., state-action
relationships).
•Less efficient compared to RL methods that use detailed behavioral interactions.
•Do not learn during interaction with the environment unless evolving learning algorithms.
Policy Gradient Methods
•Unlike evolutionary methods, policy gradient methods adjust policies
during interaction with the environment.
•These methods estimate how to improve a policy by adjusting
numerical parameters.
•More efficient because they use information from individual
interactions to guide learning.
Reinforcement Learning and Optimization
•RL agents aim to maximize the reward, but achieving the maximum
reward is not always possible.
•The focus is on improving the agent's reward over time, not necessarily
achieving the maximum.
AN EXTENDED EXAMPLE: TIC-TAC-TOE
Consider Tic-Tac-Toe as an example to illustrate reinforcement learning (RL) and contrast
it with classical approaches.
In the game, an RL-based player can learn to exploit an imperfect opponent’s mistakes to
maximize its chances of winning.
Classical methods like minimax assume an optimal opponent and may not adapt to real
gameplay, while dynamic programming requires a complete model of the opponent’s
behavior, which is often unavailable.
RL, however, learns from experience by playing multiple games, gradually estimating the
opponent's behavior and improving strategy over time.
This makes RL a practical approach for problems where opponent behavior is unknown.
Reinforcement learning (RL) using a value function can help a player improve at Tic-Tac-Toe by learning
from experience. Here’s a breakdown in simple terms:
1. Using a Value Function
▪ We create a table where each possible game state gets a
number (value) representing our estimated chance of winning.
▪ If a state has three X’s in a row, its value is 1 (since we won). If
three O’s are in a row or the board is full, its value is 0 (since we
lost or tied).
▪ All other states start with a guess of 0.5 (a 50% chance of
winning).
2. Playing and Learning
• Each time we play, we look at possible moves and check their
values.
• Most of the time, we choose the move with the highest value
(best chance of winning).
• Sometimes, we pick a random move to explore new
situations—this is called exploration.
3. Updating the Value Function (Learning from Mistakes and Successes)
• After every move, we update the values in our table.
• If a move leads to a good outcome, we increase the value of the previous state. If it leads to a bad
outcome, we lower the value.
• This update happens step by step, gradually improving our estimates over many games.
4. Why This Works Well
• Over time, our values get more accurate, helping us play optimally against a given opponent.
• If our opponent changes their strategy, we can still adapt by keeping a small learning rate.
5. Comparison with Evolutionary Methods
• Evolutionary methods test entire strategies over many games before making any changes, without considering
individual moves.
• Value function methods learn from each move in real-time, making them faster and more efficient at improving
gameplay.
Analyse how learning occurs in the Tic-Tac-Toe example using value functions.
Numerical example of Tic-Tac-Toe using Reinforcement Learning (RL) with a Value Function.
Step 1: Initial Setup – Assigning Values to Game States
We create a table where each possible Tic-Tac-Toe board state has a value representing our estimated chance of
winning.
•Winning states (e.g., three Xs in a row) get a value of 1.
•Losing states (e.g., three Os in a row) get a value of 0.
•All other states start at 0.5 (since we don’t know yet if they are good or bad).
Step 2: Playing a Game with Learning
We now play a game against an opponent and update our values based on the result.
Game Progression Example
Move 1 (Starting State, Random Selection)
•RL chooses a move randomly (exploration).
•Let's say it picks the middle square (1,1).
The current estimated value of this state is 0.5
Move 2 (Opponent’s Turn)
Opponent (O) moves to position (0,0):
No learning happens yet because it’s not our turn.
Move 3 (Greedy Move Selection)
•We check all possible moves and pick the one with the highest
value.
•Assume our table values suggest moving to (2,2) is best.
•We move X to (2,2).
Move 4 (Opponent Moves)
Opponent (O) moves to (0,1):
Again, no learning happens yet.
Move 5 (Greedy Move, Near Win)
We move X to (1,2):
The estimated value of this state increases since it looks closer
to a win.
Move 6 (Opponent Moves)
Opponent (O) moves to (0,2):
The learning rate (denoted α) is a parameter that controls how much
the agent updates its knowledge after receiving new information.
After a greedy move takes us from state s to state s', the value of s is nudged toward the value of s':
V(s) ← V(s) + α [ V(s') − V(s) ]
The change is thus based on the difference between the estimates at
two successive times. (A short code sketch of this update follows the notes on α below.)
Role of Learning Rate (α):
● High Learning Rate (e.g., α = 0.9):
○ Faster learning, but can be unstable.
○ The agent adapts quickly but may overreact to new information.
● Low Learning Rate (e.g., α = 0.01):
○ Slower learning, but more stable.
○ The agent updates values slowly and does not forget past
experiences quickly.
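Below is a minimal sketch of this value-table update in Python. It assumes a dictionary keyed by (hypothetical) board-state encodings, a fixed learning rate α, and an ε fraction of exploratory moves; the state encoding and move-generation helpers are placeholders, not part of the original example.

```python
import random

ALPHA = 0.1      # learning rate: how far V(s) moves toward V(s')
EPSILON = 0.1    # fraction of moves chosen at random (exploration)

V = {}           # value table: state -> estimated probability of winning

def value(state):
    # States we have never seen start at the neutral guess of 0.5.
    return V.get(state, 0.5)

def choose_move(state, legal_moves, result_of):
    """Mostly pick the move leading to the highest-valued state; sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(legal_moves)                               # exploration
    return max(legal_moves, key=lambda m: value(result_of(state, m)))   # exploitation

def update(prev_state, next_state):
    """Back up the earlier state's value toward the later one:
    V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[prev_state] = value(prev_state) + ALPHA * (value(next_state) - value(prev_state))
```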
Tabular Solution Methods
Multi-arm Bandits
• Reinforcement learning (RL) is different from other types of
learning because it learns from feedback that evaluates
actions rather than directly telling which action is correct.
This means RL relies on trial and error to find good actions.
• There are two types of feedback:
1. Evaluative feedback – tells how good an action was but not
if it was the best or worst option. This is common in
optimization methods like evolutionary algorithms.
2. Instructive feedback – directly tells the correct action to
take, regardless of what was chosen. This is used in
supervised learning, such as neural networks.
These two types of feedback are usually separate but can sometimes be mixed.
To understand evaluative feedback better, we first study a simple version of reinforcement learning called the n-armed
bandit problem. This problem helps introduce key learning methods before moving on to more complex RL scenarios where
actions must be chosen in different situations.
A One-Armed Bandit Problem
A one-armed bandit is a nickname for a slot machine, the
kind you see in casinos. It gets this name because:
1. "One-armed" – Traditional slot machines have a single
lever (or "arm") on the side that you pull to spin the
reels. Modern machines often use buttons instead, but
the name remains.
2. "Bandit" – The machine takes your money and rarely
pays out big winnings, just like a thief or "bandit."
How Does a One-Armed Bandit Work?
● You insert money (coins, bills, or a card with credits).
● You pull the lever (or press a button).
● The reels spin and then stop at random symbols.
● If the symbols match a winning combination, you
receive a payout. Otherwise, you lose your bet.
n-Armed Bandit Problem
Imagine you walk into a casino filled with slot machines (also called
"one-armed bandits" because of their lever). However, instead of just one
machine, you have n different slot machines to choose from. Each machine
gives out different amounts of money when you pull its lever, but you don’t
know in advance which one pays the most. Your goal is to maximize your
total winnings over a fixed number of turns (let’s say 1000).
Now, think about how you should play to win as much as possible.
Assume the player is fully aware of the payout
probabilities of each machine
and plays the first machine 1000 times
The non-triviality of the multi-armed bandit problem lies in the fact that we (the agent) cannot access the true bandit probability
distributions; all learning is carried out via trial and error and value estimation.
So we now have two methods:
1. Play the machine that has given you the highest rewards so far (this is called exploitation – taking advantage of what you
know).
2. Try a different machine to see if it gives you a higher reward (this is called exploration – gathering more information).
Exploitation: the agent tries only Machine 2 and
concludes that its 30% success rate is the best available.
Exploration: the agent tries all machines 250 times each.
This creates a conflict between exploration and exploitation:
● If you always exploit (greedy selection), you might stick to a machine that seems good but isn’t actually the best.
● If you explore too much, you might waste time on bad machines and miss out on higher rewards from the best machine.
● The key challenge is finding the right balance between exploration and exploitation.
Balancing exploration (trying new actions) and exploitation (choosing the
best-known action) is a key challenge in reinforcement learning.
● Short-term vs. Long-term Tradeoff: Exploring may lead to lower rewards
initially, but it helps discover better actions that can be exploited for higher
rewards in the long run.
● Complexity: The best choice between exploration and exploitation
depends on factors like estimated values, uncertainties, and remaining time
steps.
● Limitations of Advanced Methods: Many sophisticated techniques
assume stable conditions and prior knowledge, which may not hold in
real-world applications.
● Focus : Instead of perfect optimization, the goal is to find simple strategies
that effectively balance exploration and exploitation, as always exploiting is
suboptimal.
Real Life Bandit Problems
The multi-armed bandit (MAB) problem models decision-making under uncertainty, making it useful in
various real-world scenarios where we must balance exploration (trying new things) and exploitation (using
the best-known option).
• Design of Experiments (Clinical Trials): A doctor is testing multiple treatments for a disease but doesn’t
know which one works best.The doctor tries different treatments (exploration) while favoring the most
effective one (exploitation).
• Online Ad placement :An advertiser has multiple ad versions but doesn’t know which one gets the most
clicks.
• Website page persona: A website has multiple page designs but doesn’t know which one converts best
(e.g., more sign-ups, purchases).
• Games: A game AI needs to choose the best strategy to win or adapt to a player's skill level.
Action-Value Methods – Greedy Method / Sample-Average Method
Let's explore simple balancing methods for the n-armed bandit problem and show that they work much better than methods that
always exploit.
Action-value methods are simple techniques used in reinforcement learning to estimate the value of different actions and make
decisions based on these estimates.
A gambler is playing a 5-armed bandit machine, where each arm provides a reward based on an unknown probability
distribution. The gambler plays for 15 rounds, receiving the following sequence of actions and rewards:
Compute the estimated action values Q(a) for each action after 15 rounds.
Which action has the highest estimated value, and should the gambler always select it moving forward?
Step 1: Compute the Estimated Action Values Q(a)
The estimated action value Qt(a) is calculated as the sample average:
Qt(a) = ∑R / Nt(a)
where:
● ∑R is the sum of all rewards received for that action.
● Nt(a) is the number of times the action has been selected.
Let's compute the values step by step for each action.
Action A:
● Rewards received: 6.0, 5.5, 4.8, 5.2
● Number of times selected: 4
● Q(A) = 21.5 / 4 = 5.375
Action B:
● Rewards received: 7.5, 5.8, 6.2, 6.1
● Number of times selected: 4
● Q(B) = 25.6 / 4 = 6.40
Action C:
● Rewards received: 3.8, 4.5, 4.1
● Number of times selected: 3
● Q(C) = 12.4 / 3 ≈ 4.13
Action D:
● Rewards received: 6.9, 7.4, 6.8
● Number of times selected: 3
● Q(D) = 21.1 / 3 ≈ 7.03
Action E:
● Rewards received: 2.5
● Number of times selected: 1
● Q(E) = 2.5 / 1 = 2.5
Step 2: Identify the Best Action
The final estimated values are: Q(A) = 5.375, Q(B) = 6.40, Q(C) ≈ 4.13, Q(D) ≈ 7.03, Q(E) = 2.5.
The action with the highest estimated value is D (≈ 7.03).
If we were following a greedy strategy, we would now always select D.
Limitation : If we always select the action with the highest estimated value (D), we risk missing better opportunities
because:
● The estimates might still be inaccurate due to insufficient exploration.
● Actions B and A have relatively high values, and their true values might be higher than D.
Instead, the gambler should balance exploration and exploitation.
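The sample-average calculation above can be reproduced with a few lines of Python; the reward lists are copied from the worked example, and the greedy pick falls out of a simple argmax.

```python
# Rewards observed for each arm over the 15 rounds (from the example above).
rewards = {
    "A": [6.0, 5.5, 4.8, 5.2],
    "B": [7.5, 5.8, 6.2, 6.1],
    "C": [3.8, 4.5, 4.1],
    "D": [6.9, 7.4, 6.8],
    "E": [2.5],
}

# Sample-average estimate: Q(a) = (sum of rewards for a) / (times a was selected)
Q = {a: sum(r) / len(r) for a, r in rewards.items()}

for a, q in sorted(Q.items()):
    print(f"Q({a}) = {q:.3f}")

print("Greedy choice:", max(Q, key=Q.get))   # D, with Q(D) ≈ 7.03
```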
A digital marketing team is testing four different ad campaigns (A, B, C, D) for a product. Each ad generates revenue
based on click-through rates (CTR), but the true effectiveness of each ad is unknown. The team follows a greedy
method to maximize immediate revenue.
Given Data: Each ad has an unknown true average revenue per click, but the team only sees the revenue from each
trial. They start testing with random ad selections and then switch to a greedy strategy (always picking the highest
estimated ad).
They run 12 trials and record the following selections and observed revenues (in dollars):
1. Compute the estimated revenue for each ad after 12 trials
using the greedy method.
2. Identify which ad will be selected in the next round under
the greedy approach.
3. Explain the risk of using the greedy method in this case.
Compute the Estimated Revenue for Each Ad
Since Ad D has the highest estimated revenue
(7.25), the greedy method will always select Ad
D in the next round.
Risks of the Greedy Method
Although the greedy method maximizes short-term revenue, it has two major risks:
1. Lack of Exploration:
○ Ad C was tested only once. It might actually be a high-revenue option, but because it was not tested
enough, it gets ignored.
○ The same applies to Ad A, which was tested only twice.
2. Potential for a Suboptimal Choice:
○ The greedy method assumes Ad D is the best, but the estimates may still be inaccurate.
○ A better strategy would be ε-greedy, where some fraction of selections (e.g., 10%) is used for exploring
other ads.
Greedy Play : Suppose the reinforcement learning player was greedy, that is, it always played the move
that brought it to the position that it rated the best. Would it learn to play better, or worse, than a
nongreedy player? What problems might occur?
Lack of Exploration
Greedy player focuses only on immediate rewards.
Prevents discovering new, potentially better strategies.
Overfitting to Current Knowledge
Relies on early value estimates that may be inaccurate (mis-estimates).
Reinforces suboptimal actions based on misjudged rewards.
Local Optima
Sticks to strategies that seem optimal in the short term.
May miss better long-term strategies that require suboptimal moves initially.
Slow Learning and Convergence
Inefficient learning due to limited exploration.
May take longer to converge to an optimal strategy.
Non-greedy Players Perform Better
Non-greedy strategies (e.g., epsilon-greedy) balance exploration and exploitation.
More likely to discover better overall strategies over time.
Epsilon Greedy Method
The ε-greedy algorithm is a simple and effective action-selection strategy used in reinforcement learning (RL) and
multi-armed bandit problems. It is designed to balance two conflicting objectives:
1. Exploration – Trying new actions to discover potentially better rewards.
2. Exploitation – Choosing the best-known action to maximize immediate rewards.
How the ε-Greedy Algorithm Works
First choose each arm once
Then at each step, the agent decides between:
● Greedy action (exploitation): With probability (1−ϵ), the agent selects the action with the highest estimated value.
● Random action (exploration): With probability ϵ, the agent selects a random action.
Consider a 2-arm bandit.
Step 1: Choose each arm once – Arm 1
Step 2: Choose each arm once – Arm 2
Step 3: Assume ε = 1
Choose the empirically best arm with probability 0%
Choose an arm uniformly at random with probability 100%
Step 4: Assume ε = 0.9
Choose the empirically best arm with probability 10%
Choose an arm uniformly at random with probability 90%
How to choose between the best arm and a random arm:
Assume a random number r = np.random.rand() = 0.48
Choose the empirically best arm if r >= ε
Choose an arm uniformly at random if r < ε; here 0.48 < 0.9, so choose randomly
Step 5: Assume ε = 0.5
Choose the empirically best arm with probability 50%
Choose an arm uniformly at random with probability 50%
How to choose between the best arm and a random arm:
Assume we get a random number r = np.random.rand() = 0.8
Choose the empirically best arm if r >= ε; here 0.8 > 0.5, so choose the best arm
Choose an arm uniformly at random if r < ε
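A minimal sketch of ε-greedy selection for a small bandit, using the same r = np.random.rand() comparison as the walkthrough above; the reward model for each arm is an assumption made only for illustration.

```python
import numpy as np

def eps_greedy(Q, eps):
    """Return an arm index: random with probability eps, greedy otherwise."""
    r = np.random.rand()
    if r < eps:
        return np.random.randint(len(Q))   # explore: uniform random arm
    return int(np.argmax(Q))               # exploit: empirically best arm

def pull(arm):
    # Stand-in reward model (assumed for illustration only).
    return np.random.normal([1.0, 1.5][arm], 1.0)

Q = np.zeros(2)   # estimated value of each arm
N = np.zeros(2)   # number of times each arm was pulled

for arm in range(2):                 # steps 1-2: try each arm once
    reward = pull(arm)
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]

for t in range(1000):                # afterwards: epsilon-greedy selection
    arm = eps_greedy(Q, eps=0.1)
    reward = pull(arm)
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]

print(np.round(Q, 2), N)             # Q approaches [1.0, 1.5]; arm 2 is pulled far more often
```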
Decay of ϵ:
At the beginning, the agent explores more to gather information.
Over time, exploration reduces, and the agent exploits the best-known option.
ε is adaptively reduced, leading to efficient learning.
This approach balances exploration and exploitation, ensuring the algorithm does not get stuck in
suboptimal decisions.
Decay of ϵ:
● The term CK/(Δ²t) represents how ϵ decays over time. As t (the number of actions taken so far) increases, the term becomes smaller, meaning ϵ gets
smaller as time goes on. This means the agent will gradually focus more on exploitation (i.e., choosing the best arm) over time, rather than exploring
random arms.
● The factor CK scales the decay: C can be adjusted to control the rate of decay, and K is the number of arms, so that having more arms raises the
exploration rate.
The Role of Δ²:
● The term Δ² is the squared difference between the optimal arm's reward (Q∗) and the estimated reward of a given arm (Q(a)).
● A larger Δ (a larger gap between the optimal and suboptimal arms) means the agent has more confidence in the optimal arm and is more likely to exploit it.
This encourages the agent to focus more on exploitation as Δ grows.
● If Δ is small (the rewards of the arms are similar), the agent will explore more to better distinguish between the arms.
Optimal Arm: This is the arm (action) that provides the highest expected reward over time. It is the best arm among all available arms in terms of
long-term return.
Suboptimal Arms: These are the arms that provide lower expected rewards compared to the optimal arm.
Delta (Δ) - The Gap Between Optimal and Suboptimal Arms: Δ represents the difference between the expected reward of the optimal arm (Q∗) and the
expected reward of a suboptimal arm (Q(a)).
Δ=Q∗−Q(a)
A larger Δ means that the optimal arm has a much higher expected reward than a suboptimal arm, giving the agent more confidence that choosing the
optimal arm will consistently yield higher returns.
A smaller Δ means that the rewards of the optimal arm and the suboptimal arms are closer, which makes it harder for the agent to distinguish between them,
requiring more exploration to learn which arm is truly the best.
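A small sketch of this decay schedule; the values of C, K, and Δ below are assumptions chosen only to show the shape of the decay, and ε is capped at 1 so it remains a valid probability.

```python
def epsilon_schedule(t, C=0.2, K=10, delta=0.3):
    """Decaying exploration rate of the form C*K / (delta^2 * t), capped at 1."""
    return min(1.0, (C * K) / (delta ** 2 * t))

for t in (1, 10, 100, 1000, 10000):
    print(t, round(epsilon_schedule(t), 4))
# epsilon starts at 1 (pure exploration) and shrinks toward 0 (mostly exploitation).
```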
Limitations of the Action-Value Method:
1. Memory Usage:
○ The method requires maintaining a record of the rewards for each action, which
could become memory-intensive if the number of actions is large, especially in
real-world problems.
2. Computational Cost:
○ In its straightforward form, the action-value method requires recalculating the
average value each time an action is selected, which can be computationally
expensive as the number of rewards increases for each action.
Incremental implementation
The solution to this is the use of incremental updating to compute the action-value estimate without having to
store all the past rewards.
For some action, let Qk denote the estimate at the time of its kth reward, that is, the average of its first k − 1 rewards. Given
this average and the kth reward for the action, Rk, the average of all k rewards can be computed incrementally as
Qk+1 = Qk + (1/k) [ Rk − Qk ]
Advantages of incremental Approach
• The key benefit of this incremental formula is that it only requires storing the
current average Qk and the new reward Rk, avoiding the need to store and sum all
previous rewards.
• The memory usage remains constant regardless of how many times the action is
selected, and the computation per step is constant (O(1)).
• The action value is updated efficiently without needing to revisit all previous
rewards.
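A minimal sketch of the incremental update: only the running count and the current estimate are stored, and each new reward costs O(1) work. The reward stream is simulated here purely for illustration.

```python
import random

class IncrementalAverage:
    """Keeps Qk+1 = Qk + (1/k) * (Rk - Qk) without storing past rewards."""
    def __init__(self):
        self.q = 0.0   # current estimate
        self.k = 0     # number of rewards seen

    def update(self, reward):
        self.k += 1
        self.q += (reward - self.q) / self.k
        return self.q

est = IncrementalAverage()
for _ in range(1000):
    est.update(random.gauss(2.0, 1.0))   # assumed reward distribution with mean 2.0
print(round(est.q, 3))                   # converges toward the true mean (about 2.0)
```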
Tracking a Nonstationary Problem
Consider example
Imagine you run an online store and track which products customers like the most. If you simply average all past sales, your
recommendations may become outdated.
For example, if winter coats were popular months ago but now it's summer, those past sales shouldn't influence today's
recommendations as much. Instead, using a constant step-size update, you can give more weight to recent purchases (like
sunglasses or swimsuits) and adapt quickly to changing customer preferences.
This way, your store always suggests the most relevant products, keeping customers happy!
Tracking a Nonstationary Problem
In reinforcement learning, the non-stationary n-armed bandit problem is a variation of the classical
multi-armed bandit (MAB) problem, where the reward distribution of each arm changes over time. This
poses an additional challenge since optimal strategies must adapt to these changes instead of converging to
a fixed optimal arm.
Changing Reward Distributions
● The expected reward of each arm may drift over time due to external factors.
● Changes could be abrupt (e.g., sudden shifts) or gradual (e.g., slow drift).
● Classical stationary bandit algorithms (like ε-greedy) assume fixed reward distributions.
● In non-stationary environments, older observations may become irrelevant, requiring
algorithms to be more adaptive.
In such cases it makes sense to weight recent rewards more heavily than long-past ones. One of the most
popular ways of doing this is to use a constant step-size parameter. Instead of treating all past rewards
equally, we use a weighted average, where recent rewards are given more importance.
Consider the incremental implementation equation for estimating the action value.
The incremental update rule for updating an average Qk of the k − 1 past rewards is modified to use a constant step-size parameter α ∈ (0, 1]:
Qk+1 = Qk + α [ Rk − Qk ]
Expanding this recursion shows that Qk+1 is a weighted average of the past rewards and the initial estimate Q1:
Qk+1 = (1−α)^k Q1 + Σ (i = 1 to k) α (1−α)^(k−i) Ri
How Are the Weights Assigned?
● The most recent reward Rₖ gets the highest weight.
● The older rewards Rᵢ get progressively lower weights.
● The weights decay exponentially (shrink quickly as time passes).
where:
● α (alpha) = Step size (a small number like 0.1 or 0.01). It controls how much importance the new reward gets.
● (1−α) = A number less than 1, meaning that as time passes, the weight of past rewards gets smaller and smaller.
● k−i = The difference between the current step k and the past step i, meaning how long ago the reward was received.
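A sketch contrasting the sample average with the constant step-size update on a reward stream whose mean jumps halfway through; the drift itself is an assumption made to illustrate nonstationarity.

```python
import random

ALPHA = 0.1                 # constant step size: recent rewards carry more weight
q_sample, n = 0.0, 0        # sample-average estimate
q_const = 0.0               # constant step-size estimate

for t in range(2000):
    true_mean = 1.0 if t < 1000 else 4.0        # nonstationary: mean shifts at t = 1000
    r = random.gauss(true_mean, 1.0)

    n += 1
    q_sample += (r - q_sample) / n              # weights all past rewards equally
    q_const += ALPHA * (r - q_const)            # exponentially decaying weights

print(round(q_sample, 2), round(q_const, 2))    # the constant-alpha estimate tracks the new mean
```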
Optimistic Initial Values in Reinforcement
Learning
When training a reinforcement learning (RL) agent, the initial values of action estimates
(Q(a)) can significantly impact learning efficiency. Optimistic Initialization is a technique
where we start with high initial values to encourage exploration early on.
● Previous methods use 0 as the default initial value for the action-value estimates
● This means they depend to some extent on the initial action-value estimates Q1(a)
● You can also say they are biased by their initial estimates
Optimistic Initial Values
Encourage Exploration
● Initial action values can also be used as a way to encourage exploration
● Suppose that we set the action values for all actions to +5 in 10-armed testbed problem
instead of 0
■ As discussed in Part 1, the true action values q*(a) are selected from a normal
distribution with mean 0 and variance 1.
■ Initial estimate of +5 is optimistic but it encourages exploration because whichever
actions are initially selected the reward is less than starting estimates
■ This means the agent switches to other actions after being disappointed with the
rewards it is receiving
■ This means that all actions are tried several times before the value estimates
converge
■ The system does a fair amount of exploration even if greedy actions are selected all
the time
Upper Confidence Bound
● There is always uncertainty about the accuracy in the action-value estimates
● Hence exploration is needed
● Epsilon-greedy action selection forces non-greedy action to be tried but it does so indiscriminately,
with no preference for those that are nearly greedy or particularly uncertain
● It would be better to select non-greedy action according to their potential for actually being optimal
● This means taking into account
■ how close their estimates are to maximum value
■ and uncertainties in those estimates
One effective way of doing this is to select actions according to the equation below:
At = argmax over a of [ Qt(a) + c √( ln t / Nt(a) ) ]
● ln t denotes the natural logarithm of t
● Nt(a) denotes the number of times that action a has been selected prior to time t
● the number c>0 controls the degree of exploration
● The square root term is a measure of the uncertainty or variance in the estimate of a's action value
● If Nt(a) = 0, then a is considered to be a maximizing action.(means a needs to be tried more)
● The quantity inside the square bracket is thus the upper bound on the possible true value of action a, with c
determining the confidence level
● Each time a is selected the uncertainty is presumably reduced; Nt(a) is incremented and, as it appears in
the denominator of the uncertainty term, the term is decreased.
● However, each time an action other than a is selected, t increases but Nt(a) does not; this increases
a's uncertainty term
● The use of natural logarithm means that the increases get smaller over time but are unbounded
● All actions will eventually be selected but actions with lower value estimates, or that have already been
selected frequently, will be selected with decreasing frequency over time
So, in simple terms, UCB helps you explore uncertain options but gradually focuses on the best choice as you gather more
data
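A minimal sketch of UCB action selection following the formula above: untried arms are treated as maximizing actions, and the bonus term shrinks as an arm accumulates pulls. The 5-arm reward model is assumed for illustration.

```python
import numpy as np

np.random.seed(1)
K = 5
q_true = np.random.normal(0.0, 1.0, K)    # assumed true action values
c = 2.0                                   # degree of exploration

Q = np.zeros(K)                           # action-value estimates
N = np.zeros(K)                           # pull counts

for t in range(1, 1001):
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:                  # Nt(a) = 0 -> a is a maximizing action
        a = int(untried[0])
    else:
        ucb = Q + c * np.sqrt(np.log(t) / N)   # upper confidence bound per arm
        a = int(np.argmax(ucb))

    r = np.random.normal(q_true[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print("true best arm:", int(np.argmax(q_true)), "| most-pulled arm:", int(np.argmax(N)))
```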
Gradient Bandits
Gradient Bandit algorithms are a reinforcement learning approach where the agent doesn’t estimate
action values directly. Instead, it learns preferences for actions using a gradient-based update rule.
These methods are useful when rewards are non-stationary or unbounded.
Here we consider learning a numerical preference Ht(a) for each action a.
The larger the preference, the more often that action is taken, but the preference has no interpretation
in terms of reward.
Instead of maintaining action-value estimates Q(a), gradient bandits use preferences H(a) for each
action. These preferences determine a softmax (i.e., Gibbs or Boltzmann) probability distribution for
selecting actions:
Pr{At = a} = πt(a) = e^(Ht(a)) / Σb e^(Ht(b))
(The softmax function is a mathematical function that converts a set of real numbers into a
probability distribution.)
Higher preferences correspond to higher selection probabilities.
Why Use Softmax?
● Ensures all probabilities sum to 1.
● Assigns higher probability to actions with higher preferences.
● Keeps a balance between exploration and exploitation.
● Unlike ε-greedy, it assigns some probability to all actions rather than selecting a random one.
Simple Example: Softmax in Action
Imagine you are choosing between 3 restaurants, and you have assigned them the following preference scores:
Burger Place has the highest preference, so it
gets the highest probability (47%).
Pizza Spot has a lower preference but still has
a significant chance (32%).
Sushi Bar has the lowest probability (21%),
but it is still chosen sometimes.
Softmax transforms numerical scores into probabilities.
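A small sketch of the softmax conversion. The preference scores below are assumptions chosen only so that the resulting probabilities roughly reproduce the 47% / 32% / 21% split quoted above.

```python
import numpy as np

# Assumed preference scores H(a) for illustration.
preferences = {"Burger Place": 2.0, "Pizza Spot": 1.6, "Sushi Bar": 1.2}

H = np.array(list(preferences.values()))
probs = np.exp(H) / np.exp(H).sum()        # softmax: non-negative and sums to 1

for name, p in zip(preferences, probs):
    print(f"{name}: {p:.0%}")              # roughly 47%, 32%, 21%
```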
Updating Preferences in Softmax-Based Gradient Bandit Algorithms
In gradient bandit algorithms, we use preferences instead of estimated values (Q-values) to decide which action to take. These
preferences are updated over time based on the rewards received, which helps the algorithm learn which actions are better.
After selecting action At at time t and receiving reward Rt, the preferences are updated as
Ht+1(At) = Ht(At) + α (Rt − R̄t) (1 − πt(At))   for the selected action, and
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a)   for every other action a ≠ At,
where R̄t is the average of all rewards received so far (the baseline).
1. If an action gets a higher reward than expected (Rt > R̄t):
○ The preference for that action increases (more likely to
be chosen).
○ Other actions' preferences decrease slightly.
2. If an action gets a lower reward than expected (Rt < R̄t):
○ The preference for that action decreases (less likely to be
chosen).
○ Other actions' preferences increase slightly.
○ This helps maintain a probability distribution over
actions and ensures exploration.
✅ When an action gets a high reward, its preference increases, making it more likely to be chosen.
✅ Other actions' preferences decrease slightly to balance exploration.
✅ Over time, the best actions are chosen more often while still exploring others.
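A minimal sketch of the full gradient bandit loop: softmax selection, an incrementally updated reward baseline R̄, and the preference update above. The 3-arm reward model and step size are assumptions for illustration.

```python
import numpy as np

np.random.seed(2)
q_true = np.array([0.2, 0.5, 1.0])    # assumed true mean rewards of the 3 arms
K = len(q_true)
alpha = 0.1                           # step size

H = np.zeros(K)                       # action preferences
baseline, t = 0.0, 0                  # running average of all rewards (R-bar)

for step in range(2000):
    pi = np.exp(H) / np.exp(H).sum()              # softmax action probabilities
    a = int(np.random.choice(K, p=pi))
    r = np.random.normal(q_true[a], 1.0)

    t += 1
    baseline += (r - baseline) / t                # incremental average reward

    # Chosen action moves up if r beat the baseline (down otherwise);
    # all other actions move the opposite way, scaled by their probabilities.
    one_hot = np.zeros(K)
    one_hot[a] = 1.0
    H += alpha * (r - baseline) * (one_hot - pi)

print("final selection probabilities:", np.round(pi, 2))   # mass concentrates on the best arm
```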
Assignment