10 REINFORCEMENT LEARNING

Fall 2023 CS54118 Machine Learning


Credits
1. B2: Marsland, Stephen. Machine Learning: An Algorithmic Perspective. 2nd Edition. CRC Press, 2015.
2. https://www.samyzaf.com/ML/rl/qmaze.html
3. http://www.gwydir.demon.co.uk/jo/maze/makemaze/index.htm
4. https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
Assignment
Read:
B2: Chapter 11

Problems:
B2:11.1, 11.2, 11.4
Reinforcement Learning
 Information is provided about whether or not the answer is correct, but not how to improve it.
 A reinforcement learner has to try out different strategies and see which work best.
 Trying out ≡ searching, a fundamental part of any reinforcement learner
 It searches over the state space of possible inputs and outputs in order to try to maximize a reward.
 Reinforcement learning: the idea comes from animal intelligence
 You repeat an action that gives you more satisfaction (or reward). Such an action gets reinforced by the situation that caused it.
An Example
 The robot mouse has no idea of the room layout – where the obstructions and the open paths are
 The robot mouse only knows how to recognize the block with cheese – the end goal
Restricted Actions
 It can move only in one of four directions
 Approach: explore and learn
Learning Part
 Cell (3,5) and action 2 (going right) results in punishment
 Punishment can be denoted by some negative value
 Cell (3,9) and action 3 (going down) results in just another block
 That is, neither hitting an obstruction nor reaching the goal
 Can be represented by a zero value
 Cell (9,10) and action 3 (going down) results in reaching the goal
 Can be represented by a high positive value
 This also makes the cell (9,10) very valuable, because it has a way to reach the goal
 Cell (9,9) and action 2 (going right) results in reaching the valuable cell (9,10)
 Can be given a high positive value
 Hence, cell (9,9) also becomes valuable
 How much "history" we want to maintain is also a choice
 After learning sufficiently, the best action for each state will look something like this
 Three things
 Current input or the state
 Possible things that can be done or the actions
 Aim: to maximize reward

 Agent: the thing that is doing the acting and learning

 Environment: where the agent acts
• State: sensor readings
• May not tell everything to the robot
• There may be noise/inaccuracies
• Actions: possible ways in which the robot can drive its motors
• Actions act on the environment
• Reward: how well it navigates to the goal with minimal crashing
 Reward can be delayed or it can be instant
 E.g. a robot navigating a maze vs. a robot driving a car
 Policy: choice of action that should be taken
 Should be a combination of exploitation and exploration
Policy: Exploitation vs. Exploration
 Exploitation
 Take the best action learnt so far

 Exploration
 Give sub-optimal actions a chance
 Absorbing state: your goal state, when the solution you were searching
for is found
 Also called Terminal or Accepting state.
 State space: set of all states that the learner can experience
 Action space: set of all actions that the learner can take
EXAMPLE: GETTING LOST

• You know that 𝐹 is the absorbing state, because you are vaguely familiar with it
• You decide to reward yourself with chips only if you reach 𝐹
 You don't have the city map with you; hence, you are not aware of the following reward matrix (rows: current state; columns: action, i.e., next state):

            Action or Next State
States     A     B     C     D     E     F
A        -5.0   0.0  -inf  -inf  -inf  -inf
B         0.0  -5.0   0.0   0.0  -inf  -inf
C        -inf   0.0  -5.0   0.0  -inf  100.0
D        -inf   0.0   0.0  -5.0   0.0  -inf
E        -inf  -inf  -inf   0.0  -5.0  100.0
F        -inf  -inf   0.0  -inf  -inf   0.0


Carrots and Sticks: The Reward Function
 Reward Function
 Input: current state and chosen action
 Output: numerical reward based on them

 It is generated by the environment around the learner
 It is not internal to the learner
 Two parts
 An intermediate part (at every step or so)
 A pay-off in the end
 The reward tells the learner what the goal is, not how the goal should be achieved; the latter would be supervised learning.
 It is usually a bad idea to include sub-goals (e.g., to speed up learning)
 The learner can find methods of achieving the sub-goals without actually achieving the real goal
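As a small illustration, a reward function belongs to the environment and maps (state, action) to a number. The sketch below is hypothetical (the cell coordinates and values just mirror the robot-mouse story above; it is not lecture code):

```python
# Hypothetical reward function for the robot-mouse maze: the environment maps
# (current state, chosen action) to a numerical reward.
REWARDS = {
    ((3, 5), "right"): -1,    # bumps into an obstruction: punishment
    ((3, 9), "down"):   0,    # just another empty block
    ((9, 10), "down"): 100,   # reaches the cheese: big pay-off at the end
}

def reward(state, action):
    # default of 0: nothing special happened on this move
    return REWARDS.get((state, action), 0)

print(reward((9, 10), "down"))   # 100
```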
 For continual tasks (e.g., learning to walk): there is no terminal state
 In general, we want to predict the reward into the infinite future
 Solution: discounting
 We discount future rewards depending upon how "far" they are in the future:

𝑅_𝑡 = 𝑟_{𝑡+1} + 𝛾𝑟_{𝑡+2} + 𝛾²𝑟_{𝑡+3} + ⋯ = Σ_{𝑘=0}^{∞} 𝛾^𝑘 𝑟_{𝑡+𝑘+1}

where 0 < 𝛾 < 1 is the discounting factor
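A quick numerical sketch of this formula (the reward sequence below is made up, not from the lecture):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum future rewards, weighting a reward k steps ahead by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 100 as the third future reward, with gamma = 0.3:
print(discounted_return([0, 0, 100], gamma=0.3))  # 0 + 0 + 0.3**2 * 100 = 9.0
```

The further away the pay-off, the smaller its contribution to the value of the current state.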


Action Selection
 Each action is assigned a value based on
 The current state
 How it has been rewarded in the past

 The reward that action 𝑎 has received over the 𝑡 times it has been taken from state 𝑠: 𝑄_{𝑠,𝑡}(𝑎)
 Will eventually converge to the true prediction
Based on the prediction 𝑄_{𝑠,𝑡}(𝑎), there are three methods of choosing 𝑎
 Greedy: pick the action with the highest 𝑄_{𝑠,𝑡}(𝑎)
 𝛜-Greedy: similar to greedy, but with a small probability 𝜖 we pick some other action at random
 Works better than greedy in practice.
 Soft-max: refinement of 𝜖-greedy, where action 𝑎 is chosen with probability
   𝑃(𝑎) = exp(𝑄_{𝑠,𝑡}(𝑎)/𝜏) / Σ_𝑏 exp(𝑄_{𝑠,𝑡}(𝑏)/𝜏)
 𝝉 (temperature):
 Large 𝜏: all actions have similar probability
 Small 𝜏: individual probabilities matter more
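A minimal sketch of the three selection rules, assuming the estimates for one state are held in a simple array (function names and numbers are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(q_row):
    # always exploit: pick the action with the highest estimate
    return int(np.argmax(q_row))

def epsilon_greedy(q_row, eps=0.1):
    # with small probability eps, explore a random action instead
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def softmax(q_row, tau=1.0):
    # large tau: nearly uniform probabilities; small tau: nearly greedy
    prefs = np.exp((np.asarray(q_row) - np.max(q_row)) / tau)
    probs = prefs / prefs.sum()
    return int(rng.choice(len(q_row), p=probs))

q_row = [0.1, 0.5, 0.2]   # estimated values Q_{s,t}(a) for one state s
print(greedy(q_row), epsilon_greedy(q_row), softmax(q_row, tau=0.5))
```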
Policy
 Choice of which action to take in each state in order to get optimal
results: 𝜋
 It is a mapping from states to actions.
 Goal: learn better 𝜋 for each state 𝑠𝑡 as we proceed.
 Two things:
 How much past information we need to know while getting into the current
state (use of Markov Decision Process)
 How we assign a value to the current state
MARKOV DECISION PROCESSES
 Is the information in the current state sufficient to compute the reward for the next move (e.g., chess)?
 Such a state is called a Markov state
OR
 We need to consider how we got to the current state, i.e., a sense of history needs to be stored.
 A reinforcement learning problem that satisfies the Markov state property is known as a Markov Decision Process (MDP)
 The number of possible states and actions is assumed to be finite.
 Markov Chain: states + transition probabilities
 Markov Decision Process: extension of a Markov chain with actions and rewards
Markov Chain

A simple example of a Markov decision process to decide on the state of your mind tomorrow, given your state of mind today.
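A tiny sketch of a Markov chain as states plus a transition-probability matrix; the moods and probabilities below are invented for illustration, not read off the slide's figure:

```python
import numpy as np

states = ["happy", "neutral", "sad"]
P = np.array([[0.6, 0.3, 0.1],   # transition probabilities from "happy"
              [0.3, 0.4, 0.3],   # from "neutral"
              [0.2, 0.3, 0.5]])  # from "sad" -- each row sums to 1

rng = np.random.default_rng(0)
s = 0                            # start in "happy"
for day in range(5):
    s = rng.choice(len(states), p=P[s])   # tomorrow depends only on today
    print(day + 1, states[s])
```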
Markov Decision Process

• Addition: selecting an action does not guarantee a particular next state.
• We add action nodes coming out of state nodes.
• Multiple state nodes can be connected to an action node.

Transition Diagram

MDP for getting lost example


VALUES
 Value denotes the expected future reward the reinforcement learner is trying to maximize
 Two ways:
 State-value function 𝑉(𝑠): average the reward across all of the actions that can be taken

 Action-value function 𝑄(𝑠, 𝑎): consider each possible action that can be taken from the current state separately.

 State-value function is less accurate but easier to compute


 Action-value function is more accurate but requires more information to compute
 Now, we have two problems
 Deciding which action to take, i.e., the policy
 Predicting the value function

 Optimal policy, 𝜋*, is the one (not necessarily unique) in which the value function is the greatest over all possible states.
 Optimal value functions can be of two types:
 Optimal state-value function: 𝑉*(𝑠) = max_𝜋 𝑉^𝜋(𝑠)
 Assumes taking the optimal action in each case
 Optimal action-value function: 𝑄*(𝑠, 𝑎) = max_𝜋 𝑄^𝜋(𝑠, 𝑎)
 Considers taking action 𝑎 this time, and then following the optimal policy from then on
 Hence, optimal action value = current reward + discounted estimate of the future reward:
   𝑄*(𝑠, 𝑎) = 𝐸[𝑟_{𝑡+1} + 𝛾 max_{𝑎′} 𝑄*(𝑠_{𝑡+1}, 𝑎′)]
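Once an optimal action-value table is available, the optimal policy and the state values fall out of it directly; a small sketch (the Q numbers below are made up):

```python
import numpy as np

# Q*(s, a) stored as an (n_states x n_actions) array; -inf marks impossible moves
Q = np.array([[ 1.4, 16.0, -np.inf],
              [ 6.4, 11.0,  40.0 ]])

pi_star = np.argmax(Q, axis=1)   # pi*(s) = argmax_a Q*(s, a)
v_star  = np.max(Q, axis=1)      # V*(s)  = max_a    Q*(s, a)
print(pi_star, v_star)           # [1 2] [16. 40.]
```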
 How do we update values?
Updated Estimate ← Current Estimate + 𝜇(New Estimate − Current Estimate)
Temporal Difference (TD) method

 If we also remember the previous 𝜆 states and update their value functions, then the above algorithm is called 𝑇𝐷(𝜆).
TD(0)

Off-policy decision (Q-learning):
  𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝜇(𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎))

On-policy decision (SARSA):
  𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝜇(𝑟 + 𝛾𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎))
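The two updates differ only in how they bootstrap from the next state; a plain-Python sketch, assuming Q is a table indexed as Q[state][action] (not lecture code):

```python
def q_learning_update(Q, s, a, r, s_next, mu=0.6, gamma=0.3):
    # off-policy: bootstrap from the best action available in the next state
    target = r + gamma * max(Q[s_next])
    Q[s][a] += mu * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, mu=0.6, gamma=0.3):
    # on-policy: bootstrap from the action that will actually be taken next
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += mu * (target - Q[s][a])

Q = [[0.0, 0.0], [0.0, 1.0]]
q_learning_update(Q, s=0, a=1, r=0, s_next=1)
print(Q[0][1])   # 0.6 * (0.3 * 1.0 - 0.0) = 0.18
```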
Recall Getting Lost Example

Reward Matrix (rows: current state; columns: action, i.e., next state)
States     A     B     C     D     E     F
A        -5.0   0.0  -inf  -inf  -inf  -inf
B         0.0  -5.0   0.0   0.0  -inf  -inf
C        -inf   0.0  -5.0   0.0  -inf  100.0
D        -inf   0.0   0.0  -5.0   0.0  -inf
E        -inf  -inf  -inf   0.0  -5.0  100.0
F        -inf  -inf   0.0  -inf  -inf   0.0
Q-learning Example
 For the getting lost example, consider the following trail of steps for an iteration: A,B,A,B,D,D,B,C,F
 And the following initial Q matrix:

            Action or Next State
States     A       B       C       D       E       F
A        -0.039  -0.028  -0.026  -0.018   0.013   0.011
B         0.028  -0.029   0.024   0.044   0.013  -0.026
C        -0.018  -0.042  -0.017  -0.002  -0.026   0.047
D         0.013   0.025  -0.004   0.043  -0.015   0.011
E        -0.018   0.011  -0.018  -0.009  -0.038  -0.033
F         0.013  -0.018   0.039  -0.026   0.011  -0.002

 Compute the matrix updates made during the iteration. Use 𝜇 (learning rate) = .6 and 𝛾 (discounting factor) = .3
Trail: A,B,A,B,D,D,B,C,F; 𝜇 = .6, 𝛾 = .3

For 𝐴 → 𝐵: 𝑠 = 𝐴, 𝑎 = 𝐵, 𝑠′ = 𝐵, 𝑟 = 0
𝑄(𝐴, 𝐵) = 𝑄(𝐴, 𝐵) + .6(0 + .3 × max_{𝑎′} 𝑄(𝐵, 𝑎′) − 𝑄(𝐴, 𝐵))
        = −0.028 + .6(.3 × 0.044 + 0.028)
        = −0.003

Updated Q matrix (only 𝑄(𝐴, 𝐵) changes):
            Action or Next State
States     A       B       C       D       E       F
A        -0.039  −0.003  -0.026  -0.018   0.013   0.011
B         0.028  -0.029   0.024   0.044   0.013  -0.026
C        -0.018  -0.042  -0.017  -0.002  -0.026   0.047
D         0.013   0.025  -0.004   0.043  -0.015   0.011
E        -0.018   0.011  -0.018  -0.009  -0.038  -0.033
F         0.013  -0.018   0.039  -0.026   0.011  -0.002

For 𝐵 → 𝐴: 𝑠 = 𝐵, 𝑎 = 𝐴, 𝑠′ = 𝐴, 𝑟 = 0 (and so on for the rest of the trail)
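The first hand-computed update can be checked in a few lines of Python; the matrix below is copied from the slide, and the zero reward for the A → B move follows the worked example:

```python
import numpy as np

A, B, C, D, E, F = range(6)
Q = np.array([[-0.039, -0.028, -0.026, -0.018,  0.013,  0.011],
              [ 0.028, -0.029,  0.024,  0.044,  0.013, -0.026],
              [-0.018, -0.042, -0.017, -0.002, -0.026,  0.047],
              [ 0.013,  0.025, -0.004,  0.043, -0.015,  0.011],
              [-0.018,  0.011, -0.018, -0.009, -0.038, -0.033],
              [ 0.013, -0.018,  0.039, -0.026,  0.011, -0.002]])
mu, gamma = 0.6, 0.3

def q_step(s, a, r, s_next):
    # off-policy TD(0) update: bootstrap from the best entry in the next row
    Q[s, a] += mu * (r + gamma * Q[s_next].max() - Q[s, a])

q_step(A, B, 0, B)          # the A -> B move worked out by hand
print(round(Q[A, B], 3))    # -0.003
```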
Q matrix after iteration # 10
[[ -0.0 0.0 -inf -0.0 0.0 0.0]
 [ -0.0 0.0 38.2 -0.0 0.0 -inf]
 [ -0.0 -0.0 -0.0 -0.0 0.0 99.8]
 [ -0.0 0.0 -0.0 -0.0 -0.0 -inf]
 [ -0.0 -0.0 -0.0 0.0 -0.0 91.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 30
[[ -0.0 16.0 -inf -0.0 -inf -inf]
 [ 4.3 0.0 40.0 -0.0 0.0 -inf]
 [ -0.0 -0.0 24.5 4.4 0.0 100.0]
 [ -0.0 15.9 -0.0 0.8 -0.0 -inf]
 [ -0.0 -0.0 -0.0 0.0 -0.0 99.8]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 100
[[ -0.0 16.0 -inf -0.0 -inf -inf]
 [ 5.8 7.7 40.0 -0.0 0.0 -inf]
 [ -0.0 -0.0 34.1 6.2 0.0 100.0]
 [ -0.0 16.0 -0.0 0.8 -0.0 -inf]
 [ -0.0 -0.0 -0.0 0.0 -0.0 100.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 500
[[ 1.4 16.0 -inf -0.0 -inf -inf]
 [ 6.4 11.0 40.0 15.7 0.0 -inf]
 [ -0.0 16.0 35.0 15.9 0.0 100.0]
 [ -0.0 16.0 36.4 8.0 40.0 -inf]
 [ -0.0 -0.0 -0.0 15.6 34.1 100.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 1000
[[ 1.4 16.0 -inf -0.0 -inf -inf]
 [ 6.4 11.0 40.0 16.0 0.0 -inf]
 [ -0.0 16.0 35.0 16.0 0.0 100.0]
 [ -0.0 16.0 39.7 10.9 40.0 -inf]
 [ -0.0 -0.0 -0.0 16.0 35.0 100.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]

Sample run of 1000 iterations for the Q-learning (TD-0) algorithm
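A rough sketch of the kind of loop that could produce such a run. This is not the lecture's actual code: the ε value, the start-state rule, the Q initialization and the stopping condition are assumptions, and it uses 𝛾 = .3 as in the worked example, so the exact numbers will differ from the tables above; only the overall structure (high values funnelling towards F, impossible moves staying at -inf) is the same:

```python
import numpy as np

INF = np.inf
R = np.array([[ -5.,   0., -INF, -INF, -INF, -INF],   # getting-lost reward matrix, states A..F
              [  0.,  -5.,   0.,   0., -INF, -INF],
              [-INF,   0.,  -5.,   0., -INF, 100.],
              [-INF,   0.,   0.,  -5.,   0., -INF],
              [-INF, -INF, -INF,   0.,  -5., 100.],
              [-INF, -INF,   0., -INF, -INF,   0.]])
n, goal = R.shape[0], 5                      # F is the absorbing state
Q = np.where(np.isfinite(R), 0.0, -INF)      # impossible moves stay at -inf
mu, gamma, eps = 0.6, 0.3, 0.1
rng = np.random.default_rng(0)

for episode in range(1000):
    s = int(rng.integers(n - 1))             # random non-goal start state
    while s != goal:
        valid = np.flatnonzero(np.isfinite(R[s]))    # moves allowed from s
        if rng.random() < eps:                       # epsilon-greedy policy
            a = int(rng.choice(valid))
        else:
            a = int(valid[np.argmax(Q[s, valid])])
        s_next = a                                   # here an action IS the next state
        Q[s, a] += mu * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q, 1))
```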
SARSA Learning Example
 For the getting lost example, consider the following trail of steps taken by the agent: D,B,B,B,B,C,C,C,F,C
 And the following initial Q matrix:

            Action or Next State
States     A       B       C       D       E       F
A        -0.039  -0.028  -0.018   0.011  -0.026   0.011
B         0.028  -0.029   0.024   0.044  -0.018  -0.026
C         0.011  -0.042  -0.017  -0.002  -0.026   0.047
D        -0.018   0.025  -0.004   0.043  -0.015  -0.018
E         0.011  -0.018   0.011  -0.009  -0.038  -0.033
F        -0.018  -0.026   0.039  -0.026   0.011  -0.002

 Compute the matrix updates made during the iteration. Use 𝜇 (learning rate) = .6 and 𝛾 (discounting factor) = .3
Trail: D,B,B,B,B,C,C,C,F,C; 𝜇 = .6, 𝛾 = .3

For 𝐷 → 𝐵: 𝑠 = 𝐷, 𝑎 = 𝐵, 𝑠′ = 𝐵, 𝑟 = 0, 𝑎′ = 𝐵
𝑄(𝐷, 𝐵) = 𝑄(𝐷, 𝐵) + .6(0 + .3 × 𝑄(𝐵, 𝐵) − 𝑄(𝐷, 𝐵))
        = 0.025 + .6(.3 × (−0.029) − 0.025)
        = 0.005

Updated Q matrix (only 𝑄(𝐷, 𝐵) changes):
            Action or Next State
States     A       B       C       D       E       F
A        -0.039  -0.028  -0.018   0.011  -0.026   0.011
B         0.028  -0.029   0.024   0.044  -0.018  -0.026
C         0.011  -0.042  -0.017  -0.002  -0.026   0.047
D        -0.018   0.005  -0.004   0.043  -0.015  -0.018
E         0.011  -0.018   0.011  -0.009  -0.038  -0.033
F        -0.018  -0.026   0.039  -0.026   0.011  -0.002

For 𝐵 → 𝐵: 𝑠 = 𝐵, 𝑎 = 𝐵, 𝑠′ = 𝐵, 𝑟 = −5, 𝑎′ = 𝐵 (and so on for the rest of the trail)
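As before, the first SARSA update can be replayed in a few lines of Python (initial Q copied from the table above; the zero reward for D → B follows the slide):

```python
import numpy as np

A, B, C, D, E, F = range(6)
Q = np.array([[-0.039, -0.028, -0.018,  0.011, -0.026,  0.011],
              [ 0.028, -0.029,  0.024,  0.044, -0.018, -0.026],
              [ 0.011, -0.042, -0.017, -0.002, -0.026,  0.047],
              [-0.018,  0.025, -0.004,  0.043, -0.015, -0.018],
              [ 0.011, -0.018,  0.011, -0.009, -0.038, -0.033],
              [-0.018, -0.026,  0.039, -0.026,  0.011, -0.002]])
mu, gamma = 0.6, 0.3

def sarsa_step(s, a, r, s_next, a_next):
    # on-policy TD(0) update: bootstrap from the next action actually taken
    Q[s, a] += mu * (r + gamma * Q[s_next, a_next] - Q[s, a])

sarsa_step(D, B, 0, B, B)   # the D -> B move, with the next action also B
print(round(Q[D, B], 3))    # 0.005
```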
Q matrix after iteration # 10
[[ -4.5 0.0 -0.0 -0.0 -0.0 -0.0]
 [ -0.0 0.0 38.4 0.0 -0.0 -0.0]
 [ -0.0 -0.0 -0.0 -0.0 0.0 99.9]
 [ -0.0 -0.0 -0.0 0.0 -0.0 0.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 97.3]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 30
[[ -4.5 4.3 -0.0 -0.0 -0.0 -0.0]
 [ -0.0 7.7 40.0 0.0 -0.0 -0.0]
 [ -0.0 -0.0 -0.0 -0.0 0.0 100.0]
 [ -0.0 -0.0 36.4 -4.5 -0.0 -inf]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 99.9]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 100
[[ -4.5 16.0 -0.0 -0.0 -0.0 -0.0]
 [ -0.0 7.7 40.0 0.0 -0.0 -0.0]
 [ -0.0 11.2 31.8 6.6 0.0 100.0]
 [ -0.0 -0.0 40.0 -4.5 -0.0 -inf]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 100.0]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 500
[[ 0.3 16.0 -0.0 -0.0 -0.0 -0.0]
 [ 3.0 10.4 40.0 14.0 -0.0 -0.0]
 [ -0.0 6.6 35.0 12.9 0.0 100.0]
 [ -0.0 8.7 21.2 -4.5 40.0 -inf]
 [ -0.0 -0.0 -0.0 15.8 -0.0 100.0]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 1000
[[ 1.4 16.0 -0.0 -0.0 -0.0 -0.0]
 [ 5.4 4.4 40.0 7.4 -0.0 -0.0]
 [ -0.0 9.1 35.0 11.9 0.0 100.0]
 [ -0.0 7.3 14.5 10.9 38.2 -inf]
 [ -0.0 -0.0 -0.0 14.0 32.8 100.0]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]

Sample run of 1000 iterations for SARSA algorithm


SARSA vs. Q

Every move gets a reward of −1
Moves that end up on the cliff get a reward of −100
The SARSA solution is far from optimal, but it is safe. The Q-learning solution is optimal, but occasionally the random search will tip it over the cliff.
 SARSA vs. Q
 It is our choice, depending upon what we want
 It depends upon how serious a problem falling off the cliff is.
 Getting lost example… again!
Reward matrix used:
[[ -5.0 0.0 -inf -inf -inf -inf]
[ 0.0 -5.0 0.0 0.0 -inf -inf]
[ -inf 0.0 -5.0 0.0 -inf 100.0]
[ -inf 0.0 0.0 -5.0 0.0 -inf]
[ -inf -inf -inf 0.0 -5.0 100.0]
[ -inf -inf 0.0 -inf -inf 0.0]]
Q matrix after iteration # 10
[[ -0.0 0.0 -inf -0.0 0.0 0.0]
 [ -0.0 0.0 38.2 -0.0 0.0 -inf]
 [ -0.0 -0.0 -0.0 -0.0 0.0 99.8]
 [ -0.0 0.0 -0.0 -0.0 -0.0 -inf]
 [ -0.0 -0.0 -0.0 0.0 -0.0 91.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 30
[[ -0.0 16.0 -inf -0.0 -inf -inf]
 [ 4.3 0.0 40.0 -0.0 0.0 -inf]
 [ -0.0 -0.0 24.5 4.4 0.0 100.0]
 [ -0.0 15.9 -0.0 0.8 -0.0 -inf]
 [ -0.0 -0.0 -0.0 0.0 -0.0 99.8]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 100
[[ -0.0 16.0 -inf -0.0 -inf -inf]
 [ 5.8 7.7 40.0 -0.0 0.0 -inf]
 [ -0.0 -0.0 34.1 6.2 0.0 100.0]
 [ -0.0 16.0 -0.0 0.8 -0.0 -inf]
 [ -0.0 -0.0 -0.0 0.0 -0.0 100.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 500
[[ 1.4 16.0 -inf -0.0 -inf -inf]
 [ 6.4 11.0 40.0 15.7 0.0 -inf]
 [ -0.0 16.0 35.0 15.9 0.0 100.0]
 [ -0.0 16.0 36.4 8.0 40.0 -inf]
 [ -0.0 -0.0 -0.0 15.6 34.1 100.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]
Q matrix after iteration # 1000
[[ 1.4 16.0 -inf -0.0 -inf -inf]
 [ 6.4 11.0 40.0 16.0 0.0 -inf]
 [ -0.0 16.0 35.0 16.0 0.0 100.0]
 [ -0.0 16.0 39.7 10.9 40.0 -inf]
 [ -0.0 -0.0 -0.0 16.0 35.0 100.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 0.0]]

Q-learning (TD-0) algorithm – note the number of -inf


Q matrix after iteration # 10
[[ -4.5 0.0 -0.0 -0.0 -0.0 -0.0]
 [ -0.0 0.0 38.4 0.0 -0.0 -0.0]
 [ -0.0 -0.0 -0.0 -0.0 0.0 99.9]
 [ -0.0 -0.0 -0.0 0.0 -0.0 0.0]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 97.3]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 30
[[ -4.5 4.3 -0.0 -0.0 -0.0 -0.0]
 [ -0.0 7.7 40.0 0.0 -0.0 -0.0]
 [ -0.0 -0.0 -0.0 -0.0 0.0 100.0]
 [ -0.0 -0.0 36.4 -4.5 -0.0 -inf]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 99.9]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 100
[[ -4.5 16.0 -0.0 -0.0 -0.0 -0.0]
 [ -0.0 7.7 40.0 0.0 -0.0 -0.0]
 [ -0.0 11.2 31.8 6.6 0.0 100.0]
 [ -0.0 -0.0 40.0 -4.5 -0.0 -inf]
 [ -0.0 -0.0 -0.0 -0.0 -0.0 100.0]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 500
[[ 0.3 16.0 -0.0 -0.0 -0.0 -0.0]
 [ 3.0 10.4 40.0 14.0 -0.0 -0.0]
 [ -0.0 6.6 35.0 12.9 0.0 100.0]
 [ -0.0 8.7 21.2 -4.5 40.0 -inf]
 [ -0.0 -0.0 -0.0 15.8 -0.0 100.0]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]
Q matrix after iteration # 1000
[[ 1.4 16.0 -0.0 -0.0 -0.0 -0.0]
 [ 5.4 4.4 40.0 7.4 -0.0 -0.0]
 [ -0.0 9.1 35.0 11.9 0.0 100.0]
 [ -0.0 7.3 14.5 10.9 38.2 -inf]
 [ -0.0 -0.0 -0.0 14.0 32.8 100.0]
 [ -0.0 -0.0 0.0 -0.0 0.0 0.0]]

SARSA algorithm – note the number of -inf


USES OF REINFORCEMENT LEARNING
 The most popular use is in intelligent robotics
 The robot can be left to attempt to solve the task without human intervention.
 Also popular for learning how to play games
Modelling Sample Problems
 To model sample problems, we need to model
 the 𝑄 matrix (state-action matrix)
 the reward

 Remaining things, like the choice of policy, choice of algorithm, etc., are simply choices whose options are well established.
 They don't need to be modelled.
Wading Through Maze
 Taken from https://towardsdatascience.com/the-other-type-of-machine-learning-97ab81306ce9
 Also see: https://samyzaf.com/ML/rl/qmaze.html
Wading Through Maze - II

Q-Table
Credit: https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
Tic-Tac-Toe
