Week 4: Assignment 4
Your last recorded submission was on 2025-08-19, 23:22 IST. Due date: 2025-08-20, 23:59 IST.
1) State True/False (1 point)
The state transition graph for any MDP is a directed acyclic graph.
True
False
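As a small illustration of what a state transition graph is, here is a minimal sketch with a made-up two-state MDP (the states and edges are purely illustrative, not from the course) and a check for the simplest kinds of cycles in its graph:

```python
# Minimal sketch: the state transition graph of a toy two-state MDP,
# represented as directed edges (s, s') with nonzero transition probability.
# The MDP itself is hypothetical, chosen only to illustrate the graph structure.
edges = {
    ("s0", "s1"),  # some action in s0 can lead to s1
    ("s1", "s0"),  # some action in s1 can lead back to s0
    ("s1", "s1"),  # a self-loop: s1 can transition to itself
}

def has_simple_cycle(edge_set):
    """Detect two-cycles or self-loops in a directed graph given as an edge set."""
    return any((b, a) in edge_set for (a, b) in edge_set) or any(a == b for (a, b) in edge_set)

print(has_simple_cycle(edges))
```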
2) Consider the following statements: (1 point)
(i) The optimal policy of an MDP is unique.
(ii) We can determine an optimal policy for an MDP using only the optimal value function ($v^*$), without accessing the MDP parameters.
(iii) We can determine an optimal policy for a given MDP using only the optimal q-value function ($q^*$), without accessing the MDP parameters.
Which of these statements are false?
Only (ii)
Only (iii)
Only (i), (ii)
Only (i), (iii)
Only (ii), (iii)
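Relating to Q2, here is a minimal sketch (with made-up numbers; `P`, `r`, and `q_star` are hypothetical toy arrays) of how a greedy policy is typically read off from $q^*$ versus from $v^*$, showing which quantities each extraction actually touches:

```python
import numpy as np

# Hypothetical toy MDP quantities, for illustration only.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])       # r[s, a] expected immediate rewards
q_star = np.array([[5.0, 4.0], [3.0, 6.0]])  # assumed optimal q-values (made up)
v_star = q_star.max(axis=1)                  # a consistent v* for this sketch

# From q*: the greedy action needs only q* itself.
policy_from_q = q_star.argmax(axis=1)

# From v*: the one-step lookahead also needs the model (P and r).
policy_from_v = (r + gamma * P @ v_star).argmax(axis=1)

print(policy_from_q, policy_from_v)
```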
3) Which of the following statements are true for a finite MDP? (Select all that apply.) (1 point)
The Bellman equation of a value function of a finite MDP defines a contraction in Banach space (using the max norm).
If $0 \le \gamma < 1$, then the eigenvalues of $\gamma P_\pi$ are less than $1$.
We call a normed vector space 'complete' if Cauchy sequences exist in that vector space.
The sequence defined by $v_n = r_\pi + \gamma P_\pi v_{n-1}$ is a Cauchy sequence in Banach space (using the max norm).
($P_\pi$ is a stochastic matrix)
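Relating to Q3, here is a minimal numerical sketch (using a randomly generated row-stochastic matrix and reward vector as stand-ins for $P_\pi$ and $r_\pi$) of iterating $v_n = r_\pi + \gamma P_\pi v_{n-1}$ and watching successive max-norm gaps shrink, which is the behaviour a contraction in the max norm would produce:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9

# Random row-stochastic P_pi and reward vector r_pi (hypothetical, for illustration).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)

v = np.zeros(n)
prev_gap = None
for _ in range(10):
    v_next = r + gamma * P @ v            # one application of the Bellman operator for pi
    gap = np.max(np.abs(v_next - v))      # max-norm distance between successive iterates
    if prev_gap is not None:
        print(f"gap ratio ~ {gap / prev_gap:.3f} (compare with gamma = {gamma})")
    prev_gap, v = gap, v_next
```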
4) Which of the following is a benefit of using RL algorithms for solving MDPs? (1 point)
They do not require the state of the agent for solving an MDP.
They do not require the action taken by the agent for solving an MDP.
They do not require the state transition probability matrix for solving an MDP.
They do not require the reward signal for solving an MDP.
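Relating to Q4, here is a minimal sketch of a single tabular Q-learning-style update (the step size, discount factor, and the observed transition are made-up values), shown only to illustrate which quantities a typical model-free RL update actually uses: the observed state, action, reward, and next state, but not the transition probability matrix:

```python
import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))   # tabular action-value estimates
alpha, gamma = 0.1, 0.9               # step size and discount factor (assumed values)

# One observed transition (s, a, r, s'); in a real run this comes from the environment.
s, a, r, s_next = 0, 1, 1.0, 2

# The update uses only the sampled experience, not p(s'|s, a).
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
print(Q)
```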
5) Consider the following equations: (1 point)
(i) $v_\pi(s) = \mathbb{E}_\pi\left[\sum_{i=t}^{\infty} \gamma^{i-t} R_{i+1} \mid S_t = s\right]$
(ii) $q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$
(iii) $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
Which of the above are correct?
Only (i)
Only (i), (ii)
Only (ii), (iii)
Only (i), (iii)
(i), (ii), (iii)
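For reference when reading the equations in Q5, these are the standard textbook definitions of the state-value and action-value functions under a policy, stated in the usual notation (a reminder of known facts, not part of the question):

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{i=t}^{\infty} \gamma^{i-t} R_{i+1} \,\middle|\, S_t = s\right], \qquad q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s'), \qquad v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$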
6) What is true about the $\gamma$ (discount factor) in reinforcement learning? (1 point)
Discount factor can be any real number
The value of $\gamma$ cannot affect the optimal policy
The lower the value of $\gamma$, the more myopic the agent gets, i.e., the agent maximises rewards that it receives over a shorter horizon
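Relating to Q6, here is a tiny numerical sketch (with a made-up reward sequence) of how the discount factor weights future rewards; with a small $\gamma$, almost all of the discounted return comes from the first few steps:

```python
# Made-up reward sequence: a small immediate reward, then a large delayed one.
rewards = [1, 0, 0, 0, 10]

def discounted_return(rewards, gamma):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

for gamma in (0.1, 0.5, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 3))
# Small gamma: the return is dominated by the immediate reward (myopic behaviour);
# gamma close to 1: the delayed reward of 10 dominates.
```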
7) Consider the following statements for a finite MDP ($I$ is an identity matrix with dimensions $|S| \times |S|$, where $S$ is the set of all states, and $P_\pi$ is a stochastic matrix): (1 point)
(i) An MDP with stochastic rewards may not have a deterministic optimal policy.
(ii) There can be multiple optimal stochastic policies.
(iii) If $0 \le \gamma < 1$, then the rank of the matrix $I - \gamma P_\pi$ is equal to $|S|$.
(iv) If $0 \le \gamma < 1$, then the rank of the matrix $I - \gamma P_\pi$ is less than $|S|$.
Which of the above statements are true?
Only (ii), (iii)
Only (ii), (iv)
Only (i), (iii)
Only (i), (ii), (iii)
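Relating to statements (iii) and (iv) of Q7, here is a minimal numerical sketch that checks the rank of $I - \gamma P_\pi$ for a randomly generated row-stochastic matrix with $0 \le \gamma < 1$ (a single random instance, illustrative only, not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 5, 0.9

# Random row-stochastic matrix standing in for P_pi (hypothetical).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

M = np.eye(n) - gamma * P
print(np.linalg.matrix_rank(M), "of", n)   # numerical rank of I - gamma * P_pi
```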
8) Consider an MDP with 3 states $A, B, C$. At each state we can go to either of the other two states, i.e., if we are in state $A$ then we can perform 2 actions: going to state $B$ or $C$. The rewards for each transition are $r(A, B) = -3$ (reward if we go from $A$ to $B$), $r(B, A) = -1$, $r(B, C) = 8$, $r(C, B) = 4$, $r(A, C) = 0$, $r(C, A) = 5$; the discount factor is $0.9$. Find the fixed point of the value function for the policy $\pi(A) = B$ (if we are in state $A$ we choose the action to go to $B$), $\pi(B) = C$, $\pi(C) = A$. $v_\pi([A, B, C]) = ?$ (round to 1 decimal place) (1 point)
[20.6, 21.8, 17.6]
[30.4, 44.2, 32.4]
[30.4, 37.2, 32.4]
[21.6, 21.8, 17.6]
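Relating to Q8, here is a minimal sketch of the standard way to obtain the fixed point of policy evaluation: solve $v_\pi = r_\pi + \gamma P_\pi v_\pi$, i.e. $(I - \gamma P_\pi) v_\pi = r_\pi$, with the deterministic policy from the question encoded as a transition matrix:

```python
import numpy as np

gamma = 0.9
states = ["A", "B", "C"]

# Under the deterministic policy pi(A)=B, pi(B)=C, pi(C)=A, each state moves to
# exactly one next state, so P_pi is a permutation (stochastic) matrix.
P_pi = np.array([[0, 1, 0],    # A -> B
                 [0, 0, 1],    # B -> C
                 [1, 0, 0]])   # C -> A
r_pi = np.array([-3.0, 8.0, 5.0])   # r(A,B), r(B,C), r(C,A) under this policy

# Fixed point of v = r_pi + gamma * P_pi v, via the linear system (I - gamma P_pi) v = r_pi.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(dict(zip(states, np.round(v, 1))))
```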
9) Which of the following is not a valid norm function? ($x$ is a $D$-dimensional vector) (1 point)
$\max_{d \in \{1, \ldots, D\}} |x_d|$
$\sqrt{\sum_{d=1}^{D} x_d^2}$
$\min_{d \in \{1, \ldots, D\}} |x_d|$
$\sum_{d=1}^{D} |x_d|$
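Relating to Q9, here is a minimal sketch that evaluates the four candidate functions on a made-up test vector and checks the positive-definiteness axiom ($\|x\| = 0$ only if $x = 0$); the other norm axioms could be checked the same way:

```python
import numpy as np

candidates = {
    "max |x_d|": lambda x: np.max(np.abs(x)),
    "sqrt(sum x_d^2)": lambda x: np.sqrt(np.sum(x ** 2)),
    "min |x_d|": lambda x: np.min(np.abs(x)),
    "sum |x_d|": lambda x: np.sum(np.abs(x)),
}

x = np.array([1.0, 0.0, 0.0])   # a nonzero test vector with a zero coordinate
for name, f in candidates.items():
    # A norm must be strictly positive on any nonzero vector.
    print(f"{name}: value on x = {f(x)}, positive on this nonzero x: {f(x) > 0}")
```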
10) For an operator $L$, which of the following properties must be satisfied by $x$ for it to be a fixed point of $L$? (Multi-Correct) (1 point)
$Lx = x$
$L^2 x = x$
$\forall \lambda > 0,\ Lx = \lambda x$
None of the above
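Relating to Q10, here is a tiny sketch with a concrete affine operator (an arbitrary illustrative choice, not from the course) showing that at a point $x$ with $Lx = x$, applying the operator twice also returns $x$:

```python
import numpy as np

# An arbitrary affine operator L(x) = A x + b with a unique fixed point (illustrative choice).
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])
b = np.array([1.0, 2.0])

def L(x):
    return A @ x + b

# Its fixed point solves x = A x + b, i.e. (I - A) x = b.
x_star = np.linalg.solve(np.eye(2) - A, b)

print(np.allclose(L(x_star), x_star))        # L x = x
print(np.allclose(L(L(x_star)), x_star))     # hence L^2 x = x as well
```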
You may submit any number of times before the due date. The final submission will be considered for grading.