2021/09/30
Reinforcement learning:
Basic concepts and Applications in networking
                Nguyen Phi Le
Agenda
◦ Definition and basic terms
◦ Value-based RL
 ◦ Tabular RL
 ◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
                               2
Agenda
◦ Definition and basic terms
◦ Value-based RL
 ◦ Tabular RL
 ◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
                               3
Example
          4
 Interactive learning
                                                                                 Sensor node
                                                                        Base station
  Multi-Armed Bandit                   Traffic light control
                                                                    Wireless sensor networks
ü The learner is not told which actions to take à trial-and-error
ü Action may affect also future situation à delayed reward
                                                                                         5
Definition
◦ Reinforcement learning
 ◦ Learning what to do—how to map situations to actions— to achieve the goal
◦ Reward hypothesis
 ◦ That all of what we mean by goals and purposes can be well thought of as the
   maximization of the expected value of the cumulative sum of a received scalar
   signal (called reward) (Richard S. Sutton, RL: An introduction, 2018)
                                                                                   6
 Goal and reward
 ◦ Reward: a scalar signal received from the environment at each step
 ◦ Goal: maximizing cumulative reward in the long run                               Sensor node
   Multi-Armed Bandit             Traffic light control
                                                                            Base station
Goal = maximizing total gain   Goal = decreasing traffic jam   Goal = increasing network lifetime
Reward = gain at every turn    Reward = waiting time,          Reward = load balancing,
                               queue length, lane speed, …     route length, …
                                                                                                  7
RL vs other learning techniques
◦ RL vs supervised learning
 ◦ Supervised learning: learning from a training set of labeled examples
   ◦ Know the true action to take
 ◦ Reinforcement learning: do not know the optimal action
◦ RL vs unsupervised learning
 ◦ Unsupervised learning: finding structure hidden in collections of unlabeled data
 ◦ Reinforcement learning: maximizing the reward signal
                                                                                      8
RL framework
◦ Policy: a mapping from states to actions
    ◦ may be stochastic, specifying probabilities for each action
◦ Reward signal: the goal of a reinforcement learning problem
    ◦ objective is to maximize the total reward the agent receives over the long run
◦ Value: total amount of accumulative reward over the future, starting from
  that state
    ◦ Reward: immediate; Value: long-run
o   Action: can be any decisions we want to learn how to make
o   State: can be anything we can know that might be useful in making them
                            Agent                   Reward signal
                                                     Action
                              State      Policy                       Environement
                                                    New state
                                                                                       9
Agenda
◦ Definition and basic terms
◦ Value-based RL
 ◦ Tabular RL
 ◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
                               10
Value-based RL
◦ RL’s objective: making decision
 ◦ Input: state
 ◦ Output: action
                                                                          Reward signal
◦ How to decide an action?                          Agent
                                                                 value
 ◦ Using action selection policy                               function
   ◦ Based on action/state value
      ◦ How to measure the values?                                         Action
                                                       State    Policy
         ◦ Value functions: estimate the goodness
           of states/actions based on experience
                                                                          New state
   ◦ Exploration-exploitation strategy                                                Environement
                                                                                            11
Value functions
◦ Key point of value-based RL models
◦ Questions?
 ◦ What can be used to measure the goodness of an action, state?
 ◦ How to measure?
           Maximize: long-term accumulative reward
           𝐺! = 𝑅!"# + 𝛾𝑅!"$ + ⋯ + 𝛾 % 𝑅!"%&# + … = ∑*
                                                     '() 𝛾 '
                                                             𝑅!"'"#
                  Reward received after             Reward received if following policy 𝜋
    Goal
                  performing action 𝑎 at timing 𝑡   from timing 𝑡 + 1
                                                                                       12
Value functions
◦ State value function: how good it is for the agent to be in a given state
  ◦ Expected return when starting in s and following thereafter policy 𝜋
◦ Action value function: how good it is to perform a given action in a
  given state
  ◦ Expected return starting from s, taking the action a, and thereafter following policy
    𝜋
 State value function for policy 𝜋
𝑣& 𝑠 ÷ 𝔼& 𝐺' 𝑆' = 𝑠 = 𝔼& ∑+    (
                          ()* 𝛾 𝑅',(,- 𝑆' = 𝑠
 Action value function for policy 𝜋
𝑞& 𝑠, 𝑎 ÷ 𝔼& 𝐺' 𝑆' = 𝑠, 𝐴' = 𝑎 = 𝔼& ∑+    (
                                     ()* 𝛾 𝑅',(,- 𝑆' = 𝑠, 𝐴' = 𝑎
                                                                                        13
Action value estimation
◦ 𝑞& 𝑠, 𝑎 ÷ 𝔼& 𝐺' 𝑆' = 𝑠, 𝐴' = 𝑎 = 𝔼& ∑+
                                       ()* 𝛾 (𝑅
                                               ',(,- 𝑆' = 𝑠, 𝐴' = 𝑎
◦ How to estimate 𝑞& 𝑠, 𝑎 ?
◦ 𝔼& ∑+()* 𝛾 (𝑅
               ',(,- 𝑆' = 𝑠, 𝐴' = 𝑎 = expected value à need to perform
  the same action at the same state for infinite times à impossible
◦ Solution
  ◦ Replacing expected value by sample-based estimation
    ◦ Estimating value of 𝔼! ∑%    "
                              "#$ 𝛾 𝑅&'"'( 𝑆& = 𝑠, 𝐴& = 𝑎 using a few samples
  ◦ Estimation method: iterative estimation
    ◦ Initialize 𝑞! 𝑠, 𝑎
    ◦ Update 𝑞! 𝑠, 𝑎 after performing one action and receiving a reward
       ◦ Current 𝑞! 𝑠, 𝑎
       ◦ Estimate of 𝔼! ∑%    "
                         "#$ 𝛾 𝑅&'"'( 𝑆& = 𝑠, 𝐴& = 𝑎
    ◦ New 𝑞! 𝑠, 𝑎 ← 1 − 𝛼 × old 𝑞! 𝑠, 𝑎 + 𝛼𝑞<! (𝑠, 𝑎)
                                                                                14
  Action value estimation
    ◦ How to estimate action value?
    ◦ Bellman equation
     𝑞, 𝑠, 𝑎 = 𝔼, 𝐺! 𝑆! = 𝑠, 𝐴! = 𝑎
                  = 𝔼 , ∑*
                         '() 𝛾 '
                                 𝑅!"'"# 𝑆! = 𝑠, 𝐴! = 𝑎
                  = 𝔼, 𝑅!"# + ∑*
                               '(# 𝛾 '
                                       𝑅!"'"# 𝑆! = 𝑠, 𝐴! = 𝑎
                  = 𝔼, 𝑅!"# + 𝛾 ∑*
                                 '() 𝛾 '
                                         𝑅!"'"$ 𝑆! = 𝑠, 𝐴! = 𝑎
Action value of   = 𝔼, 𝑅!"# + 𝛾𝑞, 𝑆!"# , 𝑎′ 𝑆! = 𝑠, 𝐴! = 𝑎
     (𝑠, 𝑎)
                           Immediate reward after
           expectation                                      Action value at the next step
                           performing action a at state s
                                                                                            15
Action value estimation
 ◦ Bellman equation
 ◦ 𝑞& 𝑠, 𝑎 = 𝔼& 𝑅',- + 𝛾𝑞& 𝑆',-, 𝑎′ 𝑆' = 𝑠, 𝐴' = 𝑎
                      Object to be estimated 𝑞?! (𝑠, 𝑎)
 New 𝑞& 𝑠, 𝑎 = (1 − 𝛼) current 𝑞& 𝑠, 𝑎 + 𝛼×3
                                           𝑞& (𝑠, 𝑎)
   𝑄 𝑠, 𝑎 = 1 − 𝛼 𝑄 𝑠, 𝑎 + 𝛼 𝑟 + 𝛾𝑄 𝑠 - , 𝑎-              SARSA
                                                                  16
    Action value estimation
       Bellman equation
      𝑞, 𝑠, 𝑎 = 𝔼, 𝑅!"# + 𝛾𝑞, 𝑆!"# , 𝑎′ 𝑆! = 𝑠, 𝐴! = 𝑎
      𝑞,∗ 𝑠, 𝑎 = 𝔼,∗ 𝑅!"# + 𝛾𝑞,∗ 𝑆!"# , 𝑎′ 𝑆! = 𝑠, 𝐴! = 𝑎
                = 𝔼,∗ 𝑅!"# + 𝛾 max 𝑞, 𝑆!"# , 𝑎′ |𝑆! = 𝑠, 𝐴! = 𝑎
                                     ,
                                        - -
estimation: 𝑞:
             , ∗ (𝑠, 𝑎) = 𝑟 + 𝛾 max 𝑄 𝑠  ,𝑎
                                .-
           𝑄 𝑠, 𝑎 = 1 − 𝛼 𝑄 𝑠, 𝑎 + 𝛼 𝑟 + 𝛾 max 𝑄 𝑠 - , 𝑎-         Q-learning
                                                .-
           𝑄 𝑠, 𝑎      1 − 𝛼 𝑄 𝑠, 𝑎 + 𝛼 𝑟 + 𝛾𝑄 𝑠 - , 𝑎-     SARSA
                                                                         17
State value estimation
Bellman equation for state-value function
𝑣, 𝑠 = 𝔼, 𝐺! 𝑆! = 𝑠
     = 𝔼 , ∑*
            '() 𝛾 '
                    𝑅!"'"# 𝑆! = 𝑠
                                             Monte Carlo
     = 𝔼, 𝑅!"# + 𝛾 ∑*
                    '() 𝛾 '
                            𝑅!"'"$ 𝑆! = 𝑠
     = 𝔼, 𝑅!"# + 𝛾𝑣, (𝑆!"# ) 𝑆! = 𝑠
New 𝑣, 𝑠 = (1 − 𝛼) current 𝑣, 𝑠 + 𝛼×𝐸
                                                    Temporal
                                                    Difference
    𝑉 𝑆! = 1 − 𝛼 𝑉 𝑆! + 𝛼𝐺!                         (TD)
    𝑉 𝑆! = 1 − 𝛼 𝑉 𝑆! + 𝛼(𝑅!"# +𝛾𝑉(𝑆!"# ))
                                                                 18
Action selection policy
◦ Based on value functions
◦ Exploitattion – exploration strategy
 ◦ Choose (𝑠, 𝑎) whose 𝑄 𝑠, 𝑎 is the greatestß exploitation
   ◦ Problem: local maximum à need to exploit new actions ß exploration
 ◦ Some exploration methods
   ◦ 𝜖-greedy
      ◦ Time adaptive Epsilon (annealing 𝜖)
      ◦ 𝜖-soft
      ◦ Value adaptive Epsilon
      ◦ Sigmoid Epsilon
   ◦ Optiministic initialization
   ◦ Upper confidence bound (UCB)
                                                                          19
𝜖-greedy
◦ Define a small number 𝜖 (𝜖 = 0.01)
◦ Each time the agent needs to choose an action
 ◦ Generates a prob 𝑝 ∈ (0,1)
         𝑎𝑟𝑔𝑚𝑎𝑥@ 𝑞 𝑠, 𝑎 , 𝑖𝑓 𝑝 ≥ 𝜖
 ◦ 𝐴' = 7
          𝑟𝑎𝑛𝑑𝑜𝑚 𝑎𝑐𝑡𝑖𝑜𝑛, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
                                     E-greedy with e = 0.1 in theory
           Best Action
               Others
                         0   1   2        3        4        5        6      7   8   9   10
                                         Number of time actions are taken
                                                                                             20
𝜖-greedy multi-armed bandit
◦ Action space: 𝐴 = 𝑎! , … , 𝑎"                          Multi-Armed Bandit
◦ State space: one state
◦ Reward: the gain recevied
◦ Value function
             "#$ %& '()*'+" ),(- * !*.(- /'0%' !% !
  ◦ 𝑄! 𝑎 ÷
               -#$1(' %& !0$(" * !*.(- /'0%' !% !
◦ Exploration strategy                                Goal = maximizing total gain
  ◦ 𝜖-greedy                                          Reward = gain at every turn
                                                                               21
Exploration vs Exploitation
        The higher the 𝜖, the more exploration
        à The more unstable the average reward
                                                 22
Exploration vs Exploitation
   The frequency of selecting optimal action is inversely proportional with 𝜖
                                                                                23
Optiministic initialization
◦ Initialize 𝑄) 𝑎 = 𝑋 ∀𝑎 (𝑋 is a significantly large number)
            BCD EF GHI@GJB IKHL @ '@(HL MGNEG 'E '
 ◦ 𝑄' 𝑎   ÷
              LCDOHG EF 'NDHB @ '@(HL MGNEG 'E '
◦ Action selection strategy
 ◦ At every step 𝑡: choose 𝐴' = 𝑎𝑟𝑔𝑚𝑎𝑥@ 𝑄' 𝑎
 ◦ Guarantee that all actions are chosen at least one time
              "! # $%"
   ◦ 𝑄! 𝑎 =              < 𝑄' 𝑎
                  &
                                                               24
Optiministic initialization
◦ Highly exploration at first
  ◦ But, totally greedy afterwards
◦ Appropriate value of X is needed
                                     25
Upper confidence bound (UCB)
◦ Actions having been selected frequently à should be explored less
◦ Confidence bound is defined by the number of time the action not
  being selected
◦ Priotirizing actions with
  ◦ High estimated action value à exploitation
  ◦ High confidence bound à exploration
                       Action 1
                       Action 2
                       Action 3
                       Action 4
                       Action 5
                                  0          2           4           6    8
                                      Estimate Value   Confidence Bound
                                                                              26
Upper confidence bound (UCB)
◦ Mathematical expression
 ◦ The number of times an action 𝑎 has been taken: 𝑁' (𝑎)
 ◦ The total number of time steps so far: 𝑁'
 ◦ Confidence bound
                  )#
   ◦ 𝐶𝐵( 𝑎 = 𝑐            (𝑐: a constant)
                 )# (#)
 ◦ Action selection policy
                                                      )#
   ◦ At every step 𝑡: choose 𝐴( = 𝑎𝑟𝑔𝑚𝑎𝑥# 𝑄( 𝑎 + 𝑐
                                                     )# (#)
   ◦ Intuition
      ◦ If an action 𝑎 has not been selected for a long time à 𝑁( (𝑎) decreases à
        𝐶𝐵( 𝑎 increase à it will be prioritized next times
                                                                                    27
Upper confidence bound (UCB)
                               28
Other exploration strategies
◦ Time adaptive epsilon (annealing 𝜖)
                   -
    ◦𝜖=
             PQR('NDH,S)
    ◦ 𝑐: a smalle positive parameter;
    ◦ 𝑡𝑖𝑚𝑒: the total times performing actions à the larger the time, the
      smaller the 𝜖 à the more experience, the less the exploration
◦ 𝜖-soft
    ◦ Guarantee that the probabilities for choosing any action is at least
      T
     |V2 |
    ◦ 𝐴B : action space
                                                                             29
Other exploration strategies
◦ Value adaptive Epsilon
   ◦ Adjust exploration probability based on the variation of the value
     function
      ◦ When the value varies much à unstableà agent seems not to
        know much about the environment à should explore more
      ◦ When the value is stable à not need to explore much
        Mathematical expression
                            $ %#&" ',) $%# ',)
                        !,-          *                          &
    ◦ 𝑓 𝑠( , 𝑎( , 𝜎 =       $ %#&" ',) $%# ',)
                                                 =       $ %#&" ',) $%# ',)
                                                                              − 1 à 𝑄($! 𝑠, 𝑎 − 𝑄( 𝑠, 𝑎   càng
                        !$-          *               !$-          *
      lớn, 𝑓 càng lớn
    ◦ 𝜖(($!) 𝑠 = 𝛿𝑓 𝑠( , 𝑎( , 𝜎 + 1 − 𝛿 𝜖             (   𝑠
                                                                                                             30
Other exploration strategies
◦ Sigmoid Epsilon
                         -
 ◦ 𝑃HWMXEGH = 1 −
                    -,H 34×678(:)
 ◦ 𝜔: hyper-parameter
 ◦ 𝑣𝑎𝑟 𝑞 : variance of the values
   ◦ When 𝑣𝑎𝑟 𝑞 is sufficiently large à there is an action that dominates the others à
     should be chosen
   ◦ When 𝑣𝑎𝑟 𝑞 is small à all actions are equals à should increase explore
                                                                                     31
Tabular RL
◦ Use a table (named Q-table) to
  represent the action values
                                            State   Action   Q-value
◦ Update Q-table after performing
  every action                               𝑆!      𝐴!      𝑄(𝑆! , 𝐴! )
                                             𝑆!      𝐴"      𝑄(𝑆! , 𝐴" )
◦ Drawbacks
 ◦ When state/action spaces are too large
   à table size gets large too
                                             𝑆#      𝐴$      𝑄(𝑆# , 𝐴$ )
 ◦ Cannot deal with newly appear
   state/actions
    ◦ When training the model offline
                                                                     32
Approximate RL
◦Represent the relation of (𝑠, 𝑎, 𝑞) by a function or a
 model: 𝑞(𝑠,
        ' 𝑎)
 ◦ Input: a vector reprenting 𝑠
                 N 𝑎) are updated after every action
 ◦ Paramters of 𝑞(𝑠,
    ◦ Using gradient descent
◦Advantages
 ◦ Handle a large state/action space
 ◦ Can deal with newly appearing states
                                                          33
Approximate RL
Supervised learning:
                             Problems:
𝑞<𝕨 𝑠, 𝑎 , 𝑞∗ 𝑠, 𝑎                                       Solution:
     1                   "   In most cases, 𝑞∗ 𝑠, 𝑎 is
𝐿 = 𝑞∗ 𝑠, 𝑎 − 𝑞<𝕨 𝑠, 𝑎                                   Estimates 𝑞∗ 𝑠, 𝑎
     2                       unknown
𝕨 ÷ 𝕨 − 𝛼∇𝐿
                                                                         34
Recall about tabular RL
 ◦ Tabular Q-learning:
 ◦ 𝑄 𝑠, 𝑎 = 1 − 𝛼 𝑄 𝑠, 𝑎 + 𝛼 𝑟 + 𝛾 max 𝑄 𝑠 _ , 𝑎_
                                               @_
            = 𝑄 𝑠, 𝑎 + 𝛼 𝑟 + 𝛾 max 𝑄 𝑠 - , 𝑎- − 𝑄 𝑠, 𝑎
                                          .-
                                   Estimated value: 𝑌         Observed value: 𝑋
                                           b
      -                                            cd
 𝐿=       𝑟 + 𝛾 max 𝑄 𝑠 _ , 𝑎_ − 𝑄 𝑠, 𝑎        à        = − 𝑟 + 𝛾 max 𝑄 𝑠 _ , 𝑎_ − 𝑄 𝑠, 𝑎
      b          @_                                ce               @_
 𝑄 𝑠, 𝑎 = 𝑄 𝑠, 𝑎 − 𝛼∇𝐿 à Gradient descent
                                                                                            35
Approximate RL
◦ Tabular Q-learning
                                                 b
       -
 ◦𝐿=        𝑟 + 𝛾 max 𝑄 𝑠 _ , 𝑎_ − 𝑄 𝑠, 𝑎            ; 𝑄 𝑠, 𝑎 = 𝑄 𝑠, 𝑎 − 𝛼∇𝐿
       b               @_
                        𝑌                𝑋
◦ Approximate Q-learning: 𝑄𝕨 𝑠, 𝑎
                                             &
       !
 ◦𝐿=       𝑟 + 𝛾 max 𝑄𝕨 𝑠 . , 𝑎. − 𝑄𝕨 𝑠, 𝑎
       &          #.
 ◦ E.g: 𝑄𝕨 𝑠, 𝑎 = 𝑤! 𝑓! 𝑠, 𝑎 + 𝑤& 𝑓& 𝑠, 𝑎 + … + 𝑤0 𝑓0 𝑠, 𝑎
   ◦ 𝑤1 ⟵ 𝑤1 + 𝛼 𝑟 + 𝛾 max 𝑄𝕨 𝑠 . , 𝑎. − 𝑄𝕨 𝑠, 𝑎         𝑓1 𝑠, 𝑎
                            #.
                                                                               36
Approximate RL
◦ Approximate Q-learning: 𝑄𝕨 𝑠, 𝑎
                                    '   '                )*𝕨
 ◦ 𝑤% ⟵ 𝑤% + 𝛼 𝑟 + 𝛾 max 𝑄𝕨 𝑠 , 𝑎 − 𝑄𝕨 𝑠, 𝑎              )+$
                           &'
◦ SARSA
 ◦ Tabular SARSA: 𝑄 𝑠, 𝑎 = 1 − 𝛼 𝑄 𝑠, 𝑎 + 𝛼 𝑟 + 𝛾𝑄 𝑠 % , 𝑎%
                                                 %   %         '(𝕨
 ◦ Approximate SARSA: 𝑤& ⟵ 𝑤& + 𝛼 𝑟 + 𝛾𝑄𝕨 𝑠 , 𝑎 − 𝑄𝕨 𝑠, 𝑎
                                                               ')'
                                                                     37
Approximate RL
◦ 𝑄𝕨 𝑠, 𝑎
 ◦ Can be a very simple linear function
 ◦ Can be a machine learning model, ANN, …
◦ When 𝑄𝕨 𝑠, 𝑎 is a deep learning model à we have deep
  reinforcement learning
                                     Action 1, Q-value 1
               state
                        DNN
                                     Action n, Q-value 2
                                                           38
Tabular RL vs Deep RL
                        State   Action   Q-value
                         𝑆!      𝐴!      𝑄(𝑆! , 𝐴! )
              𝑆( , 𝐴)    𝑆!      𝐴"      𝑄(𝑆! , 𝐴" )   𝑄(𝑆( , 𝐴) )
 Tabular RL
                         𝑆#      𝐴$      𝑄(𝑆# , 𝐴$ )
                                                         𝐴! , 𝑄(𝑆( , 𝐴! )
                                                         𝐴" , 𝑄(𝑆( , 𝐴" )
              𝑆(
   DRL                          DNN
                                                          𝐴$ , 𝑄(𝑆( , 𝐴$ )
                                                                             39
Agenda
◦ Definition and basic terms
◦ Value-based RL
 ◦ Tabular RL
 ◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
                               40
Policy-based RL
                   Estimate values (action value, state value)
◦ Monte Carlo
◦ TD                                                             Value-based RL
◦ SARSA
                    Choose an action based on the value
◦ Q-learning
 What if we learn to make the decision directly instead of estimating
 value and then making the decision based on the value?
                                Policy-based RL
                                                                           41
Value-based RL vs Policy-based RL
       Q-learning, SARSA,
                               𝜖-greedy, 𝜖-soft, …
       Monte carlo, TD, …
       Value estimation         Action selection
                     Value-based RL
                                                     42
Value-based RL vs Policy-based RL
         Q-learning, SARSA,
                                 𝜖-greedy, 𝜖-soft, …
         Monte carlo, TD, …
         An Value
             end2end policy
                      Action selection
             estimation
                    Policy-based RL
                                                       43
Value-based RL
◦ Objective
 ◦ Policy function: 𝜋 𝑎 𝑠 : the probability for performing an action 𝑎 at
   state 𝑠
◦ Main idea
 ◦ 𝜋 𝑎 𝑠, Θ : 𝜋 can be a function, a model with parameters Θ needed to
   be learned
 ◦ 𝒥(Θ): scalar performance measure
                  Z' ) à gradient ascent method à why not gradient
 ◦ 𝜃',- = 𝜃' + 𝛼∇𝒥(Θ
   descent?
            Estimation of 𝒥(Θ* )   Questions:
                                   1. How to define 𝒥 Θ( ?
                                                       V( )?
                                   2. How to estimate ∇𝒥(Θ
                                                                            44
Objective function 𝒥 Θ
◦ For episodic environments
 ◦ 𝒥 Θ = 𝔼& [𝐺- = 𝑅- + 𝛾𝑅b + ⋯ + 𝛾 Lh-𝑅L ]
◦ For continuing environemnts
 ◦ Expectation of state value
                                          )(3)
   ◦ 𝒥 Θ = 𝔼2 𝑉 𝑠   = ∑3 𝑑(𝑠) 𝑉 𝑠 = ∑3 ∑        +) 𝑉   𝑠 : expectation over all the states
                                         '+ )(3
 ◦ Expectation of reward
   ◦ 𝒥 Θ = 𝔼2 𝑟 = ∑3 𝑑(𝑠) ∑# 𝜋 𝑎 𝑠, Θ 𝑅 𝑎, 𝑠
                                                                                             45
𝒥 Θ estimation
                ,! )
◦𝜃!"# = 𝜃! + 𝛼 ∇𝒥(Θ
◦Goals
 ◦ Find an expression proportional with ∇𝒥 Θ à 𝐹 ∝ ∇𝒥 Θ
  ◦ Preferably is the expected value of a variable 𝑒
  ◦ à 𝐹 can be approximated by samples of 𝑒
                                                          46
𝒥 Θ estimation
                ,! )
◦𝜃!"# = 𝜃! + 𝛼 ∇𝒥(Θ
◦Goals
 ◦ Find an expression proportional with ∇𝒥 Θ à 𝐹 ∝ ∇𝒥 Θ
  ◦ Preferably is the expected value of a variable 𝑒
  ◦ à 𝐹 can be approximated by samples of 𝑒
                                                          47
𝒥 Θ estimation
◦ Definition of 𝒥 Θ
  ◦ 𝒥 Θ = 𝔼, 𝐺# = 𝑅# + 𝛾𝑅$ + ⋯ + 𝛾 %&# 𝑅% = 𝑣, (𝑠) ): state-value
    at the first state
 à𝑣! 𝑠 = ∑" 𝜋 𝑎 𝑠 𝑞! (𝑠, 𝑎)
 à∇𝑣! 𝑠 = ∑" ∇𝜋 𝑎 𝑠 𝑞! 𝑠, 𝑎 + 𝜋(𝑎|𝑠)∇𝑞! 𝑠, 𝑎
 à∇𝒥 Θ = ∇𝑣! 𝑠# ∝ ∑$ 𝜇 𝑠 ∑" ∇𝜋 𝑎 𝑠 𝑞! 𝑠, 𝑎 (Policy Gradient
  Theorem)
          Probability for 𝑠 to occur   The gradient of policy 𝜋   Action value
                                                                                 48
𝒥 Θ estimation
                ,! )
◦𝜃!"# = 𝜃! + 𝛼 ∇𝒥(Θ
◦Goals
 ◦ Find an expression proportional with ∇𝒥 Θ à 𝐹 ∝ ∇𝒥 Θ
  ◦ Preferably is the expected value of a variable 𝑒
  ◦ à 𝐹 can be approximated by samples of 𝑒
                                                          49
𝒥 Θ estimation for episode environement
 ∇𝒥 Θ ∝ ^ 𝜇 𝑠 ^ ∇𝜋 𝑎 𝑠 𝑞2 𝑠, 𝑎
        3        #
            = 𝔼2 ^ ∇𝜋 𝑎 𝑆( 𝑞2 𝑆( , 𝑎
                     #
        Expected value of this variable à approximate the expectation by the sample
                      V
                     ∇𝒥 Θ = ^ ∇𝜋 𝑎 𝑆( 𝑞2 𝑆( , 𝑎
                             #
              Θ($! ÷ Θ( + 𝛼 ^ ∇𝜋 𝑎 𝑆( 𝑞2 𝑆( , 𝑎
                              #
              Θ($! ÷ Θ( + 𝛼 ^ ∇𝜋 𝑎 𝑆( , Θ( 𝑞b 𝑆( , 𝑎, 𝕨
                              #
                                                                              50
  REINFORCE algorithm (1992)
◦ REINFORCE = REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility
   ◦ Main idea
       Θ($! ÷ Θ( + 𝛼 ^ ∇𝜋 𝑎 𝑆( , Θ( 𝑞2 𝑆( , 𝑎
                     #
               convert this into the expected value of a variable à approximate by its sample
                                  The sum over all actions 𝑎
                                  à to convert into the expected value over variales a à we need 𝜋 𝑎 𝑆𝑡
                                                    ∇𝜋 𝑎 𝑆* , Θ*
  i ∇𝜋 𝑎 𝑆*     𝑞- 𝑆* , 𝑎 = i 𝜋 𝑎 𝑆* , Θ* 𝑞- 𝑆* , 𝑎
                                                     𝜋 𝑎 𝑆* , Θ*
   ,                         ,
                                             ∇𝜋 𝐴* 𝑆* , Θ*                ∇𝜋 𝐴* 𝑆* , Θ*
                          = 𝔼- 𝑞- 𝑆* , 𝐴*                       = 𝔼-   𝐺*                 Vì 𝑞- 𝑆* , 𝐴* = 𝔼- 𝐺* |𝑆* , 𝐴*
                                              𝜋 𝐴* 𝑆* , Θ*                 𝜋 𝐴* 𝑆* , Θ*
                                             ∇-   𝐴* 𝑆* , Θ*                                  ∇-   𝐴* 𝑆* , Θ*
                                     𝔼- 𝐺*                     can be approximated by 𝐺*
                                             -    𝐴* 𝑆* , Θ*                                   -   𝐴* 𝑆* , Θ*
                                                                                                                     51
   REINFORCE algorithm (1992)
                                                                    Hướng của ∇𝜋 làm tăng xác suất lặp lại hành động 𝐴*
              V       ∇2 𝐴 𝑆 , Θ
           ◦ ∇𝒥 Θ = 𝐺( 2 𝐴 ( 𝑆 (, Θ (                               Tức là: nếu tăng Θ một lượng tỉ lệ thuận với
                                (   (     (
                                                                    ∇𝜋 𝐴* 𝑆* , Θ* thì càng ngày 𝐴* sẽ đươc thực hiện
                                ∇2 𝐴( 𝑆( , Θ(                       càng nhiều.
         à Θ($! ÷ Θ( + 𝛼𝐺(
                                 2 𝐴( 𝑆( , Θ(
𝐺* ×∇𝜋 𝐴* 𝑆* , Θ* : nếu 𝐺* càng lớn thì
càng khuyến khích tăng Θ theo hướng             ∇-   𝐴* 𝑆* , Θ*
                                                                : nếu𝜋 𝐴* 𝑆* , Θ* càng lớn thì càng không
∇𝜋 𝐴* 𝑆* , Θ*                                   -    𝐴* 𝑆* , Θ*
                                                khuyến khích tăng Θ theo hướng ∇𝜋 𝐴* 𝑆* , Θ*
                                                à Tránh trường hợp thực hiện một hành động quá nhiều lần
                                                                                                                   52
REINFORCE algorithm (1992)
                             53
REINFORCE with Baseline
◦ ∇𝒥 Θ ∝ ∑d 𝜇 𝑠 ∑. ∇𝜋 𝑎 𝑠 𝑞, 𝑠, 𝑎
 àLarge values of 𝑞& 𝑠, 𝑎 will dominate ∇𝒥 Θ
 àUse baseline to normalize 𝑞& 𝑠, 𝑎
∇𝒥 Θ ∝ a 𝜇 𝑠 a ∇𝜋 𝑎 𝑠 𝑞& 𝑠, 𝑎 − 𝑏 𝑠
                                                           = ∇ a 𝜋 𝑎 𝑠 = ∇1 = 0
           B         @
                                                                   @
a 𝜇 𝑠 a ∇𝜋 𝑎 𝑠 𝑞& 𝑠, 𝑎 − 𝑏 𝑠
 B         @
=∑B 𝜇 𝑠     ∑@ ∇𝜋 𝑎 𝑠 𝑞& 𝑠, 𝑎      − 𝑏 𝑠 ∑@ ∇𝜋 𝑎 𝑠
 How to choose 𝑏 𝑠 :
 - 𝑏 𝑠 can be any thing as long as its value is independent of 𝑎
 - We use 𝑏 𝑠 to decrease the variance
                                                                                  54
REINFORCE with Baseline
                     ∇,   𝐴! 𝑆! , Θ!
   Θ!"# ÷ Θ! + 𝛼𝐺!   ,    𝐴! 𝑆! , Θ!
                                       ∇,   𝐴! 𝑆! , Θ!
à Θ!"# ÷ Θ! + 𝛼 𝐺! − 𝑏(𝑆! ) 𝐺!
                                       ,    𝐴! 𝑆! , Θ!
     The expected value of 𝑣 𝑆𝑡
     à A natual selection of 𝑏(𝑆𝑡 ) is 𝑣Z
                                        𝑆𝑡 , 𝑣(𝑆
                                             N 𝑡 , 𝕨)
                ' ! , 𝕨! )
  𝕨!"# ÷ 𝕨! + 𝛽∇𝑣(𝑆
                                                         55
REINFORCE with Baseline
                          56
REINFORCE with Baseline
                          57
Actor-critic learning
◦ Definition: Methods that learn approximations to both
  policy and value functions are often called actor–critic
  methods,
 ◦ Actor: the learned policy
 ◦ Critic: the learned value function
   ◦ Usually a state-value function
◦ Is policy-based RL with baseline is actor-critic learning?
 ◦ If the baseline is stationary value that never updates with experience à NOT actor-
   critic
 ◦ If the baseline is estimated from experience à can be called an actor-critic
   method
                                                                                     58
Actor-critic learning
                              𝐴! 𝑆! , Θ!
                                ∇,
 Θ!"# ÷ Θ! + 𝛼 𝐺! − 𝑏(𝑆! )
                            , 𝐴! 𝑆! , Θ!
                               ∇, 𝐴! 𝑆! , Θ!
à Θ!"# ÷ Θ! + 𝛼 𝐺! − 𝑣(𝑆
                     P ! , 𝕨) , 𝐴 𝑆 , Θ        REINFORCE with baseline
                                   ! !     !
          𝑅!"# + 𝛾𝑣P 𝑆!"# , 𝕨
                                                     ∇%   𝐴! 𝑆! , Θ!
à Θ!"# ÷ Θ! + 𝛼 𝑅!"# + 𝛾𝑣' 𝑆!"# , 𝕨 − 𝑣' 𝑆! , 𝕨
                                                      %   𝐴! 𝑆! , Θ!
  𝕨!"# ÷ 𝕨! + 𝛽 𝑅!"# + 𝛾𝑣' 𝑆!"# , 𝕨 − 𝑣' 𝑆! , 𝕨 ∇𝑣(𝑆
                                                 ' ! , 𝕨! )
                                                                       59
Agenda
◦ Definition and basic terms
◦ Value-based RL
 ◦ Tabular RL
 ◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
                               60
    Applications in networking
    ◦ Routing in WSN
       ◦ How to forward packet from S to D?
    ◦ Reinforcement learning modeling
       ◦ Agent: sensor nodes
       ◦ RL model: Q-learning
                                                         Negative rewarding
•   Khanh Le, Nguyen Thanh Hung, Kien Nguyen, Phi Le Nguyen, “Exploiting Q-learning in Extending the Network Lifetime of
    Wireless Sensor Networks with Holes”, ICPADS 2019
•   Phi Le Nguyen, Nang Hung Nguyen, Tuan Anh Nguyen Dinh, Khanh Le, Thanh Hung Nguyen, and Kien Nguyen, “QIH: An
    Efficient Q-Learning Inspired Hole-Bypassing Routing Protocol for WSNs”, IEEE Access, 2021
                                                                                                                           61
Applications in networking
◦ Wireless charging algorithm
    ◦ Where should the mobile charger move to?
    ◦ How long should the mobile charger charge to
      sensors?
◦ Techniques: Q-learning + Fuzzy
•   La Van Quan, Phi Le Nguyen, Thanh-Hung Nguyen, Kien Nguyen, “Q-learning-based, Optimized On-
    demand Charging Algorithm in WRSN”, NCA 2020
•   Phi Le Nguyen, Van Quan La, Anh Duy Nguyen, Thanh Hung Nguyen, and Kien Nguyen, “An On-Demand
    Charging for Connected Target Coverage in WRSNs Using Fuzzy Logic and Q-Learning”, MDPI Sensors 2021
                                                                                                     62
    Applications in networking
    ◦ Offloading in MEC
      ◦ Where to offload the tasks
      ◦ RL model: MAB
      ◦ Exploration strategy
•    Nang Hung Nguyen, Phi Le Nguyen, Hieu Dinh, Thanh Hung Nguyen, Kien Nguyen, “Multi-Agent Multi-Armed Bandit Learning for
     Offloading Delay Minimization in V2X Networks”, EUC 2021
•    Trung Thanh Nguyen, Truong Thao Nguyen, Tuan Anh Nguyen Dinh, Thanh-Hung Nguyen, Phi Le Nguyen, “Q-learning-based
     Opportunistic Communication for Real-time Mobile Air Quality Monitoring Systems”, IPCCC 2021
                                                                                                                      63
Source code
◦ https://colab.research.google.com/drive/1-
  _6DVuOjJWBp2V5x53uTZC9U5Z1IoYiT#scrollTo=GzO_xDRDr5NP
                                                          64