CSEN3082 DEEP LEARNING
O. V. Ramana Murthy
    Course Outcome 1
    Understand the role of neural networks and
    their various applications.
2
    Contents
    Artificial Neuron
    Feed Forward Networks
    Gradient descent
    Back propagation
    Regularization techniques
    Norm penalties as constrained optimization
3
    Reference
    Chapter 1, 2
    Charu C. Aggarwal, “Neural Networks and
    Deep Learning”, Springer International
    Publishing AG
    Deep Learning - Charu Aggarwal
    https://www.youtube.com/playlist?list=PLLo1RD8Vbbb_6gCyqxG_qzCLOj9EKubw7
    Ian Goodfellow, Yoshua Bengio, Aaron
     Courville, Deep Learning, MIT Press, 2016
4
    Neuron
5
     Perceptron
    The simplest neural network is referred to as the perceptron. This neural
    network contains a single input layer and an output node.
6
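         Artificial Neuron
    [Figure: inputs x1, x2, ..., xn with weights w1, w2, ..., wn and a
    constant input 1 with bias b feed a summing node that produces y;
    the activation function f_act maps y to the output ŷ.]
    $y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$
    $\hat{y} = f_{act}(y)$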
7
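    Numerical Example 1
    Calculate the output assuming a binary step activation function.
    [Figure: neuron with inputs x1 = 0.2 and x2 = 0.6, weights w1 = 0.3
    and w2 = 0.7, and bias b = 0.45 on a constant input of 1.]
    $y = 0.3(0.2) + 0.7(0.6) + 0.45 = 0.93$
8
    $\hat{y} = f_{activation}(y) = \begin{cases} 1 & \text{if } y \ge 0 \\ 0 & \text{if } y < 0 \end{cases}$
    Since y = 0.93 ≥ 0, the output is ŷ = 1.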
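The calculation in Numerical Example 1 can be checked with a few lines of Python. This is a minimal sketch; the function names `neuron_net_input` and `binary_step` are illustrative choices, not names from the slides.

```python
def neuron_net_input(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias (the value y on the slide)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def binary_step(y):
    """Binary step activation: 1 if y >= 0, otherwise 0."""
    return 1 if y >= 0 else 0


# Values read from Numerical Example 1.
y = neuron_net_input(inputs=[0.2, 0.6], weights=[0.3, 0.7], bias=0.45)
y_hat = binary_step(y)
print(round(y, 2), y_hat)  # prints: 0.93 1
```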
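     Artificial Neuron model
     Sigmoidal function
     $f(y) = \frac{1}{1 + e^{-\lambda y}}$
     λ = 1 is the binary sigmoidal function.
     [Figure: neuron with inputs x1, x2, weights w1, w2, and net input y.]
9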
ACTIVATION FUNCTIONS
                (A)Identity/linear
                (B) Bipolar Binary step
                (C) Bipolar sigmoidal step
                                             10
   Source [2]
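     Numerical Example 2
     Calculate the output assuming a binary sigmoidal activation
     function (steepness parameter = 1).
     [Figure: neuron with inputs x1 = 0.2 and x2 = 0.6, weights w1 = 0.3
     and w2 = 0.7, and bias b = 0.45 on a constant input of 1.]
     $y = 0.3(0.2) + 0.7(0.6) + 0.45 = 0.93$
     $\hat{y} = \frac{1}{1 + e^{-0.93}} \approx 0.72$
11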
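The same neuron with a binary sigmoidal activation can be sketched as follows. The `steepness` argument is an assumed name for the parameter that equals 1 in the binary sigmoidal case.

```python
import math


def binary_sigmoid(y, steepness=1.0):
    """Binary sigmoidal activation 1 / (1 + exp(-steepness * y))."""
    return 1.0 / (1.0 + math.exp(-steepness * y))


# Net input from Numerical Example 2 (same neuron as Example 1).
y = 0.2 * 0.3 + 0.6 * 0.7 + 0.45
y_hat = binary_sigmoid(y)
print(round(y_hat, 2))  # prints: 0.72
```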
     Pre- and Post-Activation
     Values
12
     Gradient Descent
     Gradient represents how STEEP a slope is.
13
     Gradient Descent
     Given a differentiable function f(x),
     gradient descent searches for a (local) minimum
     by updating the variable x in steps
     proportional to the negative of the gradient
     (derivative) f'(x) at the current point.
14
     Gradient Descent
      Gradient represents how STEEP a slope is.
      Uphill is positive; downhill is negative.
      One dimension: derivative. Several dimensions: gradient.
      Drawback: it may settle in a local minimum.
15
     Gradient Search
     For unconstrained optimization, i.e., no
     constraints.
     The gradient of the function must exist.
     1. Initialization. Choose an initial point $x_k$ and let k = 0.
     2. Calculate the derivative $f'(x_k)$, which indicates the
        slope of $f$ at $x_k$.
     3. Update $x_{k+1} = x_k - \alpha f'(x_k)$, where $\alpha$ is the learning rate.
     4. Repeat steps 2-3 until convergence, i.e., until the change
        $|x_{k+1} - x_k|$ falls below a small tolerance, or the
        maximum number of iterations is reached.
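The four steps above can be sketched in Python. The test function f(x) = (x² − 1)² is a hypothetical choice made only for illustration (the slides do not state their function); it has local minima at x = −1 and x = +1, so the result depends on the starting point.

```python
def gradient_descent(df, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Iterate x <- x - alpha * df(x) until the step size drops below tol."""
    x = x0
    iterations = 0
    for iterations in range(1, max_iter + 1):
        step = alpha * df(x)      # step 2: slope at the current point
        x = x - step              # step 3: move against the slope
        if abs(step) < tol:       # step 4: convergence check
            break
    return x, iterations


# Hypothetical function (an assumption, not taken from the slides).
def f(x):
    return (x**2 - 1) ** 2


def df(x):
    return 4 * x * (x**2 - 1)


x_left, n_left = gradient_descent(df, x0=-0.5)
x_right, n_right = gradient_descent(df, x0=0.5)
print(x_left, f(x_left), n_left)
print(x_right, f(x_right), n_right)
```

Starting on either side of x = 0 leads to a different minimum, which illustrates that the method settles in whichever local minimum is downhill from the starting point.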
16
     Gradient function
17
     Case 1: Initial value x0 = −2 (learning rate α = 0.1)
     Iteration (n)   x_n      f'(x_n)   x_n+1 = x_n − α f'(x_n)         f(x_n)
     0               −2       −16       −2 + 0.1×16 = −0.4              12
     1               −0.4     1.28      −0.4 − 0.1×1.28 = −0.528        3.23
     2               −0.528   1.39      −0.528 − 0.1×1.39 = −0.667      2.9
     3               −0.667   1.48      −0.667 − 0.1×1.48 = −0.815      2.65
     4               −0.815   1.6       −0.815 − 0.1×1.6 = −0.975       2.5
     The trajectory is converging toward the local minimum at x = −1.
18
     Case 2: Initial value x0 = 2 (learning rate α = 0.1)
     Iteration (n)   x_n      f'(x_n)   x_n+1 = x_n − α f'(x_n)         f(x_n)
     0               2        16        2 − 0.1×16 = 0.4                12
     1               0.4      −1.28     0.4 − 0.1×(−1.28) = 0.528       3.23
     2               0.528    −1.39     0.528 − 0.1×(−1.39) = 0.667     2.9
     3               0.667    −1.48     0.667 − 0.1×(−1.48) = 0.815     2.65
     4               0.815    −1.6      0.815 − 0.1×(−1.6) = 0.975      2.5
     The trajectory is converging toward the local minimum at x = 1.
19
     Effect of learning rate
20
     Chain Rule
     The chain rule is applied to calculate the
     gradients of the loss function with respect to the
     weights and biases (different layers) of the
     network. These gradients are then used to
     update the parameters during optimization
     (e.g., using gradient descent).
     The Chain Rule: Formula
     If you have a composite function $y = f(g(x))$, the derivative
     with respect to x is given by:
     $\frac{dy}{dx} = f'(g(x)) \, g'(x)$
     This principle is used extensively in updating the
     weights of different layers using the loss
     function at the output layer.
21
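A small numerical check can make the chain rule concrete. The sketch below assumes a single sigmoid neuron with a squared-error loss, L(w) = ½ (t − f(wx + b))²; the values of x, b, t, and w are illustrative. It multiplies the three local derivatives and compares the product with a finite-difference estimate.

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def loss(w, x=0.6, b=0.45, t=1.0):
    """L(w) = 0.5 * (t - sigmoid(w*x + b))**2, a composition of three functions."""
    return 0.5 * (t - sigmoid(w * x + b)) ** 2


def loss_grad_chain_rule(w, x=0.6, b=0.45, t=1.0):
    y = w * x + b             # innermost function: net input
    Y = sigmoid(y)            # middle function: activation
    dL_dY = -(t - Y)          # derivative of 0.5*(t - Y)**2 with respect to Y
    dY_dy = Y * (1.0 - Y)     # derivative of the sigmoid
    dy_dw = x                 # derivative of w*x + b with respect to w
    return dL_dY * dY_dy * dy_dw


w = 0.7
h = 1e-6
numeric = (loss(w + h) - loss(w - h)) / (2 * h)   # central difference
analytic = loss_grad_chain_rule(w)
print(numeric, analytic)
```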
     Linear Separability
22
     Linearly Separable – AND gate
     x1 (input)   x2 (input)   Y (output)
     0            0            0
     0            1            0
     1            0            0
     1            1            1
23
     Linearly Separable – AND
     gate
     Two input sources => Two input neurons
     One output => One output Neuron.
      Activation function is binary sigmoidal: $f(y) = \frac{1}{1 + e^{-y}}$
      Derivative: $f'(y) = f(y)\,(1 - f(y))$
24
     Linearly Separable – AND
     gate
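     [Figure: single neuron with a constant input 1 (weight w0) and
     inputs x1, x2 (weights w1, w2); the net input y passes through
     f(.) to give the output Y.]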
25
     Back-propagation training/algorithm
     Given: input vector $x = [x_1, x_2]$ at the i-th instant, target $t$.
     Initialize weights w0, w1, w2 and learning rate $\alpha$
     with some random values in the range [0, 1].
     1. Output: $y = w_0 + w_1 x_1 + w_2 x_2$
     2. Activation function: $Y = f(y)$, the sigmoidal activation
         function
     3. Compute error: $e = t - Y$
     4. Backpropagate the error across the
         activation function: $\delta = e \, f'(y)$,
        where $f'(y)$ is the derivative of the activation function
        selected; $f'(y) = Y(1 - Y)$ for the sigmoidal activation function.
26
     Back-propagation training/algorithm
     5. Compute change in weights and bias:
        $\Delta w_0 = \alpha\,\delta$, $\Delta w_1 = \alpha\,\delta\,x_1$, $\Delta w_2 = \alpha\,\delta\,x_2$
     6. Update the weights and bias: $w_i \leftarrow w_i + \Delta w_i$
     7. Repeat steps 1-6 for all input
        combinations (4 in number). This is one epoch.
     8. Run multiple epochs until the error decreases and
        stabilizes.
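The eight steps can be sketched for the AND gate as follows. This is an illustrative implementation, not the course's reference code: the learning rate 0.5, the 5000 epochs, and the random seed are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# AND gate truth table: four input combinations and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 0, 0, 1], dtype=float)

w = rng.uniform(0, 1, size=2)   # w1, w2 initialized in [0, 1]
w0 = rng.uniform(0, 1)          # bias weight w0
alpha = 0.5                     # learning rate (assumed value)


def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))


for epoch in range(5000):               # step 8: run multiple epochs
    for x, t in zip(X, T):              # step 7: one pass over 4 inputs = one epoch
        Y = sigmoid(w0 + w @ x)         # steps 1-2: net input and activation
        delta = (t - Y) * Y * (1 - Y)   # steps 3-4: error times f'(y)
        w += alpha * delta * x          # steps 5-6: weight change and update
        w0 += alpha * delta             # steps 5-6: bias change and update

outputs = sigmoid(w0 + X @ w)
predictions = (outputs >= 0.5).astype(int)
print(np.round(outputs, 3), predictions)
```

Thresholding the trained outputs at 0.5 recovers the AND truth table, which is possible because the problem is linearly separable.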
27
         (4 Rules) Backpropagating Error
         [Figure: four rules for carrying the error e backward.]
         1. Output neuron: the error at stage Y is carried back
            through f(.) to stage y.
         2. Across link: the error at stage y is carried back across
            the weight wi to stage xi.
         3. Weights update.
28
         (4 Rules) Backpropagating Error
         4. Across link (>1 hidden layer): the errors at stages
            y1, y2, ..., yn are carried back across the weights
            w1, w2, ..., wn to stage xi.
29
       The power of nonlinear activation
     functions in transforming a data set to
                linear separability
30
     Linearly not Separable – XOR gate
     x1 (input)   x2 (input)   Y (output)
     0            0            0
     0            1            1
     1            0            1
     1            1            0
31
     Linearly not Separable – XOR
     gate
     Two input sources => Two input neurons
     One output => One output Neuron.
     One hidden layer => 2 neurons
      Activation function is binary sigmoidal: $f(y) = \frac{1}{1 + e^{-y}}$
      Derivative: $f'(y) = f(y)\,(1 - f(y))$
32
     Linearly not Separable – XOR
     gate
33
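     [Figure: 2-2-1 network. Input layer: x1, x2 and a constant input 1.
     Hidden layer: Z1, Z2 and a constant input 1. Output layer: Y.
     Input-to-hidden weights: v11 (x1 to Z1), v12 (x1 to Z2),
     v21 (x2 to Z1), v22 (x2 to Z2), biases v01 and v02.
     Hidden-to-output weights: w1 (Z1 to Y), w2 (Z2 to Y), bias w0.]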
34
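     Back-propagation Training
     Feed-forward Phase
     Given: inputs $x_i$, target $t$.
     Initialize weights and learning rate $\alpha$ with
     some random values.
     1. Hidden unit net input: $z_j = v_{0j} + \sum_i x_i v_{ij}$, j = 1 to p hidden neurons
     2. Hidden unit output: $Z_j = f(z_j)$, sigmoidal activation function
     3. Output unit net input: $y = w_0 + \sum_j Z_j w_j$
     4. Output: $Y = f(y)$, sigmoidal activation function
35
     Back-propagation Training
     Back-propagation of error Phase
     5. Compute error correction term: $\delta = (t - Y)\, f'(y)$,
        where $f'$ is the derivative of the activation function
     6. Compute change in weights and bias:
        $\Delta w_j = \alpha\,\delta\,Z_j$, $\Delta w_0 = \alpha\,\delta$;
        send $\delta$ to the previous layer
     7. Hidden unit: $\delta_j = \delta\, w_j$
     8. Calculate error term: $\delta_{jj} = \delta_j\, f'(z_j)$
     9. Compute change in weights and bias:
        $\Delta v_{ij} = \alpha\,\delta_{jj}\,x_i$, $\Delta v_{0j} = \alpha\,\delta_{jj}$
36
     Back-propagation Training
     Weights and Bias update phase
     10. Each output unit, k = 1 to m, updates its weights and bias:
         $w_j \leftarrow w_j + \Delta w_j$, $w_0 \leftarrow w_0 + \Delta w_0$
     11. Each hidden unit, j = 1 to p, updates its weights and bias:
         $v_{ij} \leftarrow v_{ij} + \Delta v_{ij}$, $v_{0j} \leftarrow v_{0j} + \Delta v_{0j}$
     12. Check for a stopping criterion, e.g., a certain number of
         epochs or when the network outputs are equal or close to
         the targets.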
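The three phases can be sketched in NumPy for the 2-2-1 XOR network. This is a minimal sketch under stated assumptions: all four patterns are processed together in each pass (batch updates rather than one pattern at a time), and the learning rate 0.5, the initialization range [-1, 1], the seed, and the 10000 epochs are illustrative choices. Whether the network actually learns XOR depends on the initial weights, so the sketch only reports the error before and after training.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

V = rng.uniform(-1, 1, size=(2, 2))    # input-to-hidden weights v_ij
v0 = rng.uniform(-1, 1, size=(1, 2))   # hidden biases v_0j
W = rng.uniform(-1, 1, size=(2, 1))    # hidden-to-output weights w_j
w0 = rng.uniform(-1, 1, size=(1, 1))   # output bias w_0
alpha = 0.5                            # learning rate (assumed value)


def forward(inputs):
    Z = sigmoid(inputs @ V + v0)       # steps 1-2: hidden net input and output
    Y = sigmoid(Z @ W + w0)            # steps 3-4: output net input and output
    return Z, Y


initial_error = float(np.mean((T - forward(X)[1]) ** 2))

for epoch in range(10000):
    Z, Y = forward(X)
    delta = (T - Y) * Y * (1 - Y)                 # step 5: output error term
    delta_hidden = (delta @ W.T) * Z * (1 - Z)    # steps 7-8: hidden error terms
    W += alpha * Z.T @ delta                      # steps 6 and 10
    w0 += alpha * delta.sum(axis=0, keepdims=True)
    V += alpha * X.T @ delta_hidden               # steps 9 and 11
    v0 += alpha * delta_hidden.sum(axis=0, keepdims=True)

final_error = float(np.mean((T - forward(X)[1]) ** 2))
print(initial_error, final_error)
print(np.round(forward(X)[1].ravel(), 3))
```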
               37
     Hidden neuron input computation
     [Each of the following slides repeats the 2-2-1 network figure
     and highlights the quantity being computed.]
     $z_1 = v_{11} x_1 + v_{21} x_2 + v_{01}$
     $z_2 = v_{12} x_1 + v_{22} x_2 + v_{02}$
38
     Hidden neuron output computation
     $Z_1 = f_{sig}(z_1)$
     $Z_2 = f_{sig}(z_2)$
39
     Output neuron input computation
     $y = w_1 Z_1 + w_2 Z_2 + w_0$
40
     Output neuron output computation
     $Y = f_{sig}(y)$
41
     Output error correction computation
     $\delta = (t - Y)\, f'(y)$
42
     Output neuron changes/updates computation
     $\Delta w_0 = 1 \cdot \delta$
     $\Delta w_1 = \delta \cdot Z_1$
     $\Delta w_2 = \delta \cdot Z_2$
     (each change is scaled by the learning rate before it is applied)
43
     Hidden neuron error propagation computation
     $\delta_1 = \delta \cdot w_1$
     $\delta_2 = \delta \cdot w_2$
44
     Hidden neuron error correction computation
     $\delta_{11} = \delta_1 \cdot f'(z_1)$
     $\delta_{22} = \delta_2 \cdot f'(z_2)$
45
     Hidden neuron changes/updates computation
     $\Delta v_{01} = \delta_{11}$, $\Delta v_{11} = \delta_{11}\, x_1$, $\Delta v_{21} = \delta_{11}\, x_2$
     $\Delta v_{02} = \delta_{22}$, $\Delta v_{12} = \delta_{22}\, x_1$, $\Delta v_{22} = \delta_{22}\, x_2$
     (each change is scaled by the learning rate before it is applied)
46
     NN with Two Hidden Layers (HW)
     [Figure: network with two hidden layers; the weights are
     labeled $w^{(1)}_{i,j}$.]
47
     Regularization (to avoid
     Overfitting)
     Overfitting is one of the primary reasons a
     model fails to generalize.
     The objective is to determine a curve that
     defines the border between the two groups
     using the training data.
48
     Overfitting
     Some outliers penetrate the area of the other group
     and disturb the boundary. Because the learning algorithm
     fits all the data, including the noise, it ends up
     producing an improper model (a curve in this case).
     This would be penny-wise and pound-foolish.
50
     Remedy : Regularization
     Regularization is a numerical method that
      attempts to construct a model structure that is as
      simple as possible. The simplified model
      can avoid the effects of overfitting at a
      small cost in performance.
     Cost function (sum of squared errors): $J = \frac{1}{2} \sum_i (t_i - Y_i)^2$
51
     Remedy : Regularization
     For this reason, overfitting of the neural
      network can be reduced by adding the sum
      of the weights to the cost function:
      (new) cost function = sum of squared errors + penalty on the weights
     To lower the value of the cost function,
      both the error and the weights must be kept
      as small as possible.
     However, if a weight becomes small enough,
      the associated nodes will be practically
      disconnected. As a result, unnecessary
      connections are eliminated, and the neural
      network becomes simpler.
52
     Add L1 Regularization to XOR
     Network
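     New loss function: $L_{new} = L + \lambda \sum_i |w_i|$
     The gradient of the regularized loss w.r.t. a
      weight w is: $\frac{\partial L_{new}}{\partial w} = \frac{\partial L}{\partial w} + \lambda\,\mathrm{sign}(w)$
     Update rule for weights w is: $w \leftarrow w - \alpha \left( \frac{\partial L}{\partial w} + \lambda\,\mathrm{sign}(w) \right)$
     Here λ is the regularization strength and α is the learning rate.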
53
     Add L2 Regularization to XOR
     Network
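     New loss function: $L_{new} = L + \frac{\lambda}{2} \sum_i w_i^2$
     The gradient of the regularized loss w.r.t. a
      weight w is: $\frac{\partial L_{new}}{\partial w} = \frac{\partial L}{\partial w} + \lambda\, w$
     Update rule for weights w is: $w \leftarrow w - \alpha \left( \frac{\partial L}{\partial w} + \lambda\, w \right)$
     Here λ is the regularization strength and α is the learning rate.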
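The difference between the two penalties is easiest to see on a single update with the data gradient set to zero. The sketch below assumes the common forms λ Σ|w| for L1 and (λ/2) Σ w² for L2; the function name `regularized_step` and all numbers are illustrative.

```python
import numpy as np


def regularized_step(w, grad, lr, lam, kind):
    """One gradient-descent step on loss + penalty.

    kind="l1": penalty lam * sum(|w|),        gradient term lam * sign(w)
    kind="l2": penalty (lam / 2) * sum(w**2), gradient term lam * w
    """
    if kind == "l1":
        penalty_grad = lam * np.sign(w)
    elif kind == "l2":
        penalty_grad = lam * w
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return w - lr * (grad + penalty_grad)


w = np.array([0.8, -0.05, 0.0])
zero_grad = np.zeros_like(w)   # isolate the effect of the penalty term
w_l1 = regularized_step(w, zero_grad, lr=0.1, lam=0.1, kind="l1")
w_l2 = regularized_step(w, zero_grad, lr=0.1, lam=0.1, kind="l2")
print(w_l1)   # every nonzero weight moves toward zero by the same amount
print(w_l2)   # every weight shrinks in proportion to its size
```

L1 subtracts a constant amount from each nonzero weight, which tends to push small weights to zero, while L2 shrinks each weight proportionally.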
54
     XOR implementation with L1
     # Fragment of the training loop: inputs, the layer outputs, the
     # deltas, and learning_rate are assumed to be defined earlier.
     import numpy as np
     # Apply L1 regularization to weights (penalty coefficient is 1 here)
     hidden_layer_weights += learning_rate * (
         np.dot(hidden_layer_output.T, output_layer_delta)
         - np.sign(hidden_layer_weights))
     input_layer_weights += learning_rate * (
         np.dot(inputs.T, hidden_layer_delta)
         - np.sign(input_layer_weights))
     # Update biases (no regularization applied to biases)
     hidden_layer_bias += learning_rate * np.sum(
         output_layer_delta, axis=0, keepdims=True)
     input_layer_bias += learning_rate * np.sum(
         hidden_layer_delta, axis=0, keepdims=True)
55
     XOR implementation with L2
     # Fragment of the training loop: inputs, the layer outputs, the
     # deltas, and learning_rate are assumed to be defined earlier.
     import numpy as np
     # Apply L2 regularization to weights (penalty coefficient is 1 here)
     hidden_layer_weights += learning_rate * (
         np.dot(hidden_layer_output.T, output_layer_delta)
         - hidden_layer_weights)
     input_layer_weights += learning_rate * (
         np.dot(inputs.T, hidden_layer_delta)
         - input_layer_weights)
     # Update biases (no regularization applied to biases)
     hidden_layer_bias += learning_rate * np.sum(
         output_layer_delta, axis=0, keepdims=True)
     input_layer_bias += learning_rate * np.sum(
         hidden_layer_delta, axis=0, keepdims=True)
56
     Norm Penalties as Constrained
     Optimization
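     Denote the regularized objective function
     $\tilde{J}(W) = J(W) + \alpha\,\Omega(W)$
     $J$ is a loss function, e.g., MSE or cross-entropy.
     $\Omega$ is a penalty function.
     $\alpha$ is a hyperparameter. Setting α to 0 results in no
     regularization. Larger values of α correspond
     to more regularization.
     L2 Norm Parameter Regularization (ridge
        regression or Tikhonov regularization):
     if $\Omega(W) = \frac{1}{2} \lVert W \rVert_2^2$
     L1 Norm Regularization:
     if $\Omega(W) = \lVert W \rVert_1 = \sum_i |w_i|$
57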
     Norm Penalties as Constrained
     Optimization
     Consider the objective function with L2 norm
     regularization:
     $\tilde{J}(W) = J(W) + \frac{\alpha}{2} \lVert W \rVert_2^2$
     Transform the L2 regularization term into a
     constraint:
     minimize $J(W)$ subject to $\lVert W \rVert_2 \le \tau$
     where τ > 0 is a constraint hyperparameter
     controlling the regularization strength.
     • Larger τ: weaker regularization (equivalent to
       smaller α).
     • Smaller τ: stronger regularization (equivalent
       to larger α).
58
     Norm Penalties as Constrained
     Optimization
     The constrained optimization can be
     reformulated using the Lagrangian multiplier
     λ:
     $\mathcal{L}(W, \lambda) = J(W) + \lambda \left( \lVert W \rVert_2 - \tau \right)$
     Here, λ acts as a penalty parameter for
     violating the constraint $\lVert W \rVert_2 \le \tau$.
     This formulation helps in using techniques
     like dual optimization or projected
     gradient descent to enforce constraints
     during optimization.
59
     Norm Penalties as Constrained
     Optimization – Backpropagation
     Consider
     Containing the MSE and the constraint term
     1. Compute Gradients for MSE Term
     The gradients of the MSE term with respect to the
     weights and biases are:
     Propagate through the layers as in standard
     backpropagation.
     2. Compute Gradients for L2 Constraint
     The L2 regularization gradient is:
60
     This term is added to the gradient from the MSE
     during weight updates.
     Norm Penalties as Constrained
     Optimization – Backpropagation
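     Enforcing the Constraint ($\lVert W \rVert_2 \le \tau$)
     To ensure the constraint is satisfied, after
     each weight update:
     1. Check the L2 norm $\lVert W \rVert_2$.
     2. Project back if necessary: if $\lVert W \rVert_2 > \tau$, rescale W
     as follows:
     $W \leftarrow W \cdot \frac{\tau}{\lVert W \rVert_2}$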
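The check-and-project step can be sketched as below. It assumes the constraint is on the L2 norm of all weights taken together, and the weight values are hypothetical (they are not the values of the numerical example in the slides). The function name `project_to_l2_ball` is an illustrative choice.

```python
import numpy as np


def project_to_l2_ball(W, tau):
    """Rescale W so that its L2 norm does not exceed tau."""
    norm = np.linalg.norm(W)       # L2 norm over all entries of W
    if norm > tau:
        return W * (tau / norm)    # constraint violated: project back
    return W                       # constraint satisfied: leave W unchanged


# Hypothetical weights after a gradient update; their L2 norm is about 1.118.
W = np.array([[0.6, -0.8],
              [0.3, 0.4]])

W_loose = project_to_l2_ball(W, tau=1.5)   # already inside the ball
W_tight = project_to_l2_ball(W, tau=1.0)   # rescaled onto the ball
print(np.linalg.norm(W), np.linalg.norm(W_loose), np.linalg.norm(W_tight))
```

Rescaling keeps the direction of the weight vector and changes only its length, so the projected weights sit exactly on the boundary of the constraint.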
61
     Numerical Example
     Consider two inputs, one hidden layer with
     two neurons and one output. NO bias for any
     neuron.
     1. Let initial weights ,
     2. Using back propagation, say we obtained ,
        ,
     3. For learning rate , update
62
     Numerical Example
     4. Updated, new weights ,
     5. Compute Current L2 Norm
     6. If τ=1.5, the constraint is already
        satisfied, and no adjustment is required.
63
     Numerical Example
     7. If τ=1.0, the constraint is violated. Then do
        the following adjustment
64
65
     Appendix: Example Implementation
     Using a back-propagation network, find the new weights for
     the network shown alongside. Input = [0 1] and the target
     output is 1. Use learning rate 0.25 and the binary sigmoidal
     activation function.
66
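     1. Consolidate the information
     Given: inputs [x1 x2] = [0 1], target t = 1.
     [v11 v21 v01] = [0.6 -0.1 0.3]
     [v12 v22 v02] = [-0.3 0.4 0.5]
     [w1 w2 w0] = [0.4 0.1 -0.2]
     Learning rate α = 0.25
     Activation function is binary sigmoidal: $f(x) = \frac{1}{1 + e^{-x}}$
     Derivative: $f'(x) = f(x)\,(1 - f(x))$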
67
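     2. Feed-forward Phase
     1. Hidden unit net input $z_j = v_{0j} + x_1 v_{1j} + x_2 v_{2j}$, j = 1, 2:
        $z_1 = 0.3 + 0(0.6) + 1(-0.1) = 0.2$
        $z_2 = 0.5 + 0(-0.3) + 1(0.4) = 0.9$
     2. Output $Z_j = f(z_j)$, sigmoidal activation function:
        $Z_1 = f(0.2) \approx 0.549$, $Z_2 = f(0.9) \approx 0.711$
     3. Output unit net input:
        $y = w_0 + Z_1 w_1 + Z_2 w_2 = -0.2 + 0.549(0.4) + 0.711(0.1) \approx 0.091$
     4. Output $Y = f(y) \approx 0.523$, sigmoidal activation function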
69
         3. Back-propagation of error Phase
     5. Compute error correction term:
        $\delta = (t - Y)\, f'(y) = (1 - 0.523)(0.523)(1 - 0.523) \approx 0.1191$
     6. Compute change in weights and bias:
        $\Delta w_1 = \alpha\,\delta\,Z_1 = 0.25(0.1191)(0.549) \approx 0.0164$
        $\Delta w_2 = \alpha\,\delta\,Z_2 = 0.25(0.1191)(0.711) \approx 0.0212$
        $\Delta w_0 = \alpha\,\delta = 0.25(0.1191) \approx 0.0298$
     7. Hidden unit:
        $\delta_1 = \delta\, w_1 = 0.1191(0.4) \approx 0.0476$
        $\delta_2 = \delta\, w_2 = 0.1191(0.1) \approx 0.0119$
     8. Calculate error term:
        $\delta_{11} = \delta_1\, f'(z_1) = 0.0476(0.549)(1 - 0.549) \approx 0.0118$
        $\delta_{22} = \delta_2\, f'(z_2) = 0.0119(0.711)(1 - 0.711) \approx 0.00245$
     9. Compute change in weights and bias:
        $\Delta v_{11} = \alpha\,\delta_{11}\,x_1 = 0.0$, $\Delta v_{21} = \alpha\,\delta_{11}\,x_2 \approx 0.00295$, $\Delta v_{01} = \alpha\,\delta_{11} \approx 0.00295$
        $\Delta v_{12} = \alpha\,\delta_{22}\,x_1 = 0.0$, $\Delta v_{22} = \alpha\,\delta_{22}\,x_2 \approx 0.00061$, $\Delta v_{02} = \alpha\,\delta_{22} \approx 0.00061$
78
     4. Weights and Bias update phase
     10. Each output unit, k = 1 to m, updates its weights and bias:
         $w_1 = 0.4 + 0.0164 \approx 0.416$
         $w_2 = 0.1 + 0.0212 \approx 0.121$
         $w_0 = -0.2 + 0.0298 \approx -0.170$
     11. Each hidden unit, j = 1 to p, updates its weights and bias:
         $v_{11} = 0.6$, $v_{21} = -0.1 + 0.00295 \approx -0.097$, $v_{01} = 0.3 + 0.00295 \approx 0.303$
         $v_{12} = -0.3$, $v_{22} = 0.4 + 0.00061 \approx 0.401$, $v_{02} = 0.5 + 0.00061 \approx 0.501$
83
     Epoch   v11   v21      v01     v12    v22     v02
     0       0.6   -0.1     0.3     -0.3   0.4     0.5
     1       0.6   -0.097   0.303   -0.3   0.401   0.501

     Epoch   z1       z2       w1      w2      w0      y
     0       0.549    0.711    0.4     0.1     -0.2    0.523
     1       0.5513   0.7113   0.416   0.121   -0.17   0.5363
     (Here z1 and z2 are the hidden-unit outputs and y is the network output.)
         Write a program for this case and cross-
         verify your answers. After how many epochs
         will the output converge?
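One possible starting point for the cross-check is the sketch below, which performs a single training step with the weights, input, target, and learning rate listed in the tables. The array layout (row i of `V` holds the weights leaving input x_i) and the function name `train_step` are assumptions of this sketch.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


x = np.array([0.0, 1.0])          # input [x1, x2]
t = 1.0                           # target
alpha = 0.25                      # learning rate

V = np.array([[0.6, -0.3],        # v11, v12
              [-0.1, 0.4]])       # v21, v22
v0 = np.array([0.3, 0.5])         # v01, v02
W = np.array([0.4, 0.1])          # w1, w2
w0 = -0.2


def train_step(V, v0, W, w0):
    """One feed-forward pass, one backward pass, and one weight update."""
    Z = sigmoid(x @ V + v0)                  # hidden outputs
    Y = sigmoid(Z @ W + w0)                  # network output
    delta = (t - Y) * Y * (1 - Y)            # output error term
    delta_hidden = delta * W * Z * (1 - Z)   # hidden error terms
    W_new = W + alpha * delta * Z
    w0_new = w0 + alpha * delta
    V_new = V + alpha * np.outer(x, delta_hidden)
    v0_new = v0 + alpha * delta_hidden
    return V_new, v0_new, W_new, w0_new, Z, Y


V_new, v0_new, W_new, w0_new, Z_epoch0, Y_epoch0 = train_step(V, v0, W, w0)
print(np.round(Z_epoch0, 3), round(float(Y_epoch0), 3))
print(np.round(W_new, 3), round(float(w0_new), 3))
print(np.round(V_new, 3), np.round(v0_new, 3))
```

Calling `train_step` repeatedly on its own outputs extends this to multiple epochs, which is one way to explore how quickly the output approaches the target.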
84