MACHINE LEARNING
DR. S. Chitrakala
        Professor
     DCSE/ CEG/ AU
                           Module 3
1.   The Multi-Layer Perceptron
2.   Back Propagation of Error
3.   Multi-layer Perceptron in Practice
4.   Deriving Back Propagation
5.   Applications of MLP
The Multi-Layer Perceptron
                             Multi-layer Perceptron
   Majority of interesting problems are not linearly separable.
   That problems can be made linearly separable if we can work out how to transform
    the features suitably.
   Consider making more complicated networks.
   Learning in the neural network happens in the weights. So, to perform more
    computation it seems sensible to add more weights.
   There are two things that we can do:
        add some backwards connections, so that the output neurons connect to the inputs again, or
        add more neurons.
                                                                                                      18
   The first approach leads into recurrent networks. but are not
    that commonly used.
   We will instead consider the second approach. We can add
    neurons between the input nodes and the outputs, and this
    will make more complex neural networks,
                                                                    19
XOR: Input (1, 0) corresponds to node A being 1 and B being 0.
   The input to neuron C is therefore −1 × 0.5 + 1 × 1 + 0 × 1 = −0.5 +
    1 = 0.5. This is above the threshold of 0, and so neuron C fires,
    giving output 1.
   For neuron D the input is −1 × 1 + 1 × 1 + 0 × 1 = −1 + 1 = 0, and so
    it does not fire, giving output 0.
   Therefore the input to neuron E is −1×0.5+1×1+0×−1 = 0.5, so
    neuron E fires.
                                                                            20
MLP: GOING FORWARD:
   we start at the left by filling in the values for the inputs.
   We then use these inputs and the first level of weights to
    calculate the activations of the hidden layer, and then we use
    those activations and the next set of weights to calculate the
    activations of the output layer.
   Now that we’ve got the outputs of the network, we can compare
    them to the targets and compute the error.
   We need to include a bias input to each neuron.
   By having an extra input that is permanently set to -1, and
    adjusting the weights to each neuron as part of the training.
   Thus, each neuron in the network (whether it is a hidden layer or
    the output) has 1 extra input, with fixed value.
                                                                        21
GOING BACKWARDS: BACK-PROPAGATION OF ERROR
   The error function that we used for the Perceptron was
                     where N is the number of output nodes.
   However, suppose that we make two errors.
        In the first, the target is bigger than the output, while in the second the
         output is bigger than the target.
        If these two errors are the same size, then if we add them up we could
         get 0, which means that the error value suggests that no error was made.
   To get around this we need to make all errors have the same
    sign.
   sum-of-squares error function
                                                                                       22
   If we differentiate a error function, then it tells us the gradient of
    that function, which is the direction along which it increases and
    decreases the most.
   Since the purpose of learning is to minimize the error, following
    the error function downhill (in other words, in the direction of the
    negative gradient) will give us what we want.
                                                                             23
THE MULTI-LAYER PERCEPTRON ALGORITHM
   There are
        L input nodes, plus the bias,
        M hidden nodes, also plus a bias, and
        N output nodes,
        So that there are (L+1)×M weights between the input and the hidden layer
         and (M+1)×N between the hidden layer and the output.
   The sums that we write will start from 0 if they include the bias
    nodes and 1 otherwise, and run up to L,M, or N, so that x0 = −1 is
    the bias input, and a0 = −1 is the bias hidden node.
   The algorithm that is described could have any number of hidden
    layers, in which case there might be several values for M, and
    extra sets of weights between the hidden layers.
   We will also use i, j, k to index the nodes in each layer                       24
1. An input vector is put into the input nodes
2. The inputs are fed forward through the network
      The inputs and the first-layer weights (here labelled as v) are used to
       decide whether the hidden nodes fire or not. The activation function
       g(·) is the sigmoid function given in Equation (4.2) above
      The outputs of these neurons and the second-layer weights (labelled
       as w) are used to decide if the output neurons fire or not
3. The error is computed as the sum-of-squares difference
between the network outputs and the targets
4. This error is fed backwards through the network in order to
      first update the second-layer weights
      and then afterwards, the first-layer weights
                                                                                 25
26
27
28
Back Propagation Algorithm
INITIALIZING THE WEIGHTS
   Each neuron is getting input from n different places (either input
    nodes if the neuron is in the hidden layer, or hidden neurons if it
    is in the output layer).
   If we view the values of these inputs as having uniform variance,
    then the typical input to the neuron will be
   where w is the initialization value of the weights. So a common
    trick is to set the weights in the range
   where n is the number of nodes in the input layer to those
    weights.
   This makes the total input to a neuron have a maximum size of
    about 1.
                                                                          53
   Further, if the weights are large, then the activation of a neuron is
    likely to be at, or close to, 0 or 1 already, which means that the
    gradients are small, and so the learning is very slow.
   There is an interplay here with the value of         in the logistic
    function, which means that small values of β (say β= 3.0 or less)
    are more effective.
   We use random values for the initialization so that the learning
    starts off from different places for each run, and we keep them all
    about the same size because we want all of the weights to reach
    their final values at about the same time. This is known as uniform
    learning and it is important because otherwise the network will
    do better on some inputs than others.
                                                                            54
DIFFERENT OUTPUT ACTIVATION FUNCTIONS
   In the algorithm described above, we used sigmoid neurons in the
    hidden layer and the output layer. This is fine for classification
    problems, since there we can make the classes be 0 and 1.
   However, we might also want to perform regression problems,
    where the output needs to be from a continuous range, not just 0
    or 1.
   The sigmoid neurons at the output are not very useful in that
    case.
   We can replace the output neurons with linear nodes that just
    sum the inputs and give that as their activation (so g(h) = h ).
   This does not mean that we change the hidden layer neurons;
    they stay exactly the same, and we only modify the output nodes.
                                                                         55
   They are not models of neurons anymore, since they don’t have
    the characteristic fire/don’t fire pattern.
   They enable us to solve regression problems, where we want a
    real number out, not just a 0/1 decision.
   Soft-max activation function is most commonly used for
    classification problems where the 1-of-N output encoding is used
   The soft-max function rescales the outputs by calculating the
    exponential of the inputs to that neuron, and dividing by the total
    sumof the inputs to all of the neurons, so that the activations sum
    to 1 and all lie between 0 and 1.
   As an activation function it can be written as:
                                                                          56
   If we change the activation function, then the derivative of the
    activation function will also change, and so the learning rule will be
    different.
   For the linear activation function
   Derivative of this is
   We can modify the error function as well, to have the cross-
    entropy form (where ln is the natural logarithm):
                                                                             57
    SEQUENTIAL AND BATCH TRAINING
   All of the training examples are presented to the neural network,
    the average sum-of-squares error is then computed, and this is
    used to update the weights.
   Thus there is only one set of weight updates for each epoch (pass
    through all the training examples).
   This means that we only update the weights once for each
    iteration of the algorithm, which means that the weights are
    moved in the direction that most of the inputs want them to
    move, rather than being pulled around by each input individually.
   The batch method performs a more accurate estimate of the
    error gradient, and will converge to the local minimum more
    quickly.
                                                                        58
   The algorithm that was described earlier was the sequential version,
    where the errors are computed and the weights updated after each
    input.
   This is not guaranteed to be as efficient in learning, but it is simpler
    to program when using loops, and it is therefore much more
    common.
   Since it does not converge as well, it can also sometimes avoid local
    minima, thus potentially reaching better solutions.
   randomising the order of the input vectors at each iteration. This can
    significantly improve the speed with which the algorithm learns.
   NumPy     has    a   useful    function   that    assists   with   this,
    np.random.shuffle(), which takes a list of numbers and reorders
    them.
                                                                               59
LOCAL MINIMA
   There may be a much lower point over the next hill, but the ball can’t
    see that, and it doesn’t have enough energy to climb over the hill and
    find the global minimum
   We don’t know where the global minimum is because we don’t know
    what the error landscape looks like;
   we can only compute local features of it for the place we are in at the
    moment.
   Which minimum we end up in depends on where we start.
   If we begin near the global minimum, then we are very likely to end
    up in it, but if we start near a local minimum we will probably end up
    there.
                                                                              60
   In addition, how long it will take to get to the minimum that we do
    find depends upon the exact appearance of the landscape at the
    current point.
   We can make it more likely that we find the global minimum by
    trying out several    different starting points by training several
    different networks, and this is commonly done.
   However, we can also try to make it less likely that the algorithm will
    get stuck in local minima.
                                                                              61
PICKING UP MOMENTUM
   The reason that the ball stops rolling is because it runs out of energy
    at the bottom of the dip.
   If we give the ball some weight, then it will generate momentum as it
    rolls, and so it is more likely to overcome a small hill on the other
    side of the local minimum, and so more likely to find the global
    minimum.
   We can implement this idea in our neural network learning by adding
    in some contribution from the previous weight change that we made
    to the current one.
   In two dimensions it will mean that the ball rolls more directly
    towards the valley bottom, since on average that will be the correct
    direction, rather than being controlled by the local changes
                                                                              62
   There is another benefit to momentum. It makes it possible to use a
    smaller learning rate, which means that the learning is more stable.
   There is another benefit to momentum. It makes it possible to use a
    smaller learning rate, which means that the learning is more stable.
   where t is used to indicate the current update and t − 1 is the
    previous one. 0 < α < 1 is the momentum constant. Typically a value
    of = 0.9 is used.
   Another thing that can be added is known as weight decay. This
    reduces the size of the weights as the number of iterations increases.
   After each learning iteration through all of the input patterns, every
    weight is multiplied by some constant 0 < ε < 1. This makes the
    network simpler and can often produce improved results,
                                                                             63
MINIBATCHES AND STOCHASTIC GRADIENT DESCENT
   Batch algorithm converges to a local minimum faster than the sequential algorithm,
    which computes the error for each input individually and then does a weight update,
    but that the latter is sometimes less likely to get stuck in local minima.
   The reason for both of these observations is that the batch algorithm makes a better
    estimate of the steepest descent direction, so that the direction it chooses to go is a
    good one, but this just leads to a local minimum.
   The idea of a minibatch method is by splitting the training set into random batches,
    estimating the gradient based on one of the subsets of the training set, performing a
    weight update, and then using the next subset to estimate a new gradient and using
    that for the weight update, until all of the training set have been used.
                                                                                              64
   a single input vector is chosen from the training set, and the output
    and hence the error for that one vector computed, and this is used
    to estimate the gradient and so update the weights.
   A new random input vector (which could be the same as the
    previous one) is then chosen and the process repeated.
   This is known as stochastic gradient descent, and can be used for
    any gradient descent problem, not just the MLP.
   It is often used if the training set is very large, since it would be
    very expensive to use the whole dataset to estimate the gradient in
    that case.
                                                                            65
   OTHER IMPROVEMENTS: One is to reduce the learning rate as the
    algorithm progresses. The reasoning behind this is that the network
    should only be making large-scale changes to the weights at the
    beginning, when the weights are random; if it is still making large
    weight changes later on, then something is wrong.
   Second derivatives of the error with respect to the weights
                                                                          66
      Multi-layer Perceptron in
              Practice
1. Amount of Training Data
2. Number of Hidden Layers
3. When to Stop Learning
THE MULTI-LAYER PERCEPTRON IN PRACTICE: AMOUNT OF TRAINING DATA :
   For the MLP with one hidden layer there are (L + 1) ×M + (M + 1) × N weights,
    where L,M,N are the number of nodes in the input, hidden, and output
    layers, respectively.
   The extra +1s come from the bias nodes, which also have adjustable weights.
   This is a potentially huge number of adjustable parameters that we need to
    set during the training phase.
   Setting the values of these weights is the job of the back-propagation
    algorithm, which is driven by the errors coming from the training data.
   Clearly, the more training data there is, the better for learning, although the
    time that the algorithm takes to learn increases.
                                                                                      68
   Unfortunately, there is no way to compute what the minimum
    amount of data required is, since it depends on the problem.
   A rule of thumb that has been around for almost as long as the
    MLP itself is that you should use a number of training examples
    that is at least 10 times the number of weights.
   This is probably going to be a very large number of examples, so
    neural network training is a fairly computationally expensive
    operation, because we need to show the network all of these
    inputs lots of times.
NUMBER OF HIDDEN LAYERS:
   it is possible to show mathematically that one hidden layer with
    lots of hidden nodes is sufficient. This is known as the Universal
    Approximation Theorem;
                                                                         69
   You just have to experiment by training networks with different
    numbers of hidden nodes and then choosing the one that gives the
    best results
   we will never normally need more than two layers (that is, one
    hidden layer and the output layer). This is because we can
    approximate any smooth functional mapping using a linear
    combination of localised sigmoidal functions.
   The basic idea is that by combining sigmoid functions we can
    generate ridge-like functions, and by combining ridge-like functions
    we can generate functions with a unique maximum.
   By combining these and transforming them using another layer of
    neurons, we obtain a localised response (a ‘bump’ function), and any
    functional mapping can be approximated to arbitrary accuracy using
                                                                           70
71
WHEN TO STOP LEARNING
                        72
           Applications of MLP
A Regression Problem
Classification with the MLP
A Classification Example: The Iris Dataset
Time-Series Prediction
Data Compression: The Auto-Associative Network
               EXAMPLES OF USING THE MLP
A REGRESSION PROBLEM
   We will take a set of samples generated by a simple
    mathematical function, and try to learn the generating function
    (that describes how the data was made) so that we can find the
    values of any inputs, not just the ones we have training data for.
                                                                         74
75
76
CLASSIFICATION WITH THE MLP
   There are a couple of choices for the outputs. The first is to use a
    single linear node for the output, y, and put some thresholds on
    the activation value of that node.
   A more suitable output encoding is called 1-of-N encoding.
   A separate node is used to represent each possible class, and the
    target vectors consist of zeros everywhere except for in the one
    element that corresponds to the correct class, e.g., (0, 0, 0, 1, 0, 0)
    means that the correct result is the 4th class out of 6.
   We are therefore using binary output values (we want each output
    to be either 0 or 1).
                                                                              77
   Simply choose the element yk of the output vector that is the largest
    element of y (in mathematical notation, pick the yk for which
    o                    so   this statement says pick the yk that is bigger
    than all other possible values yj ).
   This generates an unambiguous decision, since it is very unlikely that
    two output neurons will have identical largest output values.
   This is known as the hard-max activation function (since the neuron
    with the highest activation is chosen to fire and the rest are ignored).
   An alternative is the soft-max function, which has the effect of
    scaling the output of each neuron according to how large it is in
    comparison to the others, and making the total output sum to 1.
                                                                               78
A CLASSIFICATION EXAMPLE: THE IRIS DATASET
   Classifying examples of three types of iris (flower) by the length and
    width of the sepals and petals and is called iris.
                                                                             79
80
81
TIME-SERIES PREDICTION
   The target data for training the neural network is simple, because it
    comes from further up the time-series, and so training is easy.
    Suppose that = 2 and k = 3.
                                                                            82
   2855 data points. To minimise the sum-of-squares error. Since
    there are no classes, the confusion matrix is not useful
                                                                    83
84
DATA COMPRESSION: THE AUTO-ASSOCIATIVE NETWORK
   Suppose that we train the network to reproduce the inputs at the
    output layer (called auto-associative learning; sometimes the
    network is known as an autoencoder).
   The network is trained so that whatever you show it at the input is
    reproduced at the output.
   This bottleneck hidden layer has to represent all of the information in
    the input, so that it can be reproduced at the output.
   It therefore performs some compression of the data, representing it
    using fewer dimensions than were used in the input.
   They are finding a different (often lower dimensional) representation
    of the input data that extracts important components of the data,
    and ignores the noise.
                                                                              85
86
87
                    A RECIPE FOR USING THE MLP
   Select inputs and outputs for your problem
   Normalise inputs
   Split the data into training, testing, and validation sets
   Select a network architecture
   Train a network
   Test the network
                                                                 88