UNIT II LINEAR MODELS
Multi-layer Perceptron – Going Forwards – Going Backwards: Back Propagation Error –
Multilayer Perceptron in Practice – Examples of using the MLP – Overview – Deriving
BackPropagation – Radial Basis Functions and Splines – Concepts – RBF Network – Curse of
Dimensionality – Interpolations and Basis Functions – Support Vector Machines.
Multi-layer Perceptron
• Parametric nonlinear function approximates f(x; Θ) for classification or regression.
• Originally inspired by neuronal circuits in the brain (McCullough-Pitts neurons, Rosenblatt’s
perceptron, etc.), and by the brain’s ability to solve intelligent problems (visual and speech
perception, navigation, planning, etc.). Synapses between neurons = MLP weight parameters
(which are modified during learning). Neuron firing = MLP unit nonlinearity.
• They are usually implemented in software in serial computers and more recently in parallel or
distributed computers. There also exist VLSI implementations.
11.2 The perceptron
Linear perceptron: computes a linear function y = PD d=1 wdxd + w0 ∈ R of an input vector x ∈
R D with weight vector w ∈ R D (or y = wT x with w ∈ R D+1 and we augment x with a 0th
component of value 1). To have K outputs: y = Wx with W ∈ R K×(D+1)
For classification, we can use:
– Two classes: y = σ(w T x) = 1 1 + exp (−wT x) ∈ (0, 1) (logistic)
. – K > 2 classes: yk = exp (wT k x) PK j=1 exp (wT j x) ∈ (0, 1), k = 1, . . . , K with PK k=1 yk
= 1 (softmax).
Training a perceptron
Apply stochastic gradient descent: to minimize the error E(w) = PN n=1 e(w; xn, yn), repeatedly
update the weights w ← w +∆w with ∆w = −η ∇e(w; xn, yn) and n = 1, . . . , N (one epoch).
– Regression by least-squares error: e(w; xn, yn) = 1 2 (yn−wT xn) 2 ⇒ ∆w = η(yn−wT xn) xn. –
Classification by maximum likelihood, or equivalently cross-entropy:
∗ For two classes: yn ∈ {0, 1} and e(w; xn, yn) = −yn log θn −(1−yn) log (1 − θn) where θn =
σ(wT xn) ⇒ ∆w = η(yn − θn) xn.
∗ For K > 2 classes: yn ∈ {0, 1} K coded as 1-of-K and e(w; xn, yn) = − PK k=1 ykn log θkn
Fig. 11.3 pseudocode where θkn = exp (wT k x) PK j=1 exp (wT j x) ⇒ ∆wkd = η(ykn−θkn) xdn
for d = 0, . . . , D and k = 1, . . . , K.
The original perceptron algorithm was a variation of stochastic gradient descent. For linearly
separable problems, it converges in a finite (possibly large) number of iterations. For problems
that are not linearly separable problems, it never converges.
Learning Boolean functions
• Boolean function: {0, 1} D → {0, 1}. Maps a vector of D bits to a single bit (truth value).
• Can be seen as a binary classification problem where the input instances are binary vectors.
• Since a perceptron can learn linearly separable problems, it can learn AND, OR but not XOR.
Multilayer perceptrons (MLPs
• Multilayer perceptron (or feedforward neural net): nested sequence of perceptrons, with an
input layer, an output layer and zero or more hidden layers: x perceptrons −−−−−−→ ·
perceptrons −−−−−−→ . . . perceptrons −−−−−−→ y
• It can represent nonlinear discriminants (for classification) or functions (for regression).
• Architecture of the MLP: each layer has several units. Each unit h takes as input the output z of
the previous layer’s units and applies to it a linear function wT h z (using a weight vector wh,
including a bias term) followed by a nonlinearity s(t): output of unit h = s(wT h z)
Typical nonlinearity functions s(t) used
Logistic function: s(t) = 1 1+e−t (or softmax).
– Hyperbolic tangent: s(t) = tanh t = e t−e −t e t+e−t .
– Rectified linear unit (ReLU): s(t) = max (0, t).
– Step function: s(t) = 0 if t < 0, else 1.
– Identity function: s(t) = t (no nonlinearity)
The output layer uses as nonlinearity:
– For regression: the identity (so the outputs can take any real value).
– For classification: the sigmoid and a single unit (K = 2 classes), or the softmax and K units (K
> 2 classes). All nonlinearities (sigmoid, etc.) have a similar universal approximation ability, but
some (e.g. ReLU) are easier to optimize than others.
Ex: an MLP with one hidden layer having H units, with inputs x ∈ R D and outputs y ∈ R D′ ,
where the hidden units are sigmoidal and the output units are linear
These are generally known as going forwards and backwards through the network.
GOING FORWARDS
We’ve already seen how to go forwards for the MLP when we saw the XOR example above,
which was effectively the recall phase of the algorithm. It is pretty much just the same asthe
Perceptron, except that we have to do it twice, once for each set of neurons, and we need to do it
layer by layer, because otherwise the input values to the second layer don’t exist. In fact, having
made an MLP with two layers of nodes, there is no reason why we can’t make one with 3, or 4,
or 20 layers of nodes (we’ll discuss whether or not you might want to in Section 4.3.2). This
won’t even change our recall (forward) algorithm much, since we just work forwards through the
network computing the activations of one layer of neurons and using those as the inputs for the
next layer.
we start at the left by filling in the values for the inputs. We then use these inputs and the first
level of weights to calculate the activations of the hidden layer, and then we use those activations
and the next set of weights to calculate the activations of the output layer. Now that we’ve got
the outputs of the network, we can compare them to the targets and compute the error.
Biases
We need to include a bias input to each neuron. We do this in the same way as we didfor the
Perceptron in Section 3.3.2, by having an extra input that is permanently set to -1,and adjusting
the weights to each neuron as part of the training. Thus, each neuron in thenetwork (whether it is
a hidden layer or the output) has 1 extra input, with fixed value.
GOING BACKWARDS: BACK-PROPAGATION OF ERROR
It is in the backwards part of the algorithm that things get tricky. Computing the error sat the
output is no more difficult than it was for the Perception, but working out what to-do with those
errors is more difficult. The method that we are going to look at is called back-propagation of
error, which makes it clear that the errors are sent backwards through the network.
The best way to describe back-propagation properly is mathematically, but this can be
intimidating and difficult to get a handle on at first. I’ve therefore tried to compromise by using
words and pictures in the main text, but putting all of the mathematical details into Section 4.6.
While you should look at that section and try to understand it, it can be skipped if you really
don’t have the background. Although it looks complicated, there are actually just three things
that you need to know, all of which are from differential calculus: the derivative of 12x2, the fact
that if you differentiate a function of x with respect to some other variable t, then the answer is 0,
and the chain rule, which tells you how to differentiate composite functions.
The error function that we used for the Perceptron was PNk=1 Ek =PNk=1 yk −tk, whereN is the
number of output nodes. However, suppose that we make two errors. In the first, the target is
bigger than the output, while in the second the output is bigger than the target. If these two errors
are the same size, then if we add them up we could get 0, which means that the error value
suggests that no error was made.
.
There are only three things in the network that change: the inputs, the activation function that
decides whether or not the node fires, and the weights. The first and second are out of our control
when the algorithmis running, so only the weights matter, and therefore they are what we
differentiate with respect to.
which is the hyperbolic tangent function. This is a different but similar function; it is still a
sigmoid function, but it saturates (reaches its constant values) at ±1 instead of 0 and 1, which is
sometimes useful. It also has a relatively simple derivative: ddx tanh x = (1 − tanh2(x)).We can
convert between the two easily, because if the saturation points are (±1), then we can convert to
(0, 1) by using 0.5 × (x + 1).
• for the output neurons, we don’t know the inputs.
• for the hidden neurons, we don’t know the targets; for extra hidden layers, we knowneither the
inputs nor the targets, but even this won’t matter for the algorithm wederive.
If we use the chain rule of differentiation that you all (possibly) remember fromhigh school then
we can get around this problem. Here, the chain rule tells us that if wewant to know how the
error changes as we vary the weights, we can think about how the error changes as we vary the
inputs to the weights, and multiply this by how those input values change as we vary the weights.
This is useful because it lets us calculate all of the derivatives that we want to: we can write the
activations of the output nodes in terms of the activations of the hidden nodes and the output
weights, and then we can send the error calculations back through the network to the hidden
layer to decide what the target outputs were for those neurons. Note that we can do exactly the
same computations if the network has extra hidden layers between the inputs and the outputs. It
gets harder to keep track of which functions we should be differentiating, but there are no new
tricks needed.
All of the relevant equations are derived in Section 4.6, and you should read that section
carefully, since it is quite difficult to describe exactly what is going on here in words. The
important thing to understand is that we compute the gradients of the errors with respect to the
weights, so that we change the weights so that we go downhill, which makes the errors get
smaller.
This leads to two different update functions, one for each of the sets of weights, and we just
apply these backwards through the network, starting at theoutputs and ending up back at the
inputs.
THE MULTI-LAYER PERCEPTRON IN PRACTICE
MLP is widely used for solving problems that require supervised learning as well as
research into computational neuroscience and parallel distributed processing. Applications
include speech recognition, image recognition and machine translation.
Amount of Training Data
For the MLP with one hidden layer there are (L + 1) × M + (M + 1) × N weights,
whereL, M, N are the number of nodes in the input, hidden, and output layers,
respectively.
The extra +1s come from the bias nodes, which also have adjustable weights. This is
potentially huge number of adjustable parameters that we need to set during the training
phase. Setting the values of these weights is the job of the back-propagation algorithm,
which is driven by the errors coming from the training data. Clearly, the more training
data there is, the better for learning, although the time that the algorithm takes to learn
increases.
A rule of thumb that has been around for almost as long as the MLP itself is that you
should use a number of training examples that is at least 10 times the number of weights.
This is probably going to be a very large number of examples, so neural network training
is a fairly computationally expensive operation, because we need to show the network all
of these inputs lots of times.
Number of Hidden Layers
There are two other considerations concerning the number of weights that are inherent
ithe calculation above, which is the choice of the number of hidden nodes, and the number
ofhidden layers. Making these choices is obviously fundamental to the successful
applicationof the algorithm. We will shortly see a pictorial demonstration of the fact that
two hiddenlayers is the most that you ever need for normal MLP learning. In fact, this
result canbe strengthened: it is possible to show mathematically that one hidden layer with
lots ofhidden nodes is sufficient.
Two hidden layers are sufficient to compute these bump functions for different inputs and so if
the function that we want to learn (approximate) is continuous, the network can compute it. It
can therefore approximate any decision boundary, not just the linear one that the Perceptron
computed.
EXAMPLES OF USING THE MLP
The MLP is rather too complicated to enable us to work through the weight changes as we did
with the Perceptron.
Instead, we shall look at some demonstrations of how to make the network learn about some
data. As was mentioned above, we shall look at the four types of problems that are generally
solved using an MLP: regression, classification, time-series prediction, and data
compression/data denoising.
1 A Regression Problem
The regression problem we will look at is a very simple one.
ones we have training data for.The function that we will use is a very simple one, just a
bit of a sine wave. We’ll makethe data in the following way (make sure that you have
NumPy imported as np first):
x = np.ones((1,40))*np.linspace(0,1,40)
t = np.sin(2*np.pi*x) + np.cos(4*np.pi*x) + np.random.randn(40)*0.2
x = x.T
t = t.T
The reason why we have to use the reshape() method is that NumPy defaults to lists for
arrays that are N ×1; compare the results of the np.shape() calls below, and the effectof the
transpose operator .T on the array
>>> x = np.linspace(0,1,40)
>>> np.shape(x)
(40,)
>>> np.shape(x.T)
(40,)
>>>
>>> x = np.linspace(0,1,40).reshape((1,40))
>>> np.shape(x)
(1, 40)
>>> np.shape(x.T)
(40, 1)
You can plot this data to see what it looks like
>>> import pylab as pl
>>> pl.plot(x,t,’.’)
There is one input value, x and one output value t, so the neural network will have one input
and one output. Also, because we want the output to be the value of the function, rather than 0
or 1, we will use linear neurons at the output.
Before getting started, we need to normalise the data using the method shown in Sec-tion 3.4.5,
and then separate the data into training, testing, and validation sets. For this example there are
only 40 data points, and we’ll use half of them as the training set, although that isn’t very many
and might not be enough for the algorithm to learn effectively. We can split the data in the ratio
50:25:25 by using the odd-numbered elements as training data, the even-numbered ones that do
not divide by 4 for testing, and the rest for validation:
train = x[0::2,:]
test = x[1::4,:]
valid = x[3::4,:]
traintarget = t[0::2,:]
testtarget = t[1::4,:]
validtarget = t[3::4,:]
With that done, it is just a case of making and training the MLP. To start with, we willconstruct
a network with three nodes in the hidden layer, and run it for 101 iterationswitha learning rate
of 0.25,
>>> import mlp
>>> net = mlp.mlp(train,traintarget,3,outtype=’linear’)
>>> net.mlptrain(train,traintarget,0.25,101)