9. DEEP NETWORKS
9.1 Deep Feedforward Networks
Deep feedforward networks, also often called feedforward neural networks or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*.
For example:
For a classifier, y = f ∗(x) maps an input x to a category y.
A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that
result in the best function approximation.
These models are called feedforward because information flows through the function being
evaluated from x, through the intermediate computations used to define f, and finally to the output
y.
There are no feedback connections in which outputs of the model are fed back into itself.
When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.
Feedforward networks are of extreme importance to machine learning practitioners.
Feedforward neural networks are called networks because they are typically represented by
composing together many different functions.
The model is associated with a directed acyclic graph describing how the functions are composed
together.
For example:
o We might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))).
o These chain structures are the most commonly used structures of neural networks.
o In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on.
o The overall length of the chain gives the depth of the model. It is from this terminology that the
name “deep learning” arises.
o The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x).
9.2 Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an example of a fully
functioning feedforward network on a very simple task: learning the XOR function.
The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2.
When exactly one of these binary values is equal to 1, the XOR function returns 1.
Otherwise, it returns 0. The XOR function provides the target function y = f ∗(x) that we want to learn.
Our model provides a function y = f(x;θ) and our learning algorithm will adapt the parameters θ to make f
as similar as possible to f ∗.
We want our network to perform correctly on the four points X = {[0, 0]ᵀ, [0, 1]ᵀ, [1, 0]ᵀ, [1, 1]ᵀ}. We will train the network on all four of these points. The only challenge is to fit the training set.
We can treat this problem as a regression problem and use a mean squared error loss function.
We choose this loss function to simplify the math for this example as much as possible.
Evaluated on our whole training set, the MSE loss function is
J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))^2.
Figure: Solving the XOR problem by learning a representation
Figure: An example of a feedforward network, drawn in two different styles
o In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU, defined by the activation function g(z) = max{0, z} and depicted in the figure below.
Figure: The rectified linear activation function
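To make this concrete, here is a minimal NumPy sketch (an illustration, not the book's code) that evaluates a one-hidden-layer ReLU network on all four XOR inputs and computes its MSE loss; the particular values of W, c, w, and b are one well-known exact solution, assumed here rather than learned:

    import numpy as np

    # The four XOR inputs and their targets.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # shape (4, 2)
    y = np.array([0, 1, 1, 0])                       # f*(x) for each row of X

    # One well-known exact solution for a network with one hidden layer
    # of two ReLU units: f(x) = w^T max(0, W^T x + c) + b.
    W = np.array([[1, 1],
                  [1, 1]])
    c = np.array([0, -1])
    w = np.array([1, -2])
    b = 0

    h = np.maximum(0, X @ W + c)   # hidden layer: ReLU(W^T x + c) for every example
    y_hat = h @ w + b              # linear output layer

    mse = np.mean((y - y_hat) ** 2)
    print(y_hat)   # [0. 1. 1. 0.] -- the network reproduces XOR exactly
    print(mse)     # 0.0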
9.3 Gradient-Based Learning
For feedforward neural networks, it is important to initialize all weights to small random values.
The biases may be initialized to zero or to small positive values.
Iterative gradient-based optimization algorithms are used to train feedforward networks and almost all other deep models.
a) Cost Functions:
An important aspect of the design of a deep neural network is the choice of the cost
function.
Fortunately, the cost functions for neural networks are more or less the same as those for
other parametric models, such as linear models.
In most cases, our parametric model defines a distribution p(y | x;θ ) and we simply use the
principle of maximum likelihood.
This means we use the cross-entropy between the training data and the model’s predictions
as the cost function.
Learning Conditional Distributions with Maximum Likelihood:
Most modern neural networks are trained using maximum likelihood.
This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution.
This cost function is given by
J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).
If, for example, p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,
J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||^2 + const,
up to a scaling factor of 1/2 and a term that does not depend on θ.
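As a quick numerical illustration of that last point (a minimal sketch; the unit-variance Gaussian p_model(y | x) = N(y; f(x; θ), I) is an assumption of the example), the Gaussian negative log-likelihood and half the squared error differ only by a constant:

    import numpy as np

    y     = np.array([0.2, -1.0, 0.5])   # a target vector
    y_hat = np.array([0.1, -0.8, 0.7])   # the model's predicted mean f(x; theta)
    d = y.size

    # Negative log-likelihood of y under N(y; y_hat, I).
    nll = 0.5 * np.sum((y - y_hat) ** 2) + 0.5 * d * np.log(2 * np.pi)

    half_sq_err = 0.5 * np.sum((y - y_hat) ** 2)

    # The difference is a constant that does not depend on theta.
    print(nll - half_sq_err)             # 0.5 * d * log(2*pi), independent of y_hat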
Learning Conditional Statistics:
Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.
For example:
o We can design the cost functional to have its minimum lie on the function that maps x to the
expected value of y given x.
o Solving an optimization problem with respect to a function requires a mathematical tool called the calculus of variations.
Output Units
The choice of cost function is tightly coupled with the choice of output unit.
i) Linear Units for Gaussian Output Distributions
o One simple kind of output unit is an output unit based on an affine transformation with no
nonlinearity.
o These are often just called linear units. Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.
o Linear output layers are often used to produce the mean of a conditional Gaussian distribution: p(y | x) = N(y; ŷ, I).
ii) Sigmoid Units for Bernoulli Output Distributions:
o The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x
o A Bernoulli distribution is defined by just a single number.
o The neural net needs to predict only P(y = 1 | x).
o For this number to be a valid probability, it must lie in the interval [0, 1].
o Satisfying this constraint requires some careful design effort.
Suppose we were to use a linear unit and threshold its value to obtain a valid probability: P(y = 1 | x) = max{0, min{1, wᵀh + b}}. This defines a valid conditional distribution, but gradient descent trains it poorly: whenever wᵀh + b strays outside the interval [0, 1], the gradient is zero. Instead, a sigmoid output unit uses ŷ = σ(wᵀh + b), where σ is the logistic sigmoid function.
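A minimal sketch of the resulting maximum-likelihood loss (the function names are illustrative, and the softplus identity ζ(x) = log(1 + exp(x)) is assumed): for a sigmoid unit with logit z = wᵀh + b and target y ∈ {0, 1}, the negative log-likelihood can be written as ζ((1 − 2y)z), which is numerically stable:

    import numpy as np

    def softplus(x):
        # zeta(x) = log(1 + exp(x)), computed stably for large |x|.
        return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

    def sigmoid_nll(z, y):
        # Negative log-likelihood -log P(y | x) for a Bernoulli output unit
        # with logit z and target y in {0, 1}:
        #   -log sigma(z)       = softplus(-z)  when y = 1,
        #   -log (1 - sigma(z)) = softplus(z)   when y = 0,
        # which combine into softplus((1 - 2y) * z).
        return softplus((1 - 2 * y) * z)

    z = np.array([5.0, -3.0, 0.0])
    y = np.array([1,    0,    1  ])
    print(sigmoid_nll(z, y))   # small loss wherever the sign of z matches y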
iii) Softmax Units for Multinoulli Output Distributions
o Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.
o In the case of binary variables, we wished to produce a single number ŷ = P(y = 1 | x). Softmax generalizes this to a vector ŷ with ŷ_i = P(y = i | x), computed as softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
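A minimal numerically stable softmax sketch (subtracting max(z) before exponentiating is standard and leaves the result unchanged, since softmax is invariant to adding a scalar to every input):

    import numpy as np

    def softmax(z):
        # softmax(z)_i = exp(z_i) / sum_j exp(z_j), stabilized by
        # subtracting max(z), which does not change the output.
        shifted = z - np.max(z)
        e = np.exp(shifted)
        return e / np.sum(e)

    z = np.array([2.0, 1.0, 0.1])   # unnormalized log probabilities for 3 classes
    p = softmax(z)
    print(p, p.sum())               # a valid probability distribution summing to 1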
9.4 Hidden Units
Rectified linear units are an excellent default choice of hidden unit.
Many other types of hidden units are available.
The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then
training a network with that kind of hidden unit and evaluating its performance on a validation set.
Rectified Linear Units and Their Generalizations
o Rectified linear units use the activation function g(z) = max{0, z}.
o Rectified linear units are easy to optimize because they are so similar to linear units.
o One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.
A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere.
Three generalizations of rectified linear units are based on using a non-zero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i).
Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition from images, where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable.
A leaky ReLU fixes α_i to a small value like 0.01, while a parametric ReLU, or PReLU, treats α_i as a learnable parameter.
Maxout units generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:
g(z)_i = max_{j ∈ G(i)} z_j,
where G(i) is the set of indices into the inputs for group i, {(i − 1)k + 1, . . . , ik}. A sketch of these variants follows below.
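These generalizations differ only in the choice of α_i, as this minimal sketch illustrates (the function names are illustrative; maxout is shown for a 1-D z split into contiguous groups of k values):

    import numpy as np

    def relu_family(z, alpha):
        # h_i = max(0, z_i) + alpha_i * min(0, z_i); a PReLU would learn alpha.
        return np.maximum(0, z) + alpha * np.minimum(0, z)

    z = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu_family(z, 0.0))    # ReLU:           [ 0.    0.    0.   1.5]
    print(relu_family(z, 0.01))   # leaky ReLU:     [-0.02 -0.005 0.   1.5]
    print(relu_family(z, -1.0))   # absolute value: [ 2.    0.5   0.   1.5]

    def maxout(z, k):
        # Divide z into groups of k values and output the max of each group.
        return z.reshape(-1, k).max(axis=1)

    print(maxout(np.array([3.0, -1.0, 0.5, 2.0]), k=2))   # [3. 2.]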
Logistic Sigmoid and Hyperbolic Tangent
o Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid
activation function g(z) = σ(z) or the hyperbolic tangent activation function g(z) = tanh(z).
o These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
9.5 Architecture Design
The word architecture refers to the overall structure of the network:
How many units it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers.
Most neural network architectures arrange these layers in a chain structure, with each layer being
a function of the layer that preceded it.
In this structure, the first layer is given by h^(1) = g^(1)(W^(1)ᵀ x + b^(1)), the second layer by h^(2) = g^(2)(W^(2)ᵀ h^(1) + b^(2)), and so on.
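A minimal sketch of such a chain-structured forward pass (the layer widths, the tanh hidden activation, and the initialization scale are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    widths = [3, 4, 4, 1]   # input dim, two hidden layers, output dim (illustrative)

    # Small random weights, zero biases (see the initialization advice above).
    Ws = [rng.normal(0, 0.1, size=(m, n)) for m, n in zip(widths[:-1], widths[1:])]
    bs = [np.zeros(n) for n in widths[1:]]

    def forward(x):
        # h(1) = g(W(1)^T x + b(1)), h(2) = g(W(2)^T h(1) + b(2)), ...
        h = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = np.tanh(h @ W + b)       # each layer is a function of the previous one
        return h @ Ws[-1] + bs[-1]       # linear output layer

    print(forward(np.array([1.0, -1.0, 0.5])))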
Universal Approximation Properties and Depth
The universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.
The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well.
9.6 Back-Propagation and Other Differentiation Algorithms
The back-propagation algorithm, often simply called backprop, allows the information from the cost to flow backwards through the network in order to compute the gradient. Back-propagation refers only to the method for computing the gradient; another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.
Computational Graphs
To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.
Many ways of formalizing computation as graphs are possible.
To formalize our graphs, we also need to introduce the idea of an operation.
An operation is a simple function of one or more variables.
Our graph language is accompanied by a set of allowable operations.
Functions more complicated than the operations in this set may be described by composing many
operations together.
Chain Rule of Calculus
The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. If y = g(x) and z = f(g(x)) = f(y), then dz/dx = (dz/dy)(dy/dx).
Back-propagation is an algorithm that computes the chain rule, with a specific order of operations
that is highly efficient.
Figure: Examples of computational graphs
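As a quick numerical check of the chain rule on a tiny composition (a sketch; f and g are chosen arbitrarily):

    import numpy as np

    # z = f(g(x)) with g(x) = x**2 and f(y) = sin(y).
    x = 1.3
    y = x ** 2
    z = np.sin(y)

    analytic = np.cos(y) * 2 * x          # (dz/dy) * (dy/dx)

    eps = 1e-6                            # finite-difference approximation of dz/dx
    numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)

    print(analytic, numeric)              # the two values agree closely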
9.6.3 Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of
a scalar with respect to any node in the computational graph that produced that scalar.
However, actually evaluating that expression in a computer introduces some extra considerations.
Example:
A procedure that performs the computations mapping n_i inputs u^(1), . . . , u^(n_i) to an output u^(n).
Back-Propagation Computation in a Fully Connected MLP
Consider a fully connected MLP, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided as input.
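A minimal sketch of the forward and backward pass for such a network, with one ReLU hidden layer and squared-error loss (the shapes and names are illustrative, not the book's exact algorithm listing):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                 # a single input example
    y = np.array([1.0])                    # its target

    W1 = rng.normal(0, 0.1, size=(3, 4))   # first (hidden) layer parameters
    b1 = np.zeros(4)
    W2 = rng.normal(0, 0.1, size=(4, 1))   # output layer parameters
    b2 = np.zeros(1)

    # Forward pass: compute and cache the activations layer by layer.
    a1 = x @ W1 + b1                       # hidden pre-activation
    h1 = np.maximum(0, a1)                 # ReLU hidden layer
    y_hat = h1 @ W2 + b2                   # linear output
    L = 0.5 * np.sum((y_hat - y) ** 2)     # squared-error loss L(y_hat, y)

    # Backward pass: propagate the gradient from the loss to each parameter.
    g = y_hat - y                          # dL/dy_hat
    dW2 = np.outer(h1, g)                  # dL/dW2
    db2 = g                                # dL/db2
    g = (W2 @ g) * (a1 > 0)                # dL/da1: back through W2, then the ReLU
    dW1 = np.outer(x, g)                   # dL/dW1
    db1 = g                                # dL/db1

    print(L, dW1.shape, dW2.shape)         # loss value and gradient shapes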
Symbol-to-Symbol Derivatives
Algebraic expressions and computational graphs both operate on symbols, or variables that
do not have specific values.
These algebraic and graph-based representations are called symbolic representations.
General Back-Propagation
The back-propagation algorithm is very simple.
To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we
begin by observing that the gradient with respect to z is given by dz/dz = 1.
We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z.
Each node in the graph G corresponds to a variable. The algorithm relies on three subroutines:
get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph.
get_consumers(V, G): This returns the list of variables that are children of V in the
computational graph G.
get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.
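A minimal sketch of this graph-based scheme (the Node class is an illustrative stand-in for the graph structure behind get_operation, get_inputs, and get_consumers; only add and multiply operations are implemented, and the simple traversal assumes a tree-structured graph in which each variable has a single consumer):

    import numpy as np

    class Node:
        # A variable in the graph G. 'op' names the operation that produced it
        # (get_operation) and 'inputs' are its parents (get_inputs); consumers
        # (get_consumers) could be recovered by scanning the graph.
        def __init__(self, value, op=None, inputs=()):
            self.value, self.op, self.inputs = value, op, inputs

    def add(a, b):
        return Node(a.value + b.value, op="add", inputs=(a, b))

    def mul(a, b):
        return Node(a.value * b.value, op="mul", inputs=(a, b))

    def backprop(z):
        # Start from dz/dz = 1 and push gradients to each parent by
        # multiplying the current gradient by the derivative of the
        # operation that produced the current variable.
        grads = {z: 1.0}
        stack = [z]
        while stack:
            v = stack.pop()
            g = grads[v]
            if v.op == "add":            # d(a+b)/da = d(a+b)/db = 1
                local = (1.0, 1.0)
            elif v.op == "mul":          # d(a*b)/da = b, d(a*b)/db = a
                local = (v.inputs[1].value, v.inputs[0].value)
            else:
                continue                 # a leaf variable: nothing to propagate
            for parent, d in zip(v.inputs, local):
                grads[parent] = grads.get(parent, 0.0) + g * d
                stack.append(parent)
        return grads

    x, w, b = Node(2.0), Node(3.0), Node(1.0)
    z = add(mul(x, w), b)                # z = x*w + b
    grads = backprop(z)
    print(grads[x], grads[w], grads[b])  # 3.0 2.0 1.0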