9. DEEP NETWORKS
9.1 Deep Feedforward Networks
Deep feedforward networks, also often called feedforward neural networks or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*.
For example:
For a classifier, y = f ∗(x) maps an input x to a category y.
A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that
result in the best function approximation.
These models are called feedforward because information flows through the function being
evaluated from x, through the intermediate computations used to define f, and finally to the output
y.
There are no feedback connections in which outputs of the model are fed back into itself.
When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.
Feedforward networks are of extreme importance to machine learning practitioners.
Feedforward neural networks are called networks because they are typically represented by
composing together many different functions.
The model is associated with a directed acyclic graph describing how the functions are composed
together.
For example:
o We might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))).
o These chain structures are the most commonly used structures of neural networks.
o In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on.
o The overall length of the chain gives the depth of the model. It is from this terminology that the
name “deep learning” arises.
o The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x).
9.2 Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an example of a fully
functioning feedforward network on a very simple task: learning the XOR function.
The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2.
When exactly one of these binary values is equal to 1, the XOR function returns 1.
Otherwise, it returns 0. The XOR function provides the target function y = f ∗(x) that we want to learn.
Our model provides a function y = f(x;θ) and our learning algorithm will adapt the parameters θ to make f
as similar as possible to f ∗.
We want our network to perform correctly on the four points X = {[0, 0]ᵀ, [0, 1]ᵀ, [1, 0]ᵀ, [1, 1]ᵀ}. We will train the network on all four of these points. The only challenge is to fit the training set.
We can treat this problem as a regression problem and use a mean squared error loss function.
We choose this loss function to simplify the math for this example as much as possible.
Evaluated on our whole training set, the MSE loss function is
J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))^2.
Figure: Solving the XOR problem by learning a representation
Figure: An example of a feedforward network, drawn in two different styles
o In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU, defined by the activation function g(z) = max{0, z} and depicted in the figure below.
Figure: The rectified linear activation function
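To make this concrete, here is a minimal NumPy sketch (an illustration, not the book's code) that evaluates a one-hidden-layer ReLU network on all four XOR inputs and computes its MSE loss; the particular values of W, c, w, and b are one well-known exact solution, assumed here rather than learned:

    import numpy as np

    # The four XOR inputs and their targets.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # shape (4, 2)
    y = np.array([0, 1, 1, 0])                       # f*(x) for each row of X

    # One well-known exact solution for a network with one hidden layer
    # of two ReLU units: f(x) = w^T max(0, W^T x + c) + b.
    W = np.array([[1, 1],
                  [1, 1]])
    c = np.array([0, -1])
    w = np.array([1, -2])
    b = 0

    h = np.maximum(0, X @ W + c)   # hidden layer: ReLU(W^T x + c) for every example
    y_hat = h @ w + b              # linear output layer

    mse = np.mean((y - y_hat) ** 2)
    print(y_hat)   # [0. 1. 1. 0.] -- the network reproduces XOR exactly
    print(mse)     # 0.0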
9.3 Gradient-Based Learning
For feedforward neural networks, it is important to initialize all weights to small random values.
The biases may be initialized to zero or to small positive values.
Iterative gradient-based optimization algorithms are used to train feedforward networks and almost all other deep models.
a) Cost Functions:
An important aspect of the design of a deep neural network is the choice of the cost
function.
Fortunately, the cost functions for neural networks are more or less the same as those for
other parametric models, such as linear models.
In most cases, our parametric model defines a distribution p(y | x;θ ) and we simply use the
principle of maximum likelihood.
This means we use the cross-entropy between the training data and the model’s predictions
as the cost function.
Learning Conditional Distributions with Maximum Likelihood:
Most modern neural networks are trained using maximum likelihood.
This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution.
This cost function is given by
J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).
If, for example, p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,
J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||^2 + const,
up to a scaling factor of 1/2 and a term that does not depend on θ.
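As a quick numerical illustration of that last point (a minimal sketch; the unit-variance Gaussian p_model(y | x) = N(y; f(x; θ), I) is an assumption of the example), the Gaussian negative log-likelihood and half the squared error differ only by a constant:

    import numpy as np

    y     = np.array([0.2, -1.0, 0.5])   # a target vector
    y_hat = np.array([0.1, -0.8, 0.7])   # the model's predicted mean f(x; theta)
    d = y.size

    # Negative log-likelihood of y under N(y; y_hat, I).
    nll = 0.5 * np.sum((y - y_hat) ** 2) + 0.5 * d * np.log(2 * np.pi)

    half_sq_err = 0.5 * np.sum((y - y_hat) ** 2)

    # The difference is a constant that does not depend on theta.
    print(nll - half_sq_err)             # 0.5 * d * log(2*pi), independent of y_hat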
Learning Conditional Statistics:
Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.
For example:
o We can design the cost functional to have its minimum lie on the function that maps x to the
expected value of y given x.
o Solving an optimization problem with respect to a function requires a mathematical tool called the calculus of variations.
Output Units
The choice of cost function is tightly coupled with the choice of output unit.
i) Linear Units for Gaussian Output Distributions
o One simple kind of output unit is an output unit based on an affine transformation with no
nonlinearity.
o These are often just called linear units. Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.
o Linear output layers are often used to produce the mean of a conditional Gaussian distribution: p(y | x) = N(y; ŷ, I).
ii) Sigmoid Units for Bernoulli Output Distributions:
o The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x
o A Bernoulli distribution is defined by just a single number.
o The neural net needs to predict only P(y = 1 | x).
o For this number to be a valid probability, it must lie in the interval [0, 1].
o Satisfying this constraint requires some careful design effort.
Suppose we were to use a linear unit and threshold its value to obtain a valid probability: P(y = 1 | x) = max{0, min{1, wᵀh + b}}. This defines a valid conditional distribution, but gradient descent trains it poorly: whenever wᵀh + b strays outside the interval [0, 1], the gradient is zero. Instead, a sigmoid output unit uses ŷ = σ(wᵀh + b), where σ is the logistic sigmoid function.
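A minimal sketch of the resulting maximum-likelihood loss (the function names are illustrative, and the softplus identity ζ(x) = log(1 + exp(x)) is assumed): for a sigmoid unit with logit z = wᵀh + b and target y ∈ {0, 1}, the negative log-likelihood can be written as ζ((1 − 2y)z), which is numerically stable:

    import numpy as np

    def softplus(x):
        # zeta(x) = log(1 + exp(x)), computed stably for large |x|.
        return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

    def sigmoid_nll(z, y):
        # Negative log-likelihood -log P(y | x) for a Bernoulli output unit
        # with logit z and target y in {0, 1}:
        #   -log sigma(z)       = softplus(-z)  when y = 1,
        #   -log (1 - sigma(z)) = softplus(z)   when y = 0,
        # which combine into softplus((1 - 2y) * z).
        return softplus((1 - 2 * y) * z)

    z = np.array([5.0, -3.0, 0.0])
    y = np.array([1,    0,    1  ])
    print(sigmoid_nll(z, y))   # small loss wherever the sign of z matches y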
iii) Softmax Units for Multinoulli Output Distributions
o Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.
o In the case of binary variables, we wished to produce a single number ŷ = P(y = 1 | x). Softmax generalizes this to a vector ŷ with ŷ_i = P(y = i | x), computed as softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
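A minimal numerically stable softmax sketch (subtracting max(z) before exponentiating is standard and leaves the result unchanged, since softmax is invariant to adding a scalar to every input):

    import numpy as np

    def softmax(z):
        # softmax(z)_i = exp(z_i) / sum_j exp(z_j), stabilized by
        # subtracting max(z), which does not change the output.
        shifted = z - np.max(z)
        e = np.exp(shifted)
        return e / np.sum(e)

    z = np.array([2.0, 1.0, 0.1])   # unnormalized log probabilities for 3 classes
    p = softmax(z)
    print(p, p.sum())               # a valid probability distribution summing to 1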
9.4 Hidden Units
Rectified linear units are an excellent default choice of hidden unit.
Many other types of hidden units are available.
The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then
training a network with that kind of hidden unit and evaluating its performance on a validation set.
Rectified Linear Units and Their Generalizations
o Rectified linear units use the activation function g(z) = max{0, z}.
o Rectified linear units are easy to optimize because they are so similar to linear units.
o One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.
A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere.
Three generalizations of rectified linear units are based on using a non-zero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i).
Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition from images, where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable.
A leaky ReLU fixes α_i to a small value like 0.01, while a parametric ReLU, or PReLU, treats α_i as a learnable parameter.
Maxout units generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:
g(z)_i = max_{j ∈ G(i)} z_j,
where G(i) is the set of indices into the inputs for group i, {(i − 1)k + 1, . . . , ik}. A sketch of these variants follows below.
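These generalizations differ only in the choice of α_i, as this minimal sketch illustrates (the function names are illustrative; maxout is shown for a 1-D z split into contiguous groups of k values):

    import numpy as np

    def relu_family(z, alpha):
        # h_i = max(0, z_i) + alpha_i * min(0, z_i); a PReLU would learn alpha.
        return np.maximum(0, z) + alpha * np.minimum(0, z)

    z = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu_family(z, 0.0))    # ReLU:           [ 0.    0.    0.   1.5]
    print(relu_family(z, 0.01))   # leaky ReLU:     [-0.02 -0.005 0.   1.5]
    print(relu_family(z, -1.0))   # absolute value: [ 2.    0.5   0.   1.5]

    def maxout(z, k):
        # Divide z into groups of k values and output the max of each group.
        return z.reshape(-1, k).max(axis=1)

    print(maxout(np.array([3.0, -1.0, 0.5, 2.0]), k=2))   # [3. 2.]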
Logistic Sigmoid and Hyperbolic Tangent
o Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid
activation function g(z) = σ(z) or the hyperbolic tangent activation function g(z) = tanh(z).
o These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
9.5 Architecture Design
The word architecture refers to the overall structure of the network:
How many units it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers.
Most neural network architectures arrange these layers in a chain structure, with each layer being
a function of the layer that preceded it.
In this structure, the first layer is given by h^(1) = g^(1)(W^(1)ᵀ x + b^(1)), the second layer by h^(2) = g^(2)(W^(2)ᵀ h^(1) + b^(2)), and so on.
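A minimal sketch of such a chain-structured forward pass (the layer widths, the tanh hidden activation, and the initialization scale are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    widths = [3, 4, 4, 1]   # input dim, two hidden layers, output dim (illustrative)

    # Small random weights, zero biases (see the initialization advice above).
    Ws = [rng.normal(0, 0.1, size=(m, n)) for m, n in zip(widths[:-1], widths[1:])]
    bs = [np.zeros(n) for n in widths[1:]]

    def forward(x):
        # h(1) = g(W(1)^T x + b(1)), h(2) = g(W(2)^T h(1) + b(2)), ...
        h = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = np.tanh(h @ W + b)       # each layer is a function of the previous one
        return h @ Ws[-1] + bs[-1]       # linear output layer

    print(forward(np.array([1.0, -1.0, 0.5])))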
Universal Approximation Properties and Depth
The universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.
The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well.
9.6 Back-Propagation and Other Differentiation Algorithms
The back-propagation algorithm, often simply called backprop, allows the information from the cost to flow backwards through the network in order to compute the gradient. Back-propagation refers only to the method for computing the gradient; another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.
Computational Graphs
To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.
Many ways of formalizing computation as graphs are possible.
To formalize our graphs, we also need to introduce the idea of an operation.
An operation is a simple function of one or more variables.
Our graph language is accompanied by a set of allowable operations.
Functions more complicated than the operations in this set may be described by composing many
operations together.
Chain Rule of Calculus
The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. If y = g(x) and z = f(g(x)) = f(y), then dz/dx = (dz/dy)(dy/dx).
Back-propagation is an algorithm that computes the chain rule, with a specific order of operations
that is highly efficient.
Figure: Examples of computational graphs
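As a quick numerical check of the chain rule on a tiny composition (a sketch; f and g are chosen arbitrarily):

    import numpy as np

    # z = f(g(x)) with g(x) = x**2 and f(y) = sin(y).
    x = 1.3
    y = x ** 2
    z = np.sin(y)

    analytic = np.cos(y) * 2 * x          # (dz/dy) * (dy/dx)

    eps = 1e-6                            # finite-difference approximation of dz/dx
    numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)

    print(analytic, numeric)              # the two values agree closely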
9.6.3 Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of
a scalar with respect to any node in the computational graph that produced that scalar.
However, actually evaluating that expression in a computer introduces some extra considerations.
Example:
A procedure that performs the computations mapping n_i inputs u^(1), . . . , u^(n_i) to an output u^(n).
Back-Propagation Computation in a Fully Connected MLP
Consider a fully connected MLP, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided as input.
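A minimal sketch of the forward and backward pass for such a network, with one ReLU hidden layer and squared-error loss (the shapes and names are illustrative, not the book's exact algorithm listing):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                 # a single input example
    y = np.array([1.0])                    # its target

    W1 = rng.normal(0, 0.1, size=(3, 4))   # first (hidden) layer parameters
    b1 = np.zeros(4)
    W2 = rng.normal(0, 0.1, size=(4, 1))   # output layer parameters
    b2 = np.zeros(1)

    # Forward pass: compute and cache the activations layer by layer.
    a1 = x @ W1 + b1                       # hidden pre-activation
    h1 = np.maximum(0, a1)                 # ReLU hidden layer
    y_hat = h1 @ W2 + b2                   # linear output
    L = 0.5 * np.sum((y_hat - y) ** 2)     # squared-error loss L(y_hat, y)

    # Backward pass: propagate the gradient from the loss to each parameter.
    g = y_hat - y                          # dL/dy_hat
    dW2 = np.outer(h1, g)                  # dL/dW2
    db2 = g                                # dL/db2
    g = (W2 @ g) * (a1 > 0)                # dL/da1: back through W2, then the ReLU
    dW1 = np.outer(x, g)                   # dL/dW1
    db1 = g                                # dL/db1

    print(L, dW1.shape, dW2.shape)         # loss value and gradient shapes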
Symbol-to-Symbol Derivatives
Algebraic expressions and computational graphs both operate on symbols, or variables that
do not have specific values.
These algebraic and graph-based representations are called symbolic representations.
General Back-Propagation
The back-propagation algorithm is very simple.
To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we
begin by observing that the gradient with respect to z is given by dz/dz = 1.
We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z.
Each node in the graph G corresponds to a variable. The algorithm relies on three subroutines:
get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph.
get_consumers(V, G): This returns the list of variables that are children of V in the
computational graph G.
get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.
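A minimal sketch of this graph-based scheme (the Node class is an illustrative stand-in for the graph structure behind get_operation, get_inputs, and get_consumers; only add and multiply operations are implemented, and the simple traversal assumes a tree-structured graph in which each variable has a single consumer):

    import numpy as np

    class Node:
        # A variable in the graph G. 'op' names the operation that produced it
        # (get_operation) and 'inputs' are its parents (get_inputs); consumers
        # (get_consumers) could be recovered by scanning the graph.
        def __init__(self, value, op=None, inputs=()):
            self.value, self.op, self.inputs = value, op, inputs

    def add(a, b):
        return Node(a.value + b.value, op="add", inputs=(a, b))

    def mul(a, b):
        return Node(a.value * b.value, op="mul", inputs=(a, b))

    def backprop(z):
        # Start from dz/dz = 1 and push gradients to each parent by
        # multiplying the current gradient by the derivative of the
        # operation that produced the current variable.
        grads = {z: 1.0}
        stack = [z]
        while stack:
            v = stack.pop()
            g = grads[v]
            if v.op == "add":            # d(a+b)/da = d(a+b)/db = 1
                local = (1.0, 1.0)
            elif v.op == "mul":          # d(a*b)/da = b, d(a*b)/db = a
                local = (v.inputs[1].value, v.inputs[0].value)
            else:
                continue                 # a leaf variable: nothing to propagate
            for parent, d in zip(v.inputs, local):
                grads[parent] = grads.get(parent, 0.0) + g * d
                stack.append(parent)
        return grads

    x, w, b = Node(2.0), Node(3.0), Node(1.0)
    z = add(mul(x, w), b)                # z = x*w + b
    grads = backprop(z)
    print(grads[x], grads[w], grads[b])  # 3.0 2.0 1.0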