Unit 2 Deep Learning

Uploaded by

sahil.utube2003
Copyright
© © All Rights Reserved
Unit 2

Feed Forward Neural Network & Gradient Descent

Feed forward network in deep learning

A feedforward neural network is an artificial neural network in which the
connections between the nodes do not form a loop. Each neuron accepts multiple
inputs, multiplies each input by a weight, and adds up the products.

How Feedforward Neural Networks Work

Feedforward neural networks were among the first and most successful learning
algorithms. They are also called deep feedforward networks, multi-layer
perceptrons (MLPs), or simply neural networks. A feedforward neural network is
made up of the following:
 Input layer: This layer consists of the neurons that receive the inputs and
pass them on to the other layers. The number of neurons in the input layer
should equal the number of attributes or features in the dataset.
 Output layer: The output layer produces the prediction, and its form depends
on the type of model you are building.
 Hidden layer: Between the input and output layers there are one or more
hidden layers, depending on the type of model. Hidden layers contain neurons
that apply transformations to the inputs before passing them on.
As the network is trained, the weights are updated to be more predictive.
 Neuron weights: Weights refer to the strength or amplitude of a
connection between two neurons. If you are familiar with linear
regression, you can compare the weights on inputs to coefficients. Weights
are often initialized to small random values, such as values in the range 0
to 1.

 Activation function: The activation function determines the output of a
neuron from its weighted input.
 Depending on the activation function, the neuron applies a linear or a
nonlinear transformation. Because the signal passes through many layers, a
bounded activation function also keeps neuron outputs from growing unchecked
from one layer to the next.
 Three commonly used activation functions are sigmoid, tanh, and the
Rectified Linear Unit (ReLU).

1-Sigmoid:

 Input values get mapped to output values between 0 and 1.

2-Tanh:

 Input values get mapped to output values between -1 and 1.

3-Rectified Linear Unit (ReLU):

Only positive values are allowed to flow through this function. Negative values
get mapped to 0.
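
As a minimal sketch, the three activation functions can be written in plain Python as follows. The function names are chosen here for illustration and do not come from any particular library.

```python
import math

def sigmoid(z):
    """Map any real input to an output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Map any real input to an output between -1 and 1."""
    return math.tanh(z)

def relu(z):
    """Let positive values through unchanged; map negative values to 0."""
    return max(0.0, z)

# Compare the three functions on a negative, a zero and a positive input.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```

Note that the simple sigmoid above can overflow for very large negative inputs; it is meant only to show the shape of the mapping.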

 To better understand how feedforward neural networks function, let us
solve a simple problem: predicting whether it is raining or not, given
three inputs.

 x1 - day/night
 x2 - temperature
 x3 - month

 Let us assume the threshold value to be 20: if the output is higher
than 20 it will be raining, otherwise it is a sunny day.
 We are given a data tuple with inputs (x1, x2, x3) as (0, 12, 11), initial
weights of the feedforward network (w1, w2, w3) as (0.1, 1, 1) and biases
(b1, b2, b3) as (1, 0, 0).

 Here is how the neural network computes the data, step by step:

1. Multiplication of weights and inputs: Each input is multiplied by its
assigned weight, which in this case gives the following:

(x1* w1) = (0 * 0.1) = 0

(x2* w2) = (12 * 1) = 12

(x3* w3) = (11 * 1) = 11

2. Adding the biases: In the next step, each product found in the previous step
is added to its respective bias. The modified inputs are then summed up to a
single value.

(x1* w1) + b1 = 0 + 1 = 1

(x2* w2) + b2 = 12 + 0 = 12

(x3* w3) + b3 = 11 + 0 = 11

weighted sum = (x1* w1) + b1 + (x2* w2) + b2 + (x3* w3) + b3 = 1 + 12 + 11 = 24

3. Activation: An activation function is the mapping of the summed weighted
input to the output of the neuron. It is called an activation/transfer function
because it governs the threshold at which the neuron is activated and the
strength of the output signal.

4. Output signal: Finally, the weighted sum is turned into an output
signal by feeding it into the activation function (also called the
transfer function). Since the weighted sum in our example is greater than the
threshold of 20, the perceptron predicts a rainy day.

The image below illustrates this process more clearly.
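
The same computation can be sketched in a few lines of Python. The inputs, weights, biases and threshold are the ones from the example; the helper names `weighted_sum` and `step_activation` are illustrative assumptions, and a simple step function stands in for the activation.

```python
inputs = [0, 12, 11]     # x1 (day/night), x2 (temperature), x3 (month)
weights = [0.1, 1, 1]    # w1, w2, w3
biases = [1, 0, 0]       # b1, b2, b3
THRESHOLD = 20

def weighted_sum(xs, ws, bs):
    """Steps 1 and 2: multiply each input by its weight, add its bias, sum up."""
    return sum(x * w + b for x, w, b in zip(xs, ws, bs))

def step_activation(total, threshold):
    """Steps 3 and 4: output 1 (rainy) above the threshold, else 0 (sunny)."""
    return 1 if total > threshold else 0

total = weighted_sum(inputs, weights, biases)
prediction = step_activation(total, THRESHOLD)
print(total, "rainy" if prediction == 1 else "sunny")
```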


Calculating the Loss

In simple terms, a loss function quantifies how “good” or “bad” a given model
is at classifying the input data. In most learning networks, the loss is
calculated from the difference between the actual output and the predicted
output.

Mathematically:

loss = y_{predicted} - y_{original}

The function used to compute this error is known as the loss function J(·).
Different loss functions return different errors for the same prediction,
which can have a considerable effect on the performance of the model.
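
To illustrate that different loss functions return different errors for the same predictions, here is a small sketch comparing two common choices, squared error and absolute error. The data values are made up for the illustration.

```python
def squared_error(y_true, y_pred):
    return (y_pred - y_true) ** 2

def absolute_error(y_true, y_pred):
    return abs(y_pred - y_true)

def mean_loss(loss_fn, ys_true, ys_pred):
    """Average a per-example loss over a whole batch of predictions."""
    losses = [loss_fn(t, p) for t, p in zip(ys_true, ys_pred)]
    return sum(losses) / len(losses)

y_original = [1.0, 0.0, 1.0]    # made-up targets
y_predicted = [0.5, 0.5, 3.0]   # made-up predictions

mse = mean_loss(squared_error, y_original, y_predicted)
mae = mean_loss(absolute_error, y_original, y_predicted)
print(mse, mae)
```

The squared error penalizes the one large mistake (3.0 predicted for a target of 1.0) much more heavily than the absolute error does, which is why the two averages differ.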

Backpropagation

 The predicted value of the network is compared to the expected output,
and an error is calculated using a loss function. This error is then
propagated back through the whole network, one layer at a time, and the
weights are updated according to the amount that they contributed to the
error.
 This clever bit of math is called the backpropagation algorithm. The
process is repeated for all the examples in the training data.
 One round of updating the network for the entire training dataset is called
an epoch. A network may be trained for tens, hundreds, or many
thousands of epochs.
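
The following is a minimal sketch of backpropagation over several epochs, assuming a tiny network with one input, one sigmoid hidden neuron and one linear output neuron, and a squared-error loss. The network shape, starting weights, data and learning rate are all assumptions made for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return (hidden activation, prediction) for a single input x."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    return h, w2 * h + b2

def mean_loss(params, data):
    return sum(0.5 * (forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)

def train(params, data, learning_rate=0.1, epochs=500):
    w1, b1, w2, b2 = params
    for _ in range(epochs):                 # one epoch = one pass over all examples
        for x, y in data:
            h, y_hat = forward((w1, b1, w2, b2), x)
            error = y_hat - y               # derivative of the loss w.r.t. y_hat
            grad_w2, grad_b2 = error * h, error            # output layer
            delta_hidden = error * w2 * h * (1.0 - h)      # error propagated back
            grad_w1, grad_b1 = delta_hidden * x, delta_hidden  # hidden layer
            w1 -= learning_rate * grad_w1
            b1 -= learning_rate * grad_b1
            w2 -= learning_rate * grad_w2
            b2 -= learning_rate * grad_b2
    return (w1, b1, w2, b2)

data = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]   # made-up training examples
initial = (0.5, 0.0, 0.5, 0.0)                # made-up starting weights
trained = train(initial, data)
loss_before = mean_loss(initial, data)
loss_after = mean_loss(trained, data)
print(loss_before, loss_after)
```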

What is a feed forward neural network?

Feed forward neural networks are artificial neural networks in which the nodes
do not form loops. Because all information is only passed forward, a network
of this type with several layers is also known as a multi-layer neural network.

During data flow, input nodes receive data, which travels through the hidden
layers and exits through the output nodes. No links exist in the network that
could be used to send information back from the output nodes.

A feed forward neural network approximates functions in the following way:

 A classifier, for example, is described by a function y = f*(x).
 This function assigns input x to category y.
 The feed forward model defines a mapping y = f(x; θ) and learns the value
of the parameters θ that gives the closest approximation of f*.

What is the working principle of a feed forward neural network?

 In its simplest form, a feed forward neural network is a
single-layer perceptron.
 This model multiplies the inputs by weights as they enter the layer.
Afterward, the weighted input values are added together to get the sum.
 If the sum rises above a certain threshold, usually set at
zero, the output value is usually 1, while if it falls below the threshold, it
is usually -1.
 As a feed forward neural network model, the single-layer perceptron
is often used for classification, and its weights can be learned from
data.
 During training, the weights are updated by gradient descent.
Multi-layer perceptrons update their weights in a similar way, but there the
process is known as back-propagation.
 In that case, the network's hidden layers are adjusted according
to the output values produced by the final layer.
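
A single-layer perceptron with a threshold of zero and outputs of 1 and -1 can be sketched as below. For the training step, this sketch assumes the classic perceptron learning rule (adjust the weights only on misclassified examples) rather than a full gradient descent procedure, and it uses a made-up AND-style dataset.

```python
def predict(weights, bias, inputs):
    """Weighted sum followed by a threshold at zero: output 1 or -1."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else -1

def train_perceptron(samples, learning_rate=1.0, epochs=50):
    """Classic perceptron rule: nudge the weights towards misclassified targets."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            if predict(weights, bias, inputs) != target:
                weights = [w + learning_rate * target * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * target
    return weights, bias

# Logical AND with labels -1 / 1 (illustrative data, linearly separable).
samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print(weights, bias, [predict(weights, bias, x) for x, _ in samples])
```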

Understanding the math behind neural networks

 In the figure, neurons are drawn as hexagons. In neural networks, neurons
are arranged into layers: the input is the first layer and the output is the
last, with the hidden layer in the middle.
 A neural network (NN) consists of two main elements, the neurons and the
synaptic weights that link them. Neurons calculate weighted sums of their
input data using the synaptic weights, so a neural network is ultimately a
sequence of mathematical computations over these synaptic links.

The following is a simplified visualization:


In a matrix format, it looks as follows:

In the third step, a vector of ones gets multiplied by the output of our hidden
layer.
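
The matrix form of this computation can be sketched in plain Python as follows. The weight matrix, bias vector and input are made-up values, and ReLU is assumed as the hidden activation purely for the illustration; the final step multiplies a vector of ones by the hidden-layer output, which simply adds up the hidden activations.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def relu_vec(values):
    return [max(0.0, v) for v in values]

x = [1.0, 2.0]                      # input vector (made up)
W_hidden = [[0.5, -1.0],            # one row of weights per hidden neuron
            [1.0, 1.0],
            [-0.5, 0.25]]
b_hidden = [0.0, -1.0, 0.5]

# Step 1: weighted sums for the hidden layer, W x + b.
z = [s + b for s, b in zip(matvec(W_hidden, x), b_hidden)]
# Step 2: apply the activation function element-wise.
hidden = relu_vec(z)
# Step 3: multiply a vector of ones by the hidden-layer output.
ones = [1.0] * len(hidden)
output = sum(o * h for o, h in zip(ones, hidden))
print(hidden, output)
```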

Gradient Descent
 Gradient descent is an optimization algorithm that is commonly used to
train machine learning models and neural networks.
 Training data helps these models learn over time, and the cost function
within gradient descent acts as a barometer, gauging the model's
accuracy with each iteration of parameter updates.
 Until the cost function is close to or equal to zero, the model will
continue to adjust its parameters to yield the smallest possible error.
 Once machine learning models are optimized for accuracy, they can be
powerful tools for artificial intelligence (AI) and computer science
applications.

How does gradient descent work?


 Recall the equation of a line used in linear regression:
y = mx + b
 where m represents the slope and b is the intercept on the y-axis.
 You may also recall plotting a scatterplot in statistics and finding the line
of best fit, which required calculating the error between the actual output
and the predicted output (y-hat) using the mean squared error formula.
The gradient descent algorithm behaves similarly, but it is based on a
convex function.
 The starting point is just an arbitrary point at which we evaluate the
performance. From that starting point, we find the derivative (or
slope), and from there, we can use a tangent line to observe the steepness
of the slope.
 The slope informs the updates to the parameters, i.e., the weights and
bias.
 The slope is steepest at the starting point, but as new parameters are
generated, the steepness should gradually reduce until it reaches the
lowest point on the curve, known as the point of convergence.
 Similar to finding the line of best fit in linear regression, the goal of
gradient descent is to minimize the cost function, or the error between
predicted and actual y.
 In order to do this, it requires two pieces of information: a direction and
a learning rate.
 These factors determine the partial derivative calculations of future
iterations, allowing the algorithm to gradually arrive at the local or global
minimum (i.e., point of convergence).

 Learning rate (also referred to as step size or alpha) is the size of the
steps that are taken to reach the minimum. This is typically a small value,
and it is evaluated and updated based on the behaviour of the cost
function.
 High learning rates result in larger steps but risk overshooting the
minimum.
 Conversely, a low learning rate has small step sizes. While it has the
advantage of more precision, the larger number of iterations compromises
overall efficiency, as it takes more time and computation to reach the
minimum.
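
The effect of the learning rate can be seen on a very simple cost function. The sketch below assumes the cost f(w) = w², whose gradient is 2w and whose minimum is at w = 0; the three learning rates are chosen only to show the three behaviours.

```python
def run_gradient_descent(learning_rate, start=10.0, steps=20):
    """Minimize f(w) = w**2 (gradient 2*w) from a fixed starting point."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

too_small = run_gradient_descent(0.01)   # tiny steps: still far from 0
reasonable = run_gradient_descent(0.1)   # steady progress towards 0
too_large = run_gradient_descent(1.1)    # overshoots further on every step
print(too_small, reasonable, too_large)
```

With the largest rate, each step jumps past the minimum to a point further away than before, so the iterates move away from the minimum instead of settling into it.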

 The cost (or loss) function measures the difference, or error, between
actual y and predicted y at its current position. This improves the machine
learning model's efficacy by providing feedback to the model so that it
can adjust the parameters to minimize the error and find the local or
global minimum.
 It continuously iterates, moving along the direction of steepest descent
(or the negative gradient) until the cost function is close to its minimum,
where the gradient is close to zero. At this point, the model will stop
learning.
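
Putting these pieces together, here is a minimal sketch of gradient descent fitting the line y = mx + b with the mean squared error as the cost function. The data points, learning rate and number of iterations are assumptions for the illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0                       # arbitrary starting point
    n = len(xs)
    for _ in range(iterations):
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        m -= learning_rate * grad_m       # step along the negative gradient
        b -= learning_rate * grad_b
    return m, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # points on the line y = 2x + 1
m, b = fit_line(xs, ys)
print(round(m, 4), round(b, 4))
```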

Challenges with gradient descent


Local minima and saddle points

 For convex problems, gradient descent can find the global minimum with
ease, but for nonconvex problems, gradient descent can struggle to
find the global minimum, where the model achieves the best results.
 Recall that when the slope of the cost function is at or close to zero, the
model stops learning. A few scenarios beyond the global minimum can
also yield this slope, namely local minima and saddle points.
 Local minima mimic the shape of a global minimum, where the slope of
the cost function increases on either side of the current point.
 At a saddle point, however, the gradient is also zero, but the point is a
local maximum along one direction and a local minimum along another.
Noisy gradients can help the optimizer escape local minima and saddle
points.
Vanishing and Exploding Gradients

 Vanishing gradients: This occurs when the gradient is too small. As we
move backwards during backpropagation, the gradient continues to
become smaller, causing the earlier layers in the network to learn more
slowly than later layers. When this happens, the weight updates shrink
until they become insignificant (i.e., close to 0), resulting in an algorithm
that is no longer learning.
 Exploding gradients: This happens when the gradient is too large,
creating an unstable model. In this case, the model weights will grow too
large, and they will eventually be represented as NaN. One solution to
this issue is to leverage a dimensionality reduction technique, which can
help to minimize complexity within the model.
 If we move in the direction of the negative gradient, that is, away from
the gradient of the function at the current point, we approach a local
minimum of that function.
 If we move in the direction of the positive gradient, that is, towards the
gradient of the function at the current point, we approach a local maximum
of that function.
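
The vanishing and exploding behaviour described above can be illustrated with a simplified model, assuming a chain of sigmoid layers with one neuron each and the same weight in every layer. During backpropagation the gradient is multiplied, layer by layer, by the weight times the derivative of the sigmoid, which is at most 0.25.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # largest value is 0.25, reached at z = 0

def backpropagated_factor(num_layers, z=0.0, weight=1.0):
    """Factor by which a gradient is scaled after passing back through
    num_layers identical sigmoid layers (a simplifying assumption)."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight * sigmoid_derivative(z)
    return factor

print(backpropagated_factor(10))              # shrinks towards 0 (vanishing)
print(backpropagated_factor(10, weight=8.0))  # grows quickly (exploding)
```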

How does Gradient Descent work?

The equation for simple linear regression is given as:

Y=mX+c

where 'm' represents the slope of the line and 'c' represents the intercept on
the y-axis.
 The starting point (shown in the figure above) is just an arbitrary point
used to evaluate the performance.
 At this starting point, we compute the first derivative (slope) and then
use a tangent line to measure the steepness of this slope. This slope then
informs the updates to the parameters (weights and bias).
 The slope is steepest at the starting (arbitrary) point. As new parameters
are generated, the steepness gradually reduces until the curve reaches its
lowest point, which is called the point of convergence.
 The main objective of gradient descent is to minimize the cost function,
i.e., the error between expected and actual output. To minimize the cost
function, two pieces of information are required: a direction and a learning
rate.

Momentum Based Gradient Descent

 Gradient descent with momentum often converges faster than standard
gradient descent. The basic idea is to calculate an exponentially weighted
average of the gradients and then use that average, instead of the raw
gradient, to update the weights.
 Momentum-based gradient descent is a variant of the gradient descent
optimization algorithm that adds a momentum term to the update rule.
 The momentum term is computed as a moving average of the past
gradients, and the weight given to the past gradients is controlled by a
hyperparameter called beta.
 The momentum term helps to accelerate the optimization process by
allowing the updates to build up along directions in which the gradient
stays consistent.
 This can help to address some of the problems with gradient descent,
such as oscillations, slow convergence, and getting stuck in local minima.
With momentum-based gradient descent, it is often possible to train
machine learning models more efficiently.
 It works by adding a fraction of the previous weight update to
the current weight update so that the optimization algorithm can build
momentum as it descends the loss function. This can help the algorithm
escape from local minima and saddle points and can also help it
converge faster by damping oscillations.

Momentum helps to:

 Escape local minima and saddle points
 Converge faster by reducing oscillations
 Smooth out weight updates for stability
 Combine with other optimization algorithms for improved performance.

Note that momentum only changes how the weights are updated. It does not by
itself reduce model complexity or prevent overfitting.

How can this be Used and Applied to Gradient Descent?

To apply gradient descent with momentum, you can update the weights as
follows:

v = rho * v + learning_rate * gradient

weights = weights - v

where v is the momentum term, rho is the momentum hyperparameter
(typically set to 0.9), learning_rate is the learning rate, and gradient is the
gradient of the loss function with respect to the weights.

It's important to note that momentum is just one of many techniques we can use
to improve the convergence of gradient descent. Other techniques include
Nesterov momentum, adaptive learning rates, and mini-batch gradient descent.

How does Momentum-Based Gradient Descent Work?


Gradient descent with momentum is an optimization algorithm that helps
accelerate the convergence of gradient descent by adding a momentum term to
the weight update. The momentum term is based on the previous weight update
and helps the algorithm build momentum as it descends the loss function.
Here is the math behind momentum-based gradient descent:

Let's say we have a set of weights w and a loss function L, and we want to use
gradient descent to find the weights that minimize the loss function. The
standard gradient descent update rule is:

w = w - alpha * gradient

where alpha is the learning rate and gradient is the gradient of the loss
function with respect to the weights.

To incorporate momentum into this update rule, we can
add a momentum term v that is based on the previous weight update:

v = rho * v + alpha * gradient

w = w - v

where rho is the momentum hyperparameter (typically set to 0.9).

The momentum term v can be interpreted as the "velocity" of the optimization
algorithm, and it helps the algorithm build momentum as it descends the loss
function. This can help the algorithm escape from local minima and saddle
points and can also help the algorithm converge faster by avoiding oscillations.
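
The two update rules can be compared side by side with a short sketch. It assumes the loss L(w) = w², whose gradient is 2w, a deliberately small learning rate alpha and rho = 0.9; these values are chosen for the illustration only.

```python
def gradient(w):
    """Gradient of the assumed loss L(w) = w**2."""
    return 2.0 * w

def plain_gradient_descent(w, alpha=0.01, steps=100):
    for _ in range(steps):
        w = w - alpha * gradient(w)
    return w

def momentum_gradient_descent(w, alpha=0.01, rho=0.9, steps=100):
    v = 0.0                                  # velocity starts at rest
    for _ in range(steps):
        v = rho * v + alpha * gradient(w)    # accumulate past updates
        w = w - v
    return w

w_plain = plain_gradient_descent(5.0)
w_momentum = momentum_gradient_descent(5.0)
print(w_plain, w_momentum)
```

With the same learning rate and number of steps, the momentum version ends up much closer to the minimum at w = 0 in this setting, because the velocity keeps growing while the gradient points in a consistent direction.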

Nesterov Accelerated Gradient Descent (NAG)

 As we can see in the figure, in momentum-based gradient descent the steps
become larger and larger due to the accumulated momentum, and we overshoot
at the 4th step. We then must take steps in the opposite direction to reach
the minimum point.
 The update in NAG, however, happens in two steps.
 First, a partial step is taken to reach the look-ahead point, and then the
final update is made. We calculate the gradient at the look-ahead point and
then use it to calculate the final update.
 If the gradient at the look-ahead point is negative, our final update will be
smaller than that of regular momentum-based gradient descent.
 In the example, the updates of NAG are similar to those of
momentum-based gradient descent for the first three steps, because the
gradients at the current point and at the look-ahead point are both positive.
But at step 4, the gradient at the look-ahead point is negative.
 In NAG, the first partial update (4a) is used to go to the look-ahead
point, and the gradient is then calculated at that point without
updating the parameters. Since the gradient at step 4b is negative, the
overall update will be smaller than in momentum-based gradient
descent.
 In the example, momentum-based gradient descent takes six steps to reach
the minimum point, while NAG takes only five.
 This looking ahead helps NAG converge to the minimum in
fewer steps and reduces the chance of overshooting.

We saw how NAG solves the problem of overshooting by ‘looking ahead’. Let
us see how this is calculated and the actual math behind it.

Update rule for gradient descent:

w_{t+1} = w_t − η∇w_t

In this equation, the weight (w) is updated in each iteration, η is the learning
rate, and ∇w_t is the gradient.

Update rule for momentum-based gradient descent:

In this, momentum is added to the conventional gradient descent equation. The
update equation is

w_{t+1} = w_t − update_t

where update_t is calculated by:

update_t = γ · update_{t−1} + η∇w_t

γ (gamma) is a momentum parameter that determines the contribution of the
previous update step to the current step. It takes values between 0 and 1,
typically close to 0.9 or 0.99.

This is how the gradients of all the previous updates are added to the current
update.

Update rule for NAG:

w_{t+1} = w_t − update_t

While calculating update_t, we include the look-ahead gradient
(∇w_{look_ahead}):

update_t = γ · update_{t−1} + η∇w_{look_ahead}

∇w_{look_ahead} is the gradient evaluated at the look-ahead point, which is
calculated by:

w_{look_ahead} = w_t − γ · update_{t−1}

This look-ahead gradient is used in our update and helps prevent
overshooting.
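
The NAG update rule can be sketched as follows, again assuming the simple loss L(w) = w² with gradient 2w. The variable names gamma, eta and update follow the symbols used in the equations; the starting point and hyperparameter values are made up.

```python
def gradient(w):
    """Gradient of the assumed loss L(w) = w**2."""
    return 2.0 * w

def nesterov_gradient_descent(w, eta=0.01, gamma=0.9, steps=100):
    update = 0.0
    for _ in range(steps):
        w_look_ahead = w - gamma * update          # partial step to look ahead
        update = gamma * update + eta * gradient(w_look_ahead)
        w = w - update                             # final update
    return w

w_nag = nesterov_gradient_descent(5.0)
print(w_nag)
```

The only difference from plain momentum is where the gradient is evaluated: at the look-ahead point rather than at the current weight.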

Adagrad (Adaptive Gradient Algorithm)

 In all the optimizers covered so far, up to SGD with momentum, the
learning rate remains constant. The Adagrad optimizer has no
momentum concept, so it is much simpler than SGD with
momentum.
 The idea behind Adagrad is to use a different learning rate for each
parameter at each iteration.
 Different learning rates are needed because the learning rate for the
parameters of sparse features needs to be higher than that for the
parameters of dense features, since sparse features occur less
frequently.

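Equation:

w_{t+1} = w_t − (η / √(α_t + ε)) · ∇w_t, where α_t = Σ_{i=1}^{t} (∇w_i)² is
the running sum of squared gradients for that parameter and ε is a small
constant that avoids division by zero.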

 In the above Adagrad optimizer equation, the learning rate has been
modified in such a way that it automatically decreases, because the
sum of the previous squared gradients keeps on increasing
after every time step.
 Now, let us take a simple example to check how the learning rate is
different for every parameter in a single time step. For this example, we
will consider a single neuron with 2 inputs and 1 output. So, the total
number of parameters will be 3, including the bias.

 The above computation is done at a single time step, where for each of the
three parameters the learning rate “η” is divided by the square root of “α”,
which is different for each parameter.
 So, we can see that the learning rate is different for all three parameters.
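
A single Adagrad time step for a neuron with three parameters can be sketched as below. The parameter values and gradients are made up, and the small constant eps added under the square root is a common convention assumed here to avoid division by zero.

```python
import math

def adagrad_step(params, grads, alphas, eta=0.1, eps=1e-8):
    """One Adagrad update; returns new params, new sums and effective rates."""
    new_params, new_alphas, rates = [], [], []
    for p, g, a in zip(params, grads, alphas):
        a = a + g * g                      # running sum of squared gradients
        rate = eta / math.sqrt(a + eps)    # per-parameter learning rate
        new_params.append(p - rate * g)
        new_alphas.append(a)
        rates.append(rate)
    return new_params, new_alphas, rates

params = [0.5, -0.3, 0.0]        # w1, w2 and bias (made-up values)
grads = [0.1, 1.0, 10.0]         # made-up gradients of very different size
alphas = [0.0, 0.0, 0.0]

params, alphas, rates_step1 = adagrad_step(params, grads, alphas)
params, alphas, rates_step2 = adagrad_step(params, grads, alphas)
print(rates_step1)
print(rates_step2)
```

The parameter with the smallest gradients gets the largest effective learning rate, and every rate shrinks from the first step to the second because the sum of squared gradients only grows.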
RMSProp (Root Mean Square Propagation) Optimizer

 RMSProp is an extension of Adagrad that attempts to solve its radically
diminishing learning rates.
 The idea behind RMSProp is that instead of summing up all the past
squared gradients from time step 1 to “t”, we restrict the
window size.
 For example, we could compute the squared gradients of only the past 10
steps and average them. This can be achieved using exponentially weighted
averages of the squared gradients.

 The Adagrad equation shows that as the time step “t” increases, the
summation of squared gradients “α” increases, which leads to a decrease in
the effective learning rate.
 To resolve this unbounded growth in the summation of squared
gradients “α”, we replace “α” with an exponentially weighted average
of the squared gradients.
 So, unlike “α” in Adagrad, which grows after every time step, this
average does not keep growing.
 The typical “β” value is 0.9 or 0.95.
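
The difference between the two accumulators can be sketched as follows. The RMSProp rule used here is the commonly stated form, with an exponentially weighted average controlled by β; it is an assumption of this sketch, as are the constant gradient of 1.0 and the eps term.

```python
import math

def rmsprop_step(w, grad, avg_sq, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update using an exponentially weighted average."""
    avg_sq = beta * avg_sq + (1.0 - beta) * grad * grad
    w = w - (eta / math.sqrt(avg_sq + eps)) * grad
    return w, avg_sq

w, avg_sq = 0.0, 0.0
adagrad_sum = 0.0
for _ in range(200):
    grad = 1.0                          # constant made-up gradient
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
    adagrad_sum += grad * grad          # what Adagrad would accumulate

print(avg_sq, adagrad_sum)
```

After 200 steps the Adagrad sum has grown to 200, so its effective learning rate keeps shrinking, while the RMSProp average settles near the squared gradient itself and the step size stays roughly constant.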
