DL Unit 1

Uploaded by

D44 SREETEJA
Introduction to Deep Learning

Dr Rajesh Thumma
Assoc Professor
Dept of ECE
Contents
• Course Objectives
• Course Outcomes
• Syllabus
• Fundamentals of deep learning
• Building Block of Neural Networks
• Layers- Single Layer perceptron and MLPs
• Forward pass & Backward pass
• Class, trainer and optimizer
• The Vanishing and Exploding Gradient Problems
• Difficulties in Convergence
• Local and Spurious Optima
• Momentum, learning rate Decay, Dropout
• Cross Entropy loss function.
Course Objectives
• To understand the concept of Deep Learning
• To understand various CNN Architectures
• To learn various RNN models
• To become familiar with the concept of autoencoders
• To apply Transfer Learning to solve problems
Course Outcomes
At the end of this course, students will be able to:
• Understand the fundamental issues and basics of deep learning
• Understand the concept of CNNs and apply it to image classification
problems
• Analyze the various RNN methods for sequences of input and
generative models for image generation
• Analyze the working of the various autoencoder methods
• Use Transfer Learning to solve problems with high dimensional data
including image and speech
Syllabus
• UNIT-I Deep Learning: Fundamentals, Building Block of Neural Networks, Layers, MLPs,
Forward pass, backward pass, class, trainer and optimizer, The Vanishing and Exploding
Gradient Problems, Difficulties in Convergence, Local and Spurious Optima, Momentum,
learning rate Decay, Dropout, Cross Entropy loss function.
• UNIT-II Deep Learning: Activation functions, initialization, regularization, batch
normalization, model selection, ensembles. Convolutional neural networks: Fundamentals,
architectures, striding and padding, pooling layers, CNN -Case study with MNIST, CNN vs
Fully Connected.
• UNIT-III RNN: Handling Branches, Layers, Nodes, Essential Elements-Vanilla RNNs, GRUs,
LSTM, video to text with LSTM models.
• UNIT-IV Autoencoders and GAN: Basics of auto encoder, comparison between auto encoder
and PCA, variational auto encoders, denoising auto encoder, sparse auto encoder, vanilla auto
encoder, Multilayer autoencoder. Convolutional autoencoder, regularized auto encoder. GAN,
Image generation with GAN.
• UNIT-V Transfer Learning- Types, Methodologies, Diving into Transfer Learning, Challenges
What is Deep Learning?
Why do we need Deep Learning?
When to use Deep Learning over other techniques?
• Deep Learning outperforms other techniques when the data size is large. With
small data sizes, traditional Machine Learning algorithms are preferable.
• Deep Learning techniques need high-end infrastructure to train in
reasonable time.
• When there is a lack of domain understanding for feature introspection, Deep
Learning techniques outshine others, as you have to worry less about feature
engineering.
• Deep Learning really shines when it comes to complex problems such as
image classification, natural language processing, and speech recognition.
Factors | Machine Learning | Deep Learning
Definition | Machine Learning is a subfield of AI that focuses on machines being able to learn without being explicitly programmed. | Deep Learning is a subfield of ML that focuses on machines being able to mimic the human brain to perform highly complex AI problems.
Data Feeding | We give structured data to the machine that builds the ML model. | We give unstructured data or you can say the raw input to the neural network.
Volume of Data | ML models deal with datasets having thousands of data rows. | Deep learning models mostly deal with datasets having millions of data rows.
Training Time | ML models take less time to train because of the small data size. | It takes a huge amount of time because of massive data points.
Human Involvement | Machine learning models are easy to build but require more human interaction to make better predictions. | Deep learning models are difficult to build as they use complex multilayered neural networks but they have the capability to learn by themselves.
Feature Engineering | Feature engineering is done explicitly by humans. | No need of feature engineering, neural networks automatically detect important features.
Goal | To give the output as close as it can be to the expected output. | To mimic the human brain processing, how they actually think. If somehow machines are able to think that way they will automatically generate the right output.
Interpreting Results | It is easy to explain the results of an ML model. | It is difficult to explain the results of a deep learning model since it’s hard to interpret the output of a complex multi-layered neural network.
Performance | ML models show good performance on small and medium-sized datasets. | Deep learning models show better performance on huge datasets.
Applications | Fraud detection, Recommendation systems, Pattern recognition, and so on. | Customer support, Image processing, Speech recognition, Object recognition, Natural language processing, computer vision, and so on.
Example of ML
Example of Deep Learning
Top Applications of Deep Learning Across Industries
• Self Driving Cars
• News Aggregation and Fraud News Detection
• Natural Language Processing
• Virtual Assistants
• Entertainment
• Visual Recognition
• Fraud Detection
• Healthcare
• Personalisations
• Detecting Developmental Delay in Children
• Adding sounds to silent movies
• Automatic Machine Translation
• Automatic Handwriting Generation
• Automatic Game Playing
• Language Translations
• Pixel Restoration
• Photo Descriptions
• Demographic and Election Predictions
• Deep Dreaming
• Colourisation of Black and White images
Applications of Deep Learning
Fundamentals
• Weights and Biases: The knowledge of a neural network is stored in the weights and
biases that are learnt during the training process. These weights determine the influence
that a given input (or neuron) has on an output.

• Forward Propagation: Forward propagation is the process in which the neural network
makes its predictions. Starting from the input layer, it propagates the input through the
network, layer by layer, until it reaches the output layer. At each neuron, it multiplies the
input by the weights, adds the bias, and applies the activation function to generate the
output.

• Activation Function: The activation function decides whether a neuron should be
activated or not by transforming the weighted sum of the inputs and the bias. Common
activation functions include Sigmoid, Tanh, and ReLU.
Fundamentals
• Loss Function: A loss function measures the difference between the
predicted output of the neural network and the actual output during
training. The goal is to minimize this difference.

• Optimizer: An optimizer is an algorithm used to adjust the parameters
of your neural network, such as weights and biases, to reduce the loss.
Some optimizers also adapt the learning rate during training.

• Backpropagation: This is a method used to calculate the gradient of the
loss function with respect to each weight in the neural network, which in
turn is used to update the weights and reduce the loss.
Fundamentals
• Overfitting: Overfitting occurs when a model learns the training data too
well, including its noise and outliers, and performs poorly on unseen
data.

• Underfitting: Underfitting is the opposite, where the model fails to learn
the underlying patterns of the data.

• Regularization: Techniques such as dropout, weight decay, and early stopping
are used to prevent overfitting by adding a penalty to the loss function, by
altering the architecture of the neural network, or by halting training early.
Building Blocks of a Neural Network
• There are two building blocks of a Neural Network: Layers and Neurons
• A neural network is made up of vertically stacked components
called Layers.
• There are three types of layers in a NN-
Building Blocks of a Neural Network
• Input Layer– First is the input layer. This layer will accept the data and pass it to the
rest of the network.

• Hidden Layer– The second type of layer is called the hidden layer. A neural network
has one or more hidden layers. In the above case, the number is 1.
Hidden layers are the ones that are actually responsible for the excellent performance
and complexity of neural networks. They perform multiple functions at the same time,
such as data transformation and automatic feature creation.

• Output layer– The last type of layer is the output layer. The output layer holds the result
or the output of the problem.
Building Blocks of a Neural Network
• A layer consists of small individual units called neurons.
• An artificial neuron is similar to a biological neuron. It receives input from
the other neurons, performs some processing, and produces an output.
• The simplest neural network is the “perceptron”, which consists of a single
neuron.
• A biological neuron receives electrical signals through its
dendrites, modulates the signals by various amounts, then fires an
output signal through its synapses only when the total strength of the input
signals exceeds a certain threshold. The output is then fed to another neuron,
and so forth.
Perceptron
Biological Neuron
Artificial Neuron
Biological Neuron vs Artificial Neuron
• To model the biological neuron phenomenon, the artificial neuron
performs two consecutive functions:

1) calculates the weighted sum of the inputs to represent the total
strength of the input signals and

2) applies a step function to the result to determine whether to fire an
output 1 if the signal exceeds a certain threshold or 0 if the signal
doesn’t exceed the threshold.
• Not all input features are equally important (or useful features). Each
input feature (xi) is assigned its own weight (wi) that reflects its
importance in the decision making process.
In the perceptron diagram above, you can see the following:
1. Input vector
2. Weights vector
3. Neuron functions
4. Output
• The calculations that happen inside the neuron: 1) weighted sum
and 2) step function.
1) Weighted Sum Function: Also known as linear combination. It
is the sum of all inputs multiplied by their weights, then added to
a bias term.
This function produces a straight line represented by the
following equation:
z = ∑xi.wi + b
• Step Activation Function: The activation function takes the same
weighted sum input from before, z = ∑xi.wi + b, and activates (fires)
the neuron if the weighted sum is higher than a certain threshold.

• The simplest activation function used by the perceptron algorithm is
the “step function” that produces a binary output (0 or 1). It basically
says that, if the summed input ≥ 0, then it "fires" (output = 1). Else
(summed input < 0) it doesn't fire (output = 0).
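The two neuron functions described above (weighted sum, then step function) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the sample values of x, w and b are assumptions made for the example, not taken from the slides.

```python
def weighted_sum(inputs, weights, bias):
    """Linear combination z = sum(x_i * w_i) + b."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def step(z):
    """Step activation: fire (1) when z >= 0, otherwise stay silent (0)."""
    return 1 if z >= 0 else 0


def perceptron_predict(inputs, weights, bias):
    """Apply the two consecutive functions of the artificial neuron."""
    return step(weighted_sum(inputs, weights, bias))


# Hypothetical values, chosen only for illustration.
x = [1.0, 0.0]
w = [0.5, -0.5]
b = -0.25

print(weighted_sum(x, w, b))        # weighted sum z
print(perceptron_predict(x, w, b))  # binary output of the neuron
```

Swapping the two inputs (x = [0.0, 1.0]) makes the weighted sum negative, so the same neuron no longer fires.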
How does the perceptron learn?
• The neuron uses trial and error to learn from its mistakes. The perceptron’s
learning logic goes like this:
1. The neuron calculates the weighted sum and applies the activation function to
make a prediction ŷ. This is called the feedforward process.

ŷ = activation (∑xi . wi + b)
2. It then compares the prediction with the actual value to calculate the error:
error = y - ŷ
3. Update the weight: if the prediction is too high, it will adjust the weights to
make a lower prediction next time and vice versa.
4. Repeat!
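The four steps above can be sketched as a small training loop. The slides do not give the exact update formula, so this sketch assumes the classic perceptron rule (w ← w + lr · error · x, b ← b + lr · error). The AND-gate data, the learning rate and the epoch count are also assumptions chosen for illustration.

```python
def predict(x, w, b):
    """Step 1: weighted sum followed by the step activation."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if z >= 0 else 0


def train_perceptron(samples, lr=1.0, epochs=20):
    """Train a single neuron with the (assumed) classic perceptron rule."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):                 # Step 4: repeat
        for x, y in samples:
            error = y - predict(x, w, b)    # Step 2: error = y - y_hat
            # Step 3: nudge the weights against the error.
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b


# Hypothetical, linearly separable toy dataset: the logical AND gate.
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_gate)
print([predict(x, w, b) for x, _ in and_gate])
```

When the prediction is too high the error is negative and the weights shrink; when it is too low the error is positive and they grow, which matches step 3.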
Is one neuron enough to solve complex problems?
• No. The perceptron is a linear function. It works well only with simple datasets that
can be separated by a straight line.
Multi-Layer Perceptron Architecture
• A common neural network architecture stacks the neurons in layers
on top of each other; the layers in the middle are called hidden layers. Each layer has n
neurons. Layers are connected to each other by weighted connections.
This leads to the Multi-Layer Perceptron (MLP) architecture
Multi-Layer Perceptron
The learning process is a repetition of three main steps:
1) Feedforward calculations to produce a prediction (weighted sum
and activation),
2) Calculate the error, and
3) Backpropagate the error and update the weights to minimize the
error
Feedforward/Forward Pass
• The process of computing the linear combination and applying the activation
function is called Feedforward.

• This process happens through the implementation of two consecutive
functions:

1) the weighted sum, and

2) the activation function.

In short, the forward pass is the calculations through the layers to make a
prediction
• Let’s take a look at this simple three-layer neural network and
explore each of its components:
Feedforward calculations
• Calculations at layer1 ,

• Calculations at layer2 ,

• Output prediction in layer 3:
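The layer-by-layer calculations can be sketched as follows. The original slide's network and its numbers are not recoverable here, so the layer sizes (2-3-2-1), the weights, the biases and the choice of sigmoid in every layer are all assumptions made purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the weights of one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed the input through every layer in turn and return the prediction."""
    a = x
    for weights, biases in layers:
        a = dense(a, weights, biases, sigmoid)
    return a


# Hypothetical three-layer network: 2 inputs -> 3 -> 2 -> 1 output.
layers = [
    ([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]], [0.0, 0.1, -0.1]),  # layer 1
    ([[0.5, -0.3, 0.2], [0.1, 0.1, -0.4]], [0.0, 0.2]),          # layer 2
    ([[0.7, -0.6]], [0.05]),                                     # layer 3
]

y_hat = forward([1.0, 2.0], layers)
print(y_hat)
```

Each layer's output becomes the next layer's input, which is all the forward pass does.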


Backward Pass/Backpropagation
• Backpropagation or backward pass is propagating derivatives of the
error, with respect to each specific weight dE/dwi from the last layer
(output) back to the first layer (inputs) to adjust weights.
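As a minimal sketch of what "propagating derivatives of the error" means, the example below computes dE/dwi for a single sigmoid neuron using the chain rule and checks it against a numerical estimate. The squared-error loss E = ½(y − ŷ)², the sigmoid activation and all numbers are assumptions for illustration, not the slides' own setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)


def loss(x, w, b, y):
    """Assumed loss: E = 0.5 * (y - y_hat)^2."""
    return 0.5 * (y - forward(x, w, b)) ** 2


def gradients(x, w, b, y):
    """Chain rule: dE/dw_i = dE/dy_hat * dy_hat/dz * dz/dw_i."""
    y_hat = forward(x, w, b)
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)  # dE/dz
    return [delta * xi for xi in x], delta       # dE/dw, dE/db


# Hypothetical values.
x, w, b, y = [1.0, -2.0], [0.3, 0.8], 0.1, 1.0
dw, db = gradients(x, w, b, y)

# Numerical check of dE/dw_0 with a central finite difference.
eps = 1e-6
numeric = (loss(x, [w[0] + eps, w[1]], b, y)
           - loss(x, [w[0] - eps, w[1]], b, y)) / (2 * eps)
print(dw[0], numeric)
```

In a multi-layer network the same `delta` term is passed backwards from layer to layer, which is where the name backpropagation comes from.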
Neural Networks and Activation Function
• Three important steps in a single iteration of deep neural architectures:
Forward Propagation, Backward Propagation, and Gradient
Descent (Optimization).
1. Forward Propagation: input data is fed in the forward direction
through each layer of the neural network. The linear calculation takes
place in this step and the activation function is applied.
2. Back Propagation: calculation of all the derivatives which will be
further used for Optimization or updating the parameters.
3. Optimization: This step helps in the convergence of the loss function
by continuously updating the parameters in each iteration. Some
optimization algorithms are as follows: Gradient Descent, Momentum,
Adam, AdaGrad, RMSProp, etc.
ACTIVATION FUNCTIONS
Introduction to Activation Functions
• Activation functions are functions applied in a neural network to the
weighted sum of inputs and biases; the result decides whether a
neuron is activated or not.
• An activation function transforms the data presented to the neuron and produces the output
that is passed on through the network.
• Activation functions are also referred to as transfer functions in some literature.
• They can be either linear or nonlinear, depending on the function they represent, and
are used to control the output of neural networks across different domains.
What is an Activation Function?

• The Activation Functions can be basically divided into 3 types


1. Binary Step Function
2. Linear Activation Function
3. Non-linear Activation Functions
WHY ACTIVATION FUNCTIONS?
The primary role of the Activation Functions is to transform the summed
weighted input from the node into an output value to be fed to the next
hidden layer or as output.
Binary Step Function
• Binary step function depends on a threshold value
that decides whether a neuron should be activated or
not.

• The input fed to the activation function is compared
to a certain threshold; if the input is greater than it,
then the neuron is activated, else it is deactivated,
meaning that its output is not passed on to the next
hidden layer.
Binary Step Function
• Binary Step Function is the simplest activation
function that exists, and it can be implemented with
simple if-else statements in Python.

• Binary step functions are generally used when creating a binary
classifier. However, a binary step function
cannot be used when the target variable has
more than two classes (multiclass classification).
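The if-else implementation mentioned above could look like the sketch below. The default threshold of 0 and the choice to fire when the input equals the threshold (matching the "summed input ≥ 0" rule used for the perceptron) are assumptions.

```python
def binary_step(z, threshold=0.0):
    """Return 1 (activated) if z reaches the threshold, else 0 (deactivated)."""
    if z >= threshold:
        return 1
    else:
        return 0


for z in (-2.0, -0.3, 0.0, 0.3, 2.0):
    print(z, binary_step(z))
```

Because the output can only be 0 or 1, a single unit of this kind can separate two classes at most.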
Linear Activation Function
• The linear activation function, also known as "no
activation," or “identity function”, is where the
activation is proportional to the input.

• The function doesn't do anything to the weighted
sum of the input, it simply spits out the value it
was given.
Linear Activation Function
• The linear function has an equation similar to that of a
straight line, i.e. y = Wx + b

• It takes the inputs (Xi’s) multiplied by the weights
(Wi’s) for each neuron and creates an output
proportional to the input. In simple terms, the output is proportional to the weighted
sum of the inputs.
Non-Linear Activation Functions
Sigmoid / Logistic Activation Function
• This function takes any real value as input and
outputs values in the range of 0 to 1.

• The larger the input (more positive), the
closer the output value will be to 1.0, whereas
the smaller the input (more negative), the
closer the output will be to 0.0, as shown
below.
Sigmoid / Logistic Activation Function

Here’s why the sigmoid/logistic activation function is one of the most widely
used functions:

• It is commonly used for models where we have to predict the probability as an
output.

• The function is differentiable and provides a smooth gradient, i.e., preventing
jumps in output values. This is represented by an S-shape of the sigmoid
activation function.

The limitations of the sigmoid function are:

• The derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x))


Sigmoid / Logistic Activation Function

• From the Figure, the gradient values are only
significant for range -3 to 3, and the graph gets
much flatter in other regions.

• It implies that for values greater than 3 or less
than -3, the function will have very small
gradients. As the gradient value approaches zero,
the network ceases to learn and suffers from
the Vanishing gradient problem.
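A quick numerical sketch of the point above: the sigmoid σ(x) = 1 / (1 + e^(−x)) and its derivative σ(x)·(1 − σ(x)), evaluated at a few inputs. The sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """f'(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-6, -3, 0, 3, 6):
    print(x, round(sigmoid(x), 4), round(sigmoid_derivative(x), 4))
```

The derivative peaks at 0.25 for x = 0 and shrinks quickly as |x| grows, which is why saturated sigmoid units pass back almost no gradient.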
Tanh Function (Hyperbolic Tangent)
• Tanh function is similar to the sigmoid
activation function, and even has the same S-
shape with the difference in output range of -1
to 1.
• In Tanh, the larger the input (more positive),
the closer the output value will be to 1.0,
whereas the smaller the input (more negative),
the closer the output will be to -1.0.
Tanh Function (Hyperbolic Tangent)

Advantages of using Tanh activation function are:


• The output of the tanh activation function is zero-centered; hence we can easily
map the output values as strongly negative, neutral, or strongly positive.

• It is usually used in hidden layers of a neural network, as its values lie between -1 and
+1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This
helps in centering the data and makes learning for the next layer much easier.
ReLU Activation Function
• ReLU stands for Rectified Linear Unit: f(x) = max(0, x)

• ReLU has a derivative function that allows
backpropagation while simultaneously making it
computationally efficient.

• The neurons will only be deactivated if the output of
the linear transformation is less than 0.


Leaky ReLU Function
• Leaky ReLU is an improved version of ReLU
function to solve the Dying ReLU problem as it
has a small positive slope in the negative area.

f(x)=max(0.01*x , x)
Leaky ReLU Function
• Leaky ReLU is defined to address the problem of dying/dead neurons
• The problem is addressed by introducing a small
slope: negative values are scaled by a small constant (0.01 above), which enables their
corresponding neurons to “stay alive”
• The function and its derivative are both monotonic
• It passes a small, non-zero gradient for negative inputs during backpropagation
• It is efficient and easy to compute
• The derivative of Leaky ReLU is 1 when x > 0 and equals the small slope
(0.01 above) when x < 0
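A minimal sketch comparing ReLU and Leaky ReLU and their derivatives. The slope of 0.01 comes from the formula f(x) = max(0.01*x, x) above; using the negative-side value for the derivative at exactly x = 0 is an assumed convention, since the true derivative is undefined there.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for x <= 0: this is what lets a neuron "die".
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return max(slope * x, x)


def leaky_relu_grad(x, slope=0.01):
    # A small, non-zero gradient survives for negative inputs.
    return 1.0 if x > 0 else slope


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For negative inputs ReLU returns a zero gradient, so the weights feeding that neuron stop updating, while Leaky ReLU still passes a gradient of 0.01.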
Parameterised ReLU
• This is another variant of ReLU that aims to solve the problem of gradients
becoming zero for the left half of the axis. The parameterised ReLU, as the
name suggests, introduces a new parameter as the slope of the negative part of
the function. Here’s how the ReLU function is modified to incorporate the
slope parameter:
f(x) = x, x >= 0
f(x) = ax, x < 0
Parameterised ReLU
• When the value of a is fixed to 0.01, the function acts as a Leaky
ReLU function. However, in the case of a parameterised ReLU
function, ‘a‘ is also a trainable parameter. The network learns
the value of ‘a‘ for faster and better convergence.
• The derivative of the function is the same as that of the Leaky ReLU
function, except that the value 0.01 is replaced with the value of a:
f'(x) = 1, x >= 0
f'(x) = a, x < 0
• The parameterised ReLU function is used when the Leaky ReLU
function still fails to solve the problem of dead neurons and the
relevant information is not successfully passed to the next layer
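The sketch below shows what "a is also a trainable parameter" means in practice: besides the derivative with respect to x given above, the function has a derivative with respect to a (equal to x for x < 0 and 0 otherwise), so a can be updated by gradient descent like any weight. The starting value a = 0.25, the learning rate and the upstream gradient are hypothetical numbers.

```python
def prelu(x, a):
    """f(x) = x for x >= 0, a * x for x < 0."""
    return x if x >= 0 else a * x


def prelu_grad_x(x, a):
    """f'(x) = 1 for x >= 0, a for x < 0."""
    return 1.0 if x >= 0 else a


def prelu_grad_a(x, a):
    """df/da = 0 for x >= 0, x for x < 0 (this is what makes a trainable)."""
    return 0.0 if x >= 0 else x


# One hypothetical gradient-descent step on the slope parameter a.
a, lr = 0.25, 0.01
x = -4.0
upstream = 1.0  # assumed gradient of the loss w.r.t. the activation output
a_new = a - lr * upstream * prelu_grad_a(x, a)
print(prelu(x, a), a_new)
```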
Softmax Function
• Used when trying to handle multiple classes.
The softmax function squeezes the
output for each class between 0 and 1 by exponentiating each output and
dividing by the sum of the exponentials, so that the results sum to 1.
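A minimal softmax sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result. The example scores are made up.

```python
import math


def softmax(scores):
    """Turn raw class scores into values in (0, 1) that sum to 1."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for three classes
print(probs, sum(probs))
```

The largest score gets the largest share, and the outputs can be read as class probabilities.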
Swish
• It is a self-gated activation function
developed by researchers at Google.

• Swish consistently matches or
outperforms ReLU activation function
on deep networks applied to various
challenging domains such as image
classification, machine translation etc.
Swish
• Mathematically it can be represented
as: f(x) = x * sigmoid(x)
Swish
Here are a few advantages of the Swish activation function over ReLU:
• Swish is a smooth function, which means that it does not abruptly
change direction like ReLU does near x = 0. Rather, it smoothly
bends from 0 towards values < 0 and then upwards again.
• Small negative values are zeroed out by the ReLU activation function.
However, those negative values may still be relevant for capturing
patterns underlying the data. Swish keeps them, while large negative values are
still driven towards zero for reasons of sparsity, making it a win-win situation.
• Being non-monotonic, the Swish function enhances the expression
of the input data and the weights to be learnt.
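The properties listed above can be checked numerically with the standard definition swish(x) = x · sigmoid(x). The sample inputs are arbitrary.

```python
import math


def swish(x):
    """swish(x) = x * sigmoid(x) = x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(swish(x), 4))
```

The output dips below zero for small negative inputs and then climbs back towards zero for large negative inputs, so the function is not monotonic.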
Activation for Hidden Layers

• There are perhaps three activation functions you may want to consider for
use in hidden layers; they are:

1. Rectified Linear Activation (ReLU)

2. Logistic (Sigmoid)

3. Hyperbolic Tangent (Tanh)

• This is not an exhaustive list of activation functions used for hidden
layers, but they are the most commonly used.
Activation for Output Layers

• The choice of activation function depends on whether the node is in a hidden
layer or the output layer, and on the type of problem we are dealing with.

Activation function for your output layer:

• Regression - Linear Activation Function

• Binary Classification - Sigmoid/Logistic Activation Function

• Multiclass Classification - Softmax
What is a Good Activation Function?
• A proper choice has to be made in choosing the activation function to
improve the results in neural network computing. A good activation
function should be differentiable (at least almost everywhere) and should converge quickly
with respect to the weights for optimization purposes. Monotonicity is common
but not required: Swish, for example, is non-monotonic.
Vanishing gradient & Exploding gradient
• In a network of n hidden layers, n derivatives will be multiplied together.
If the derivatives are large then the gradient will increase exponentially as
we propagate down the model until they eventually explode, and this is
what we call the problem of exploding gradient.

• Alternatively, if the derivatives are small then the gradient will decrease
exponentially as we propagate through the model until it eventually
vanishes, and this is the vanishing gradient problem.
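A toy calculation makes the exponential effect concrete. It assumes, purely for illustration, that every layer contributes the same local derivative: 0.25 (the maximum slope of the sigmoid) for the vanishing case and 1.5 for the exploding case.

```python
def backprop_factor(local_derivative, n_layers):
    """Product of n identical per-layer derivatives."""
    return local_derivative ** n_layers


for n in (5, 20, 50):
    print(n, backprop_factor(0.25, n), backprop_factor(1.5, n))
```

Real networks have different derivatives in each layer, but the same multiplicative behaviour drives both problems.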
Exploding gradient
Understanding Exploding Gradient in Deep Learning
• Gradient Descent:
    • In deep learning, we use a method called gradient descent to train neural networks.
    • Gradient descent adjusts the weights (like knobs) in the network to minimize errors
    and improve predictions.
• Gradient Explained:
    • The "gradient" measures how much we need to change each weight to make the
    network better at its job.
    • It’s like figuring out which knobs to turn to make a machine work perfectly.
• What is Exploding Gradient?:
    • Exploding gradient happens when these gradients become very large during training.
    • Instead of small, manageable changes to the weights, the gradients grow so big that
    they make the weights change wildly.
Exploding Gradient
• Why It’s a Problem:
    • When gradients explode, the network’s weights can change so much that the
    model becomes unstable.
    • It can lead to the network making unpredictable predictions or even crashing
    during training.
• Causes:
    • Exploding gradients often happen in deep networks with many layers (like
    many floors in a skyscraper).
    • If the gradients get bigger and bigger as they pass through each layer, they
    can explode at the end.
• Effects:
    • Training becomes slow and unstable, because very small
    steps (learning rates) are needed to keep the gradients from exploding.
    • It can also affect how well the network learns and how accurate its
    predictions are.
Dealing with Exploding Gradients
• Gradient Clipping:
    • One technique to handle exploding gradients is called gradient clipping.
    • It limits the size of the gradients during training so they don’t get too big.
• Choosing Learning Rates:
    • We also carefully choose learning rates (step sizes) that are small enough to
    keep the weight updates from blowing up.
    • It’s like deciding how big each step should be to climb a mountain safely.
• Normalization Techniques:
    • Using normalization methods like batch normalization helps keep the
    gradients stable as they pass through each layer of the network.
    • It’s like making sure everything is balanced and not too extreme.
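Gradient clipping can be sketched in a few lines. The version below clips by the overall (L2) norm of the gradient vector, which is one common variant; the slides do not say which variant is meant, so treat this choice, the function name and the numbers as assumptions.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)          # already small enough: leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]


# Hypothetical exploding gradient with norm 50, clipped to norm 5.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(clipped)
```

Clipping keeps the direction of the gradient and only limits its length, so the weight update still points the same way but cannot become arbitrarily large.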
Exploding gradient
• Summary
• Exploding gradient is when the changes in a neural network
become too big and unstable during training. It happens in
deep learning because of large gradients passing through
many layers. To deal with it, we use techniques like gradient
clipping and careful adjustment of learning rates to keep the
training stable and make sure our networks learn effectively.
Exploding gradient
Exploding Gradient | Vanishing Gradient
There is an exponential growth in the model parameters. | The parameters of the higher layers change significantly whereas the parameters of lower layers would not change much (or not at all).
The model weights may become NaN during training. | The model weights may become 0 during training.
The model weights may become NaN during training. | The model learns very slowly and perhaps the training stagnates at a very early stage just after a few iterations.
What Is Convergence In Neural Network
• Convergence in neural networks is a process of adjusting
the weights and biases to improve the accuracy of a model.

OR

• Convergence refers to the process by which a deep neural network
gradually improves its performance through repeated iterations of
training. It involves adjusting the network's weights and biases to
minimize the difference between predicted and actual outputs.
What Is Convergence In Neural Network
• The goal of convergence is to find the optimal weights and biases
that will minimize the error rate of the model.

• Convergence is an important part of training a neural network and is
essential for achieving the best possible accuracy.
How Does Convergence Work?
• Convergence is done by using an optimization algorithm such as
gradient descent.

• The algorithm works by calculating the error rate of the model and
then adjusting the weights and biases accordingly.

• This process is repeated until the error rate is minimized and the
model is optimized.
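The loop described above can be sketched on a toy problem. Instead of a neural network, this example minimizes the one-parameter error function E(w) = (w − 3)², and it treats "converged" as "the update has become smaller than a tolerance". The error function, learning rate, tolerance and iteration cap are all assumptions for illustration.

```python
def gradient_descent(grad, w0, lr=0.1, tol=1e-8, max_iters=10_000):
    """Repeat w <- w - lr * grad(w) until the update is smaller than tol."""
    w = w0
    for i in range(max_iters):
        step = lr * grad(w)
        w -= step
        if abs(step) < tol:        # updates have become negligible: converged
            return w, i + 1
    return w, max_iters            # gave up: did not converge in time


# Toy error function E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, iters = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w, iters)
```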
Why does a neural net fail to converge?
• Too few nodes may be a reason behind this issue:
a model with too few nodes lacks the capacity to model the data and would need a drastic
change of architecture, so it fails to converge.

• The amount of training data is low, or the data fed to the
model is corrupted or was not collected with data integrity in mind.

• The activation function in use may give good
results on simpler problems, but if the complexity is higher the model can fail to
converge.
Why does a neural net fail to converge?
• Inappropriate weights (for example, poor initialization) can also cause a
failure in convergence. The weights applied to the network
should be chosen to suit the activation function.

• The learning rate given to the network should
be moderate, which means it should be neither too large nor too
small.
Difficulties in Convergence
• It can be difficult to find the optimal weights and biases for a model.
This is because the optimization process is complex and requires a
lot of trial and error.

• The optimization process can be time-consuming and can require a
lot of computing power.

• The optimization process can be prone to errors, which can lead to
inaccurate results.
How To Improve Convergence?
There are several ways to improve the convergence process.

• One way is to use a more sophisticated optimization algorithm such as
Adam or RMSProp. These algorithms are more efficient and can help to
reduce the amount of time needed to optimize a model.

• It is important to use a large enough dataset to ensure that the model is
properly trained.

• Finally, it is important to use regularization techniques such as
dropout or batch normalization to reduce the risk of overfitting.
Remedies for convergence failure
• Implementing momentum: sometimes convergence depends on the data, and
the data can make the model's error oscillate like the teeth of a hair comb.
Implementing momentum can help in avoiding
convergence failure and also helps in boosting the accuracy and speed of the model.

• Reinitializing the weights of the network can help in avoiding failure
of convergence.

• If the training is stuck in a local minimum and subsequent sessions have
exceeded the maximum number of iterations, the session has failed and we will get a
higher error. In such a situation, starting another session can be helpful.
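The momentum remedy listed above can be sketched as follows. This uses the common "velocity" form of the update (v ← β·v − lr·gradient, then w ← w + v); the slides do not specify a formula, so the form, the coefficient β = 0.9, the learning rate and the toy error function E(w) = (w − 3)² are assumptions.

```python
def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=500):
    """Gradient descent with momentum on a single parameter."""
    w, v = w0, 0.0
    for _ in range(steps):
        # The velocity remembers a decaying sum of past gradients,
        # which smooths out zig-zagging updates.
        v = beta * v - lr * grad(w)
        w += v
    return w


# Toy error function E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w)
```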
Remedies for convergence failure
• Changing the activation function can be helpful. For example, when using a ReLU
activation, a neuron can become so biased towards negative inputs that it is
never activated. In such a situation, changing to another
activation function can be helpful.

• When performing classification using neural networks, shuffling
the training data can help avoid failure in convergence.

• The learning rate and the number of epochs should be balanced when modelling
a network. A lower learning rate causes the convergence to happen in
smaller steps and therefore needs more epochs, while a bigger number of epochs means a long wait for
convergence to appear. An excessively high learning rate or number of epochs should be
avoided to make the neural network converge faster.
COST FUNCTION VS LOSS FUNCTION
Loss Functions
• Loss functions are one of the most important aspects of neural networks, as they are
directly responsible for fitting the model to the given training data.

• A neural network processes the input data at each layer and eventually produces a
predicted output value.

• Each training input is loaded into the neural network in a process called forward
propagation. Once the model has produced an output, this predicted output is
compared against the given target output using the loss function. In a process called
backpropagation, the weights and biases of the model are then adjusted so that it now
outputs a result closer to the target output.


Loss Functions
• A loss function is a function that compares the target and predicted output
values; it measures how well the neural network models the training data.

• The parameters are adjusted to minimize the average loss: we find the
weights, wT, and biases, b, that minimize the value of J (average loss).

• In regression, for example, the loss measures the distance of the actual y values from the
regression line (predicted values), the goal being to minimize the net distance.
Types of Loss Functions
There are two main types of loss functions

• Regression Loss Functions - used in regression neural networks;
Ex. Mean Squared Error, Mean Absolute Error

• Classification Loss Functions - used in classification neural networks;
Ex. Binary Cross-Entropy, Categorical Cross-Entropy


Types of Loss Functions

Mean Squared Error (MSE): One of the most popular loss functions,
MSE finds the average of the squared differences between the target and the
predicted outputs:

MSE = (1/n) Σ (y_i − ŷ_i)², summed over the n training examples

• One disadvantage of this loss function is that it is very sensitive to outliers;
if a predicted value is significantly greater than or less than its target value,
the squared error significantly increases the loss.
Types of Loss Functions
Mean Absolute Error (MAE): MAE finds the average of the absolute
differences between the target and the predicted outputs:

MAE = (1/n) Σ |y_i − ŷ_i|, summed over the n training examples

• This loss function is used as an alternative to MSE. As mentioned previously,
MSE is highly sensitive to outliers, which can dramatically affect the loss
because the distance is squared. MAE is used when the training data
has a large number of outliers, to mitigate their effect.

• It also has a disadvantage: the derivative of the absolute value at 0 is undefined
(the slope jumps from −1 to +1), so the gradient does not shrink as the error
approaches 0 and gradient descent may struggle to settle at the minimum.
Types of Loss Functions

• A loss function called the Huber Loss was developed, which has
the advantages of both MSE and MAE.

• If the absolute difference between the actual and predicted value
is less than or equal to a threshold value, 𝛿, then MSE is applied.
Otherwise, if the error is sufficiently large, MAE is applied.
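The three regression losses above can be compared side by side. The following is a minimal sketch in plain Python; the function names, the sample values, and the default threshold `delta=1.0` are assumptions made for illustration, and the Huber branch uses the usual 1/2 factor on the squared term.

```python
def mse(targets, preds):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


def mae(targets, preds):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)


def huber(targets, preds, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(targets, preds):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(targets)


targets = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 2.5, 3.5]      # every error is 0.5
outlier = [1.5, 2.5, 2.5, 14.0]   # last prediction is off by 10

for name, loss in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(name, loss(targets, clean), loss(targets, outlier))
```

With the single outlier, MSE grows from 0.25 to about 25.2, while MAE and Huber grow far less, which is the sensitivity the slides describe.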
Types of Loss Functions
Binary Cross-Entropy/Log Loss
• This is the loss function used in binary classification models, where
the model takes in an input and has to classify it into one of two
preset categories.

• In binary classification, there are only two possible actual values of y:
0 or 1. Thus, to determine the loss between the actual and
predicted values, the function compares the actual value (0 or 1) with
the probability that the input belongs to that category (p(i) =
probability that the category is 1; 1 − p(i) = probability that the
category is 0):

BCE = −(1/N) Σ [y(i)·log(p(i)) + (1 − y(i))·log(1 − p(i))]
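As a rough illustration of binary cross-entropy, the sketch below averages the log loss over a few labelled examples. The function name, the clipping constant `eps`, and the sample probabilities are assumptions for the example, not values from the slides.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Average log loss; probs[i] is the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


labels = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.2]   # mostly right and confident
bad_probs = [0.1, 0.9, 0.2, 0.8]    # confidently wrong

print(binary_cross_entropy(labels, good_probs))
print(binary_cross_entropy(labels, bad_probs))
```

Confidently wrong predictions are penalized much more heavily than confident correct ones.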
Types of Loss Functions

Categorical Cross-Entropy Loss


• In cases where the number of classes is greater than two, we
utilize categorical cross-entropy, which follows a very similar
process to binary cross-entropy:

CCE = −Σ y_c·log(p_c), summed over the M classes c = 1..M

• Binary cross-entropy is a special case of categorical cross-entropy
where M = 2, that is, the number of categories is 2.
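The relationship between the two losses can be checked numerically. This sketch assumes one-hot targets and a per-example loss of −Σ y_c·log(p_c); the names and sample values are illustrative only.

```python
import math


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one example: -sum over classes of y_c * log(p_c)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


# Three classes: the true class is the second one.
loss_three = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

# M = 2: with p = P(class 1), the class probabilities are [1 - p, p]
# and a label y becomes the one-hot vector [1 - y, y].
p, y = 0.8, 1
loss_two_class = categorical_cross_entropy([1 - y, y], [1 - p, p])
loss_binary = -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(loss_three, loss_two_class, loss_binary)
```

For M = 2 the categorical and binary forms give the same value.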
VANISHING GRADIENT PROBLEM
• What it is: The vanishing gradient problem happens when the gradients
used to update the weights in a neural network become very small.
• Why it happens: This often occurs with activation functions like sigmoid
or tanh, whose derivatives are small over most of their range (at most 0.25
for sigmoid); multiplying these small values through many layers makes the
gradients tiny.
• Effect: This causes the network to learn very slowly or not at all,
especially in the early layers, resulting in poor performance.
Derivative of sigmoid activation function
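The sigmoid derivative is σ'(z) = σ(z)·(1 − σ(z)), which peaks at 0.25 when z = 0. The sketch below multiplies that best-case factor across several layers to suggest how quickly a backpropagated gradient can shrink; it ignores the weight terms, which is a simplifying assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


peak = sigmoid_derivative(0.0)       # largest value the derivative can take
saturated = sigmoid_derivative(5.0)  # far from 0 the derivative is much smaller

# Best-case product of activation derivatives across n sigmoid layers.
for n_layers in (1, 5, 10, 20):
    print(n_layers, peak ** n_layers)
```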
• https://chatgpt.com/share/ff4ad236-0734-406c-8682-46975b6fa9c9
Optimizers
• Optimizers are optimization methods that help improve a
deep learning model’s performance.
• They are algorithms used to find the optimal set of parameters
for a model during the training process.
• These algorithms adjust the weights and biases in the model
iteratively until they converge on a minimum loss value.
• The choice of optimizer strongly affects the
accuracy and training speed of the deep learning model.
Gradient Descent
• Gradient Descent (GD) is a fundamental optimization algorithm used in machine learning
and neural networks to minimize a loss function by iteratively adjusting the weights
(parameters) of the model.
• Objective: The goal of Gradient Descent is to find the set of parameters (weights) for a
machine learning model that minimizes a given loss function L(θ).
• Process:
• Initialization: Start with initial values for the model parameters θ.
• Compute Gradient: Calculate the gradient (derivative) of the loss function with
respect to each parameter, ∇L(θ).
• Update Parameters: Adjust the parameters in the opposite direction of the gradient to
minimize the loss function:
θ ← θ − η·∇L(θ)
where η (learning rate) controls the size of the steps taken during optimization.
• Iterate: Repeat the above steps until the algorithm converges, meaning the
parameters reach a point where further adjustments do not significantly reduce the
loss.
• Types of Gradient Descent:
• Batch Gradient Descent: Computes the gradient using the entire dataset.
• Stochastic Gradient Descent (SGD): Computes the gradient using one
randomly chosen sample from the dataset.
• Mini-batch Gradient Descent: Computes the gradient using a subset (batch) of
the dataset, balancing efficiency and smoothness of updates.
• Gradient Descent is crucial in training models across various domains, including
regression, classification, and deep learning, where finding optimal parameters is
essential for achieving accurate predictions.
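The initialize, compute-gradient, update, iterate loop can be sketched on a one-parameter problem. The loss L(θ) = (θ − 3)², the learning rate, and the stopping tolerance below are assumptions chosen to keep the example small.

```python
def loss(theta):
    return (theta - 3.0) ** 2


def grad(theta):
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, lr=0.1, tol=1e-10, max_steps=1000):
    """Repeat theta <- theta - lr * grad(theta) until the loss stops improving."""
    theta = theta0
    for step in range(max_steps):
        new_theta = theta - lr * grad(theta)
        if abs(loss(theta) - loss(new_theta)) < tol:
            return new_theta, step + 1
        theta = new_theta
    return theta, max_steps


theta, steps = gradient_descent(theta0=0.0)
print(theta, steps)
```

The loop stops once a further step no longer reduces the loss by more than `tol`, which is one simple way to express the convergence condition described above.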
Gradient Descent
• It is used in linear regression and classification algorithms.
• It uses the whole dataset at once in each epoch to compute the gradient and update the
weights.
• It directly uses the derivative of the loss function and the learning rate to reduce the
loss and reach a minimum.

• The weights are updated only after the gradient over the whole dataset has been calculated, which
slows down the process. It also requires a large amount of memory to store this
temporary data, making it a resource-hungry process.
Gradient Descent
Advantages
• Simple to implement.
• Can work well with a well-tuned
learning rate.
Disadvantages
• As the method calculates the gradient
for the entire data set in one update, the
calculation is very slow.
• It requires large memory and it is
computationally expensive.
Stochastic Gradient Descent
• This is a variation of GD in which the model parameters are
updated after every training sample: the loss is evaluated on
that single sample and the model is updated.
• If the dataset has 10K samples, SGD will update the model
parameters 10K times per epoch.
Stochastic Gradient Descent
• These frequent updates result in converging to the minima in less
time, but it comes at the cost of increased variance that can make the
model overshoot the required position.
Advantages:
• It can be faster than standard gradient descent, especially for large
datasets.
• Can escape local minima more easily.
Disadvantages:
• It can be noisy, leading to less stability.
• It may require more hyperparameter tuning to get good
performance.
Mini-Batch Gradient Descent
• Mini-batch gradient descent is similar to SGD, but instead of using a
single sample to compute the gradient, it uses a small, fixed-size
"mini-batch" of samples. The update rule is the same as for SGD,
except that the gradient is averaged over the mini-batch. This can
reduce noise in the updates and improve convergence.
Mini-Batch Gradient Descent
Advantages:
• It can be faster than standard gradient descent, especially for large
datasets.
• Can escape local minima more easily.
• Can reduce noise in updates, leading to more stable convergence.
Disadvantages:
• Can be sensitive to the choice of mini-batch size.
[Figure: Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent]
Summary GD vs SGD vs Mini-batch GD
• Gradient Descent (GD): Best for small datasets where computational
cost per update is manageable. Provides stable convergence.
• Stochastic Gradient Descent (SGD): Best for very large datasets or
online learning. Provides faster updates but with higher variance.
• Mini-batch Gradient Descent: Combines benefits of both GD and
SGD, suitable for large datasets, balances speed and stability, and
makes efficient use of computational resources.
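The three variants differ only in how many samples feed each update. The sketch below fits a line with one training function and switches between them through `batch_size`; the noise-free data, learning rate, and epoch count are assumptions chosen so that all three settle on the same answer.

```python
import random


def train(xs, ys, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over the batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates


xs = [i / 10 for i in range(20)]    # 0.0 ... 1.9
ys = [2.0 * x + 1.0 for x in xs]    # noise-free line y = 2x + 1

results = {}
for name, size in [("batch GD", len(xs)), ("SGD", 1), ("mini-batch", 5)]:
    results[name] = train(xs, ys, batch_size=size)
    w, b, updates = results[name]
    print(f"{name:10s} w={w:.3f} b={b:.3f} updates={updates}")
```

The update counts make the trade-off visible: one update per epoch for batch GD, one per sample for SGD, and one per batch for mini-batch GD.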
Drawbacks of base optimizers: (GD, SGD, mini-batch GD)
• Gradient Descent uses the whole training data for each update of the weights and biases.
If we have millions of records, training becomes slow and
computationally very expensive.

• SGD addresses this problem by using only a single record for each
parameter update. However, SGD is still slow to converge because it needs
forward and backward propagation for every record, and the path to the
minimum becomes very noisy.

• Mini-batch GD overcomes these SGD drawbacks by using a batch of records for each
parameter update. Since it does not use all the records for an update, the
path to the minimum is not as smooth as in Gradient Descent.
• The figure plots the number of epochs on the x-axis against the loss on the y-axis.
In Gradient Descent the loss is reduced smoothly, whereas in SGD the loss value
oscillates strongly.
What is an Epoch?
• Epoch Definition: An epoch is one full cycle through the training dataset. This
means that during one epoch, the model has seen each training example once.
• Epoch in the Training Process
a. Data Splitting: Before training starts, the dataset is usually split into training,
validation, and test sets. The training set is used to update the model parameters,
the validation set is used to tune hyperparameters and avoid overfitting, and the
test set is used to evaluate the final model performance.
b. Training Iterations:
i. Batch: Instead of processing the entire training dataset at once, the data is
often divided into smaller subsets called batches. Each batch contains a fixed
number of training examples.
ii. Iteration: One iteration refers to the process of passing one batch through the
network, performing a forward and backward pass, and updating the model
parameters accordingly.
What is an Epoch?
• Complete Epoch:
a. During one epoch, the model goes through all the batches in the training
set once.
b. If the dataset contains N examples and the batch size is B, then the
number of iterations per epoch is N/B (rounded up when B does not divide N).
• Importance of Multiple Epochs
a. Convergence: Training a model with only one epoch is typically
insufficient because the model hasn't seen enough data to learn effectively.
Multiple epochs allow the model to see the same examples multiple times,
enabling it to learn and refine its parameters.
b. Learning and Generalization: With each epoch, the model's weights are
updated, which helps the model to better generalize from the training data to
unseen data. This process helps the model to converge towards a solution
that minimizes the loss function.
Example for Epoch
• Suppose you have a dataset with 1000 examples, and you choose a batch size of 100.
Each epoch then consists of 1000/100 = 10 iterations, and every iteration processes
one batch of 100 examples and updates the parameters once.

• During each epoch, the model parameters are updated in small steps based on the
loss calculated from the batches, leading to gradual learning and improvement of the
model.
• In summary, an epoch is a critical concept in deep learning that signifies a complete
pass through the training dataset. Training for multiple epochs allows the model to
iteratively learn and refine its parameters, ultimately leading to better performance
and generalization.
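The bookkeeping for this example can be written out as a loop. The number of epochs (3) is an assumed value for illustration, and the loop only counts updates where a real training step would go.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to cover the dataset once."""
    return math.ceil(num_examples / batch_size)


num_examples, batch_size, epochs = 1000, 100, 3

total_updates = 0
for epoch in range(1, epochs + 1):
    for start in range(0, num_examples, batch_size):
        # Indices of the examples in this batch; a real loop would run the
        # forward pass, backward pass, and parameter update here.
        batch = range(start, min(start + batch_size, num_examples))
        total_updates += 1
    print(f"epoch {epoch}: {iterations_per_epoch(num_examples, batch_size)} iterations")

print("total parameter updates:", total_updates)
```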
SGD with momentum
• SGD with momentum is a variant of SGD that adds a "momentum" term to
the update rule, which helps the optimizer to continue moving in the same
direction even if the local gradient is small. The momentum coefficient β is typically
set to a value between 0 and 1.

• The momentum term is an exponentially weighted average of the gradients g_t:
V_t = β·V_(t−1) + (1 − β)·g_t

SGD with momentum
• In this algorithm, we use exponentially weighted averages of the
gradient and use this averaged gradient to update the parameters.
• The weights and biases are updated with the averaged gradient in place of the raw gradient:
w_t = w_(t−1) − η·V_t   (and likewise for b)
SGD with momentum
• SGD with momentum adds a momentum term to the
gradient update: the current update depends on the
current gradient and on the previous gradients. This
helps SGD converge faster and reduces
oscillation.

The figure above shows how convergence happens in SGD with momentum versus SGD
without momentum.
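A minimal sketch of the momentum update, assuming the exponentially weighted average form with coefficient `beta`. The quadratic loss and the zig-zag gradient sequence are made-up inputs for illustration.

```python
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """One update: average the gradient, then step along the average."""
    velocity = beta * velocity + (1.0 - beta) * grad
    return w - lr * velocity, velocity


# 1) Minimize L(w) = (w - 3)^2 with the momentum update.
w, velocity = 0.0, 0.0
for _ in range(500):
    w, velocity = momentum_step(w, velocity, grad=2.0 * (w - 3.0))
print("w after training:", w)

# 2) A zig-zagging gradient sequence: the averaged value stays small even
#    though every raw gradient has magnitude of at least 0.8.
smoothed = 0.0
for step in range(100):
    raw = 1.0 if step % 2 == 0 else -0.8
    smoothed = 0.9 * smoothed + 0.1 * raw
print("smoothed gradient:", smoothed)
```

The second loop shows the oscillation-damping effect: alternating gradients largely cancel in the average, so the parameter moves along the consistent direction.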
SGD with momentum
Advantages of SGD with momentum

• Momentum helps to reduce the noise.

• It can help to reduce oscillations and improve convergence.

Disadvantages of SGD with momentum

• Requires tuning of the momentum hyperparameter.

• Can overshoot good solutions and settle for suboptimal ones if the momentum is too high.
Adagrad (Adaptive Gradient Descent)
• Adagrad is an optimization algorithm that uses an
adaptive learning rate per parameter.

• The learning rate is updated based on the historical
gradient information, so that parameters that receive
many large updates have a lower learning rate, and parameters
that receive fewer updates have a larger learning rate.
Adagrad (Adaptive Gradient Descent)
• The Adagrad algorithm uses the formulas below, where g_i is the gradient at step i
and ε is a small constant that avoids division by zero:
G_t = g_1² + g_2² + ... + g_t²
η_t = η / √(G_t + ε)
w_t = w_(t−1) − η_t·g_t

• In the Adagrad equations above, the learning rate η_t
decreases automatically, because the sum of the squared past gradients
keeps increasing after every time step.
Adagrad (Adaptive Gradient Descent)
Advantages:
• It can work well with sparse data.
• Automatically adjusts learning rates based on parameter
updates.
Disadvantages:
• Can converge too slowly for some problems.
• Can stop learning altogether if the learning rates become too
small.
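A small sketch of the Adagrad idea for a single parameter: the squared gradients are summed, and the base learning rate is divided by the square root of that sum. The loss (w − 3)², the base rate `lr=0.5`, and `eps` are assumptions for the example.

```python
import math


def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    """Accumulate grad^2 and scale the step by 1/sqrt(accumulated sum)."""
    accum += grad ** 2
    effective_lr = lr / math.sqrt(accum + eps)
    return w - effective_lr * grad, accum, effective_lr


w, accum = 0.0, 0.0
rates = []
for _ in range(50):
    g = 2.0 * (w - 3.0)
    w, accum, effective_lr = adagrad_step(w, g, accum)
    rates.append(effective_lr)

print("first and last effective learning rate:", rates[0], rates[-1])
print("w after 50 steps:", w)
```

The effective learning rate only ever goes down, which is both the adaptive behaviour and the source of the slow-convergence drawback listed above.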
Adadelta or RMSprop (Root Mean Square Propagation)
• Adadelta and RMSprop are closely related extensions of AdaGrad that aim to remove its
decaying learning rate problem.

• They use an exponentially decaying average of the
squares of the gradients. Instead of accumulating all previous
squared gradients, Adadelta restricts the accumulation to a window of recent
gradients of some fixed size w, for example averaging the
squared gradients of the past 10 steps. In practice the window is
implemented as the exponentially decaying average.

• The formula for the new weight remains the same as in Adagrad.
Adadelta over Adagrad
• The first modification involves using an exponentially
decaying average of squared gradients instead of their
cumulative sum. This allows the optimizer to adapt to
recent gradients while forgetting the older ones, enabling
more flexibility during training.
• The second modification keeps a matching exponentially
decaying average of the squared parameter updates and uses the
ratio between the two averages as the step size, so no global
learning rate has to be set by hand. The parameter ρ (rho) is the
decay rate of these averages and controls how quickly older values are forgotten.
Adadelta
• The learning rate is calculated using the formula,
S_t = γ·S_(t−1) + (1 − γ)·g_t²
η_t = η / √(S_t + ε)

• Because of the restricting term γ, the weighted average S_t
grows at a slower rate than Adagrad's sum, so the learning rate
shrinks slowly on the way to the minimum.
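The difference from Adagrad can be seen with a constant gradient. This sketch assumes the exponentially decaying average form with decay `gamma`; the constant gradient of 1.0, the base rate, and the step count are illustrative choices.

```python
import math


def decayed_rate(avg_sq, grad, lr=0.01, gamma=0.9, eps=1e-8):
    """Update the decaying average of grad^2 and return the scaled rate."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad ** 2
    return avg_sq, lr / math.sqrt(avg_sq + eps)


def summed_rate(total_sq, grad, lr=0.01, eps=1e-8):
    """Adagrad-style: the sum of grad^2 keeps growing."""
    total_sq += grad ** 2
    return total_sq, lr / math.sqrt(total_sq + eps)


avg_sq, total_sq = 0.0, 0.0
for _ in range(1000):
    avg_sq, rate_decayed = decayed_rate(avg_sq, grad=1.0)
    total_sq, rate_summed = summed_rate(total_sq, grad=1.0)

print("decaying average:", rate_decayed)
print("cumulative sum:  ", rate_summed)
```

With the decaying average the effective rate settles near the base rate, while the cumulative sum keeps shrinking it.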
Adadelta
Advantages:

• Can work well with sparse data.

• Automatically adjusts learning rates based on parameter updates.

Disadvantages:

• Can converge too slowly for some problems.

• Adds a decay-rate hyperparameter that may need tuning.
Comparison of AdaGrad and AdaDelta:
Feature | AdaGrad (Adaptive Gradient Algorithm) | AdaDelta (Adaptive Delta)
Gradient Accumulation | Accumulates all past squared gradients | Uses exponentially decaying average of squared gradients
Learning Rate Adjustment | Decreases over time due to accumulated gradients | Maintains consistent learning rate using a moving average
Learning Rate | Requires a global learning rate | No need for manual global learning rate
Parameter Update Rule | (not shown) | (not shown)
Handling of Sparse Data | Effective due to adaptive learning rates | Handles sparse data well due to dynamic adjustment
Convergence Issues | Learning rate can become very small, slowing convergence | Avoids diminishing learning rates, ensuring stable convergence
Practical Use | Suitable for sparse datasets | Suitable for deep learning models and complex optimization tasks
Bias towards Recent Updates | Less emphasis on recent updates due to cumulative gradient sum | Higher emphasis on recent updates due to moving average
Parameter Dependency | Learning rate depends on the entire history of gradients | Learning rate depends on a fixed window of recent gradients
Adam (Adaptive Moment Estimation)
• The Adam optimizer is one of the most popular optimization algorithms.

• It is widely used in machine learning, especially deep learning, due
to its efficiency and effectiveness in handling large-scale data and
parameters.

• It is a method that computes adaptive learning rates for each
parameter. It stores both a decaying average of the past gradients,
similar to momentum, and a decaying average of the past
squared gradients, similar to RMSprop and Adadelta. Thus, it
combines the advantages of both methods.
Key Concepts of Adam Optimizer
Adam
• Exponentially weighted average of past gradients:
V_t = β·V_(t−1) + (1 − β)·g_t

• Exponentially weighted average of past squared gradients:
S_t = γ·S_(t−1) + (1 − γ)·g_t²

where β and γ are the decay (restricting) parameters taken from SGD with
Momentum and Adadelta respectively.
Adam
• Using the previous equations, the weight and bias update
formula becomes:
w_t = w_(t−1) − η·V_t / √(S_t + ε)

Advantages:

• Can converge faster than other optimization algorithms.

• Can work well with noisy data.

Disadvantages:

• It may require more tuning of hyperparameters than other algorithms.

• May perform worse than other optimizers on some types of problems.
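A minimal single-parameter sketch of the Adam update, combining the two averages. The bias-correction step comes from the published Adam algorithm and is not described in the slide text; the loss (w − 3)², the learning rate, and the decay values are assumptions for the example.

```python
import math


def adam_step(w, grad, first, second, t, lr=0.1, beta=0.9, gamma=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step number."""
    first = beta * first + (1.0 - beta) * grad            # average of gradients
    second = gamma * second + (1.0 - gamma) * grad ** 2   # average of squared gradients
    # Bias correction compensates for both averages starting at zero.
    first_hat = first / (1.0 - beta ** t)
    second_hat = second / (1.0 - gamma ** t)
    w = w - lr * first_hat / math.sqrt(second_hat + eps)
    return w, first, second


w, first, second = 0.0, 0.0, 0.0
for t in range(1, 1001):
    g = 2.0 * (w - 3.0)
    w, first, second = adam_step(w, g, first, second, t)

print("w after 1000 steps:", w)
```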


Learning Rate Decay
• In the first image, where the learning rate is constant, the steps
taken by the algorithm while iterating towards the minimum are so noisy that
after a certain number of iterations it wanders around the minimum and does not
actually converge.

• In the second image, where the learning rate is reduced over time
(represented by the green line), the learning rate is large initially, so we
still have relatively fast learning. As the algorithm approaches the minimum, the learning
rate gets smaller and smaller, so it ends up oscillating in a tighter region around
the minimum rather than wandering far away from it.
Learning Rate Decay
Learning rate decay (common method):
α = (1/(1 + decayRate × epochNumber)) × α0
• 1 epoch : 1 pass through the data
• α or η : learning rate (current epoch)
• α0 or η0 : initial learning rate
• decayRate : hyper-parameter for the method
Example: Suppose we have α0 = 0.2 and decayRate = 1; then for each
epoch the learning rate α falls as follows:
• Epoch 1: α = 0.1
• Epoch 2: α = 0.067
• Epoch 3: α = 0.05
• Epoch 4: α = 0.04
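The worked example can be reproduced in a few lines. The function name is an assumption; the formula and the values α0 = 0.2 and decayRate = 1 are the ones given above.

```python
def inverse_time_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)


schedule = [round(inverse_time_decay(0.2, 1.0, epoch), 3) for epoch in range(1, 5)]
for epoch, alpha in enumerate(schedule, start=1):
    print(f"Epoch {epoch}: alpha {alpha}")
```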
Learning Rate Decay
Exponential Decay: α = decayRate^epochNumber × α0. The decayRate of this method is always less
than 1; 0.95 is a commonly used value.

Epoch Number Based: In this method we take some constant ‘k’
and divide it by the square root of the epoch number.

Mini-batch Number Based: In this method we take some constant
‘k’ and divide it by the square root of the mini-batch number. (This
method is only used for Mini-Batch Gradient Descent.)
Learning Rate Decay
Discrete Staircase: In this method the learning rate is decreased in
discrete steps after fixed intervals, for example
halving the learning rate every 10 seconds.

Manual Decay: In this method practitioners manually monitor the
performance of the algorithm and decrease the learning rate by hand,
day by day or hour by hour.
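The other schedules can be sketched the same way. Several details here are assumptions: the square-root schedule multiplies k/√n by the initial rate α0, and the staircase schedule counts epochs rather than seconds so that it stays deterministic.

```python
import math


def exponential_decay(alpha0, decay_rate, epoch):
    """alpha = decay_rate ** epoch * alpha0, with decay_rate < 1."""
    return alpha0 * decay_rate ** epoch


def sqrt_decay(alpha0, k, step):
    """alpha = (k / sqrt(step)) * alpha0; step is an epoch or mini-batch number."""
    return k / math.sqrt(step) * alpha0


def staircase_decay(alpha0, drop, interval, epoch):
    """Multiply the rate by `drop` once every `interval` epochs."""
    return alpha0 * drop ** (epoch // interval)


for epoch in (1, 10, 25):
    print(epoch,
          exponential_decay(0.2, 0.95, epoch),
          sqrt_decay(0.2, 1.0, epoch),
          staircase_decay(0.2, 0.5, 10, epoch))
```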
Local Optima
• Definition:
• Local optima are points in the search space where the function (or loss) has the
lowest value compared to its immediate neighbors.
• It’s like finding the lowest point in the area where you’re standing, which is not
necessarily the lowest point overall.
• Example:
• Imagine you’re hiking on a mountain with many valleys and hills. A local
optimum is where you’re standing and it’s lower than the surrounding points,
but not necessarily the lowest valley on the whole mountain.
• Challenge:
• In optimization problems, finding a local optimum can be misleading because it
might not be the best solution globally (overall lowest point).
• Algorithms like gradient descent can get stuck in local optima if the initial
starting point is not well chosen.
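The dependence on the starting point can be shown on a one-dimensional function with two valleys. The function f(x) = x⁴ − 3x² + x, the learning rate, and the two starting points are assumptions picked for illustration.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x


def df(x):
    return 4 * x ** 3 - 6 * x + 1


def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= lr * df(x)
    return x


x_right = descend(2.0)    # starts on the right-hand slope
x_left = descend(-2.0)    # starts on the left-hand slope

print("from x=2:  x =", round(x_right, 3), " f(x) =", round(f(x_right), 3))
print("from x=-2: x =", round(x_left, 3), " f(x) =", round(f(x_left), 3))
```

Both runs stop at a point where the gradient is zero, but only the run started on the left reaches the lower valley; the other settles in a local optimum.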
Local Optima
Spurious Optima
• Definition:
• Spurious optima are points that look like optima (lowest points) but are actually
not the best solutions.
• These can occur due to noise or irregularities in the data, fooling the
optimization algorithm.
• Example:
• Imagine hiking in a foggy area where small dips or bumps look like low points,
but they’re just random features of the terrain, not the real valleys.
• Challenge:
• Spurious optima are problematic because they can mislead optimization
algorithms into thinking they’ve found the best solution when they haven’t.
• They can occur in complex datasets or noisy environments where the true
structure is hidden.
Spurious Optima
• Conclusion: Understanding local and spurious optima is essential for
optimizing machine learning models effectively. Local optima are the lowest
points in their neighbourhood, while spurious optima are misleading low points that are not the best
solutions. Plain gradient descent can get stuck in either; the noise in SGD and
mini-batch updates and the use of momentum help the optimizer move on towards better solutions.
Weight Initialization
• Its main objective is to prevent layer activation outputs from exploding or
vanishing during forward propagation. If either problem
occurs, the loss gradients will be too large or too small, and the network
will take more time to converge.

• Training the network without a useful weight initialization can lead to very
slow convergence or an inability to converge.

The most used weight initialization techniques are:

1. Zero Initialization (Initialized all weights to 0)

2. Random initialization
Weight Initialization
Zero Initialization (all weights initialized to 0)

• If we initialize all the weights to 0, the derivative of the loss function is the
same for every weight in W[l], so all weights keep the
same value in subsequent iterations. The hidden units stay symmetric, which makes
the model no more expressive than a linear model.
Weight Initialization
Random initialization: Assigning random values to the weights is better than
assigning all zeros. The questions are what happens if the weights are initialized to
very high or very low values, and what a reasonable initialization is.

a. If the weights are initialized with very high values, the
term np.dot(W,X)+b becomes significantly higher, and if an activation
function like sigmoid() is applied, the function maps its value near to 1,
where the gradient is close to zero; the weights change slowly and learning
takes a lot of time.
Weight Initialization
Random initialization:

b. If the weights are initialized with very low (large negative) values, the sigmoid
output gets mapped near to 0, where the gradient is again close to zero, so the case
is the same as above. This problem is often referred to as the vanishing gradient.
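Both saturation cases can be checked with a single neuron. The sketch below computes the pre-activation as a plain-Python dot product standing in for np.dot(W,X)+b; the input of 100 ones and the three weight settings are assumptions for the example.

```python
import math


def sigmoid(z):
    # Numerically stable form for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron_gradient(weights, inputs, bias=0.0):
    """Return (activation, sigmoid derivative) at z = w . x + b."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    a = sigmoid(z)
    return a, a * (1.0 - a)


inputs = [1.0] * 100

act_high, grad_high = neuron_gradient([1.0] * 100, inputs)     # z = 100
act_low, grad_low = neuron_gradient([-1.0] * 100, inputs)      # z = -100
act_small, grad_small = neuron_gradient([0.01] * 100, inputs)  # z = 1

print("large positive weights:", act_high, grad_high)
print("large negative weights:", act_low, grad_low)
print("small weights:         ", act_small, grad_small)
```

Only the small-weight case leaves a usable gradient, which is why initial weights are usually kept small and random.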


Preprocessing
• Data preprocessing refers to the cleaning, transforming,
and integrating of data in order to make it ready for
training.
• The goal of data preprocessing is to improve the quality of
the data and to make it more suitable for the specific task.
Some common steps in data preprocessing include:
• Data Cleaning: This involves identifying and correcting
errors or inconsistencies in the data, filling missing values,
smoothing the noisy data, resolving the inconsistency, and
removing outliers.
Data Cleaning
Preprocessing
• Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics.

Some issues while adopting Data Integration

• Schema integration and object matching: The data can be present in


different formats, and attributes that might cause difficulty in data
integration.

• Removing redundant attributes from all data sources.

• Detection and resolution of data value conflicts.


Preprocessing
• Data Transformation: Once data cleaning has been done, we need to
consolidate the quality data into alternate forms by changing the value,
structure, or format of the data using the Data
Transformation strategies below.
Normalization: It is the most widely used Data Transformation
technique. The numerical attributes are scaled up or down
to fit within a specified range. Normalization can be done in multiple
ways:
• Min-max normalization
• Z-Score normalization
• Decimal scaling normalization
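The three normalization methods can be sketched in plain Python. The function names and the sample values are assumptions; the z-score version uses the population standard deviation.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


data = [200.0, 300.0, 400.0, 600.0, 1000.0]
print(min_max(data))
print(z_score(data))
print(decimal_scaling(data))
```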
Preprocessing
Data Transformation:
• Attribute Selection: New properties of data are created from existing
attributes to help in the data mining process. For example, a date of birth
attribute can be transformed into another property such as is_senior_citizen for
each tuple, which can directly influence the prediction of diseases or chances of
survival.

• Aggregation: It is a method of storing and presenting data in a summary
format. For example, sales data can be aggregated and transformed to show
monthly and yearly totals.
Preprocessing
• Data Reduction: The dataset can be too large to handle. A
possible solution is to obtain a reduced representation of the dataset that is
much smaller in volume but produces the same quality of results.
Data Reduction strategies:
• Data aggregation: It is a way of data reduction, in which the gathered data
is expressed in a summary form.
• Dimensionality reduction: These techniques are used to perform feature
extraction. Dimensionality reduction can be done using techniques like
Principal Component Analysis.
• Data compression: By using encoding technologies, the size of the data can
be significantly reduced.
Preprocessing
Data Reduction:
• Discretization: Data discretization is used to divide
continuous attributes into intervals. For example, the attribute
age can be discretized into bins like below 18, 18-44, 45-60, above
60.

• Attribute subset selection: It is very important to be specific in the
selection of attributes. Otherwise, it might lead to high-dimensional
data, which is difficult to train on due to underfitting/overfitting
problems. Only attributes that add value to model
training should be considered, and the rest can be discarded.
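Discretizing the age attribute can be sketched with the standard-library `bisect` module. The bin boundaries below are an assumption that makes the example bins non-overlapping (below 18, 18-44, 45-60, above 60).

```python
import bisect

BIN_EDGES = [18, 45, 61]   # assumed lower bounds of the 2nd, 3rd and 4th bins
BIN_LABELS = ["below 18", "18-44", "45-60", "above 60"]


def discretize_age(age):
    """Map a numeric age to the label of the interval it falls in."""
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, age)]


for age in (5, 18, 44, 45, 60, 75):
    print(age, "->", discretize_age(age))
```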
Regularization
• Regularization refers to a set of techniques that lower the
complexity of a neural network model during training and thus
prevent overfitting.
• There are three very popular and efficient regularization techniques:
L1, L2, and dropout.

• Regularization makes slight modifications to the
learning algorithm such that the model generalizes better. This in turn
improves the model’s performance on unseen data as well.
Regularization:
Different Regularization Techniques in Deep Learning
L1 and L2 are the most common types of regularization. These
update the general cost function by adding another term known as
the regularization term.
Cost function = Loss (say, binary cross entropy) + Regularization
term
• Due to the addition of this regularization term, the values of the weight
matrices decrease, on the assumption that a neural network with
smaller weight matrices leads to a simpler model. Therefore, it
also reduces overfitting to quite an extent.
• In L2, we have:
Cost function = Loss + λ·Σ w²
where λ is the regularization parameter, a hyperparameter that sets the strength of the penalty.
Regularization:
• L2 regularization is also known as weight decay as it forces the
weights to decay towards zero (but not exactly zero).
• In L1 regularization, we have:
Cost function = Loss + λ·Σ |w|

• In this, we penalize the absolute value of the weights. Unlike L2,
the weights may be reduced to zero here. Hence, it is very useful
when we are trying to compress our model. Otherwise, we
usually prefer L2 over it.
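The different behaviour of the two penalties can be seen by applying only the regularization part of the update to a single weight. The sketch assumes penalties of λ·w² and λ·|w|; the L1 step uses soft-thresholding (a proximal update), which is one common way to handle the kink at zero and is not taken from the slides.

```python
def l2_step(w, grad, lr, lam):
    """Gradient step with the L2 penalty lam * w^2 (its gradient is 2 * lam * w)."""
    return w - lr * (grad + 2.0 * lam * w)


def l1_step(w, grad, lr, lam):
    """Gradient step followed by soft-thresholding for the L1 penalty lam * |w|."""
    w = w - lr * grad
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


w_l2 = w_l1 = 0.5
for _ in range(20):
    # The data gradient is set to 0 so that only the penalty acts on the weight.
    w_l2 = l2_step(w_l2, grad=0.0, lr=0.1, lam=1.0)
    w_l1 = l1_step(w_l1, grad=0.0, lr=0.1, lam=1.0)

print("after L2 steps:", w_l2)
print("after L1 steps:", w_l1)
```

The L2 weight shrinks by a constant factor each step and stays positive, while the L1 weight loses a constant amount each step and lands exactly on zero.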
Regularization:
Dropout
• In addition to L2 and L1 regularization, another well-known and
powerful regularization technique is dropout.
• In a nutshell, dropout means that during training each neuron of the
neural network is turned off with some probability P. The figure gives
a visual example.
Regularization:
Dropout: In the figure
• Assume on the left side we have a feedforward neural network with no
dropout. Using dropout with, say, a probability of P=0.5 that a
random neuron gets turned off during training results in the neural
network on the right side.
• In this case, approximately half of the neurons are
not active and are not considered part of the neural network, so
the neural network becomes simpler.
• A simpler version of the neural network results in less complexity that
can reduce overfitting. The deactivation of neurons with a certain
probability P is applied at each forward propagation and weight update
step.
• Overfitting occurs in more complex neural network models (many layers,
many neurons).

• The complexity of the neural network can be reduced by using L1 and L2
regularization as well as dropout.

• L1 regularization can force weight parameters to become exactly zero.

• L2 regularization forces the weight parameters towards zero (but never
exactly zero).

• Smaller weight parameters make some neurons negligible → neural
network becomes less complex → reduces overfitting.

• During dropout, each neuron gets deactivated at random with
probability P → neural network becomes less complex → reduces
overfitting.
Dropout in Neural Networks
• Concept:
• Dropout is a regularization technique used during the training of
neural networks to prevent overfitting and improve generalization.
• It involves randomly "dropping out" (ignoring) some neurons during
each iteration of the training process.
• Why Dropout?
• Neural networks can become too specialized to the training data,
leading to poor performance on new, unseen data (overfitting).
• Dropout helps in training more robust networks by forcing the
network to learn redundant representations and not rely too heavily
on specific neurons.
How Dropout Works:
• Randomly Ignoring Neurons:
• During each training iteration, each neuron (or node) in the network is temporarily
"dropped out" with a certain probability (typically around 0.2 to 0.5).
• This means the neuron's output is ignored and it doesn't contribute to the forward pass
or backward pass of that iteration.
• Forcing Adaptability:
• By randomly dropping neurons, dropout forces the network to adapt and learn more
robust features that are not dependent on specific neurons being present.
• It prevents complex co-adaptations of neurons, making the network more
generalizable to new data.
• Implementation:
• Dropout is usually applied to hidden layers during training and turned off during
inference (when making predictions).
• It can be implemented easily in most neural network frameworks by including dropout
layers with a specified dropout rate after each hidden layer.
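A minimal sketch of a dropout layer in plain Python. It uses the "inverted dropout" convention, in which surviving activations are scaled by 1/(1 − P) during training so that nothing needs rescaling at inference; that convention, the function name, and the sample sizes are assumptions rather than details from the slides.

```python
import random


def dropout(activations, p_drop, training, rng):
    """Zero each activation with probability p_drop during training only."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() >= p_drop else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000

train_out = dropout(hidden, p_drop=0.5, training=True, rng=rng)
infer_out = dropout(hidden, p_drop=0.5, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
print("fraction dropped during training:", dropped_fraction)
print("mean activation during training: ", sum(train_out) / len(train_out))
print("unchanged at inference:          ", infer_out == hidden)
```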
Advantages of Dropout:
• Regularization: It helps prevent overfitting by introducing noise and
reducing the network's sensitivity to specific weights.
• Improved Generalization: Networks trained with dropout tend to
generalize better to unseen data.
• Ensemble Learning Effect: Each training iteration with dropout can
be seen as training a different sub-network, akin to ensemble learning.
Considerations:
• Training Time: Dropout may increase the training time needed to converge, since
each update trains only a random sub-network.
• Hyperparameter Tuning: The dropout rate needs to be chosen
carefully through experimentation, as too high or too low rates can
affect performance.
Dropout in Neural Networks
• In Practice:
• Dropout is widely used in various types of neural networks, including
convolutional neural networks (CNNs) for image recognition and
recurrent neural networks (RNNs) for sequence modeling.
• It's a powerful tool alongside other regularization techniques like weight
decay (L2 regularization) to improve model performance.
• In essence, dropout is a straightforward yet effective technique to
enhance the robustness and generalization capability of neural networks,
making them more reliable and accurate in real-world applications.
Thank You!
someone@example.com
