Deep Learning Module-01 Notes
Module 1
Introduction to Deep Learning
Mr. Tanveer Ahmed, Asst. Professor, Dept. of Computer Science & Engineering.
For example, in cat-and-dog classification, a deep learning model will extract features such as the eyes, face, and body shape of the animals and use them to divide the images into two classes.
A deep learning model is built from deep neural networks. A simple neural network consists of an input layer, a single hidden layer, and an output layer.
Deep learning models contain multiple hidden layers, and these additional layers generally improve the model's accuracy.
In a deep learning model, the input layer receives the raw data — such as image
pixels or text. This data is passed through multiple hidden layers, where each
layer gradually learns more refined and abstract features.
The early layers detect basic patterns, like edges or shapes, while deeper layers
focus on more complex patterns, like object parts.
Finally, the output layer uses the information extracted by the hidden layers to
make a prediction, such as classifying an image as a dog instead of a cat.
1.7.1 Shallow Neural Network: - A shallow neural network has only one hidden layer between the input and output layers.
1.7.2 Deep Neural Networks: - A deep neural network incorporates a certain level of complexity, meaning that several hidden layers are placed between the input and output layers. Deep networks are highly proficient at modeling and processing non-linear relationships.
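To make the contrast concrete, the sketch below builds both a shallow and a deep network as plain NumPy forward passes. The layer sizes, ReLU activation, and random weights are illustrative assumptions, not values from these notes:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Pass input x through a list of (W, b) layers, applying ReLU at each."""
    a = x
    for W, b in layers:
        a = relu(a @ W + b)
    return a

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """One fully connected layer: a weight matrix and a bias vector."""
    return rng.normal(size=(n_in, n_out)) * 0.1, np.zeros(n_out)

x = rng.normal(size=(1, 4))                                      # 4 input features

shallow = [dense(4, 8), dense(8, 1)]                             # one hidden layer
deep    = [dense(4, 8), dense(8, 8), dense(8, 8), dense(8, 1)]   # several hidden layers

print(forward(x, shallow), forward(x, deep))
```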
Machine Learning and Deep Learning are the two main concepts of Data Science and are subsets of Artificial Intelligence.
Many people treat machine learning, deep learning, and artificial intelligence as the same buzzwords. In actuality, these terms are different, but related to each other.
2. Perceptron: -
The Perceptron is a basic building block of an artificial neural network, introduced by Frank Rosenblatt in 1957.
A perceptron is a binary classifier that has one layer of input nodes and directly produces an output.
The inputs are multiplied by weights and the weighted sum is calculated.
A bias term b is added to the weighted sum.
The result is passed through an activation function, commonly a step function or a threshold function.
The output is binary (0 or 1) based on the result of the activation.
Types of Perceptron: -
1. Single layer perceptron
2. Multi-layer perceptron
2. Multi-layer Perceptron: -
A multi-layer perceptron has the same basic model structure but a greater number of hidden layers.
The multi-layer perceptron is commonly trained with the Backpropagation algorithm, which executes in two stages as follows:
-> Forward Stage
-> Backward Stage
I. Forward Stage: In the forward stage, activations start from the input layer and terminate at the output layer.
II. Backward Stage: It refers to the backpropagation process, where the network
adjusts its weights based on the error from the output layer. This stage helps the
perceptron to learn and improve its predictions.
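A minimal sketch of these two stages on a tiny one-hidden-layer network. The sigmoid activations, XOR data, squared-error-style output delta, and learning rate are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # inputs
y = np.array([[0.], [1.], [1.], [0.]])                  # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
lr = 0.5

for epoch in range(5000):
    # Forward stage: activations flow from the input layer to the output layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward stage: the output error is propagated back and the weights adjusted.
    d_out = (out - y) * out * (1 - out)     # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)      # error pushed back to the hidden layer
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))   # predictions move toward the XOR targets as training proceeds
```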
1. Input Layer: - This is the primary component of the Perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
2. Weights and Bias: - The weight parameter represents the strength of the connection between units and is another important Perceptron component. The bias can be thought of as the intercept term in a linear equation.
3. Activation Function: - The activation function applies a non-linear transformation to the input, making the model capable of learning and performing more complex tasks.
2. The perceptron model begins with the multiplication of all input values and their weights,
then adds these values together to create the weighted sum.
3. The weighted sum is applied to the activation function 'f' to obtain the desired output.
4. This activation function is also known as the step function and is represented by 'f'.
5. This Activation function plays a vital role in ensuring that output is mapped between
required values (0,1) or (-1,1).
Step-1: - In the first step, multiply all input values with their corresponding weight values and then add them to determine the weighted sum. Add a special term called the bias 'b' to this weighted sum to improve the model's performance. Mathematically, the result is:
∑ wᵢxᵢ + b
Step-2: - In the second step, an activation function is applied to the above weighted sum, which gives us an output either in binary form or as a continuous value, as follows:
Y = f(∑ wᵢxᵢ + b)
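As a concrete sketch of these two steps, here is a perceptron in plain Python/NumPy. The AND-gate weights and bias are illustrative values chosen by hand, not taken from the notes:

```python
import numpy as np

def perceptron(x, w, b):
    """Step 1: weighted sum; Step 2: step activation -> binary output."""
    weighted_sum = np.dot(w, x) + b        # sum(w_i * x_i) + b
    return 1 if weighted_sum >= 0 else 0   # step function f

# Illustrative weights that realize a logical AND of two binary inputs.
w, b = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```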
3. Multilayer Perceptron: -
1. Multi-Layer Perceptron (MLP) is an artificial neural network widely used for solving
classification and regression tasks.
2. A Multi-Layer Perceptron is a type of feedforward neural network with multiple neurons arranged in layers.
3. MLP consists of fully connected dense layers that transform input data from one dimension
to another. It is called “multi-layer” because it contains an input layer, one or more hidden
layers, and an output layer.
4. The purpose of an MLP is to model complex relationships between inputs and outputs,
making it a powerful tool for various machine learning tasks.
5. In the pictorial representation of an MLP, every connection reflects the fully connected nature of the network.
6. This means that every node in one layer connects to every node in the next layer. The data moves through the network, and each layer transforms it until the final output is generated in the output layer.
Input Layer: Each neuron or node in this layer corresponds to an input feature. For
instance, if you have three input features, the input layer will have three neurons.
Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received from
the input layer.
Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.
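A short sketch of this layer bookkeeping: each pair of consecutive layers is joined by one weight matrix, so every node feeds every node in the next layer. The hidden-layer sizes and the tanh activation are illustrative assumptions:

```python
import numpy as np

layer_sizes = [3, 5, 4, 2]   # input (3 features), two hidden layers, 2 outputs
rng = np.random.default_rng(0)

# One weight matrix and one bias vector per pair of consecutive layers:
params = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

a = rng.normal(size=(1, 3))              # one sample with three input features
for W, b in params:
    a = np.tanh(a @ W + b)               # fully connected transform + activation
    print("layer output shape:", a.shape)
```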
4. Activation Functions: -
The activation function is the non-linear transformation that we apply to the input signals of hidden neurons.
This transformed output is then sent to the next layer of neurons as input.
Linear Activation Function: -
Equation: f(x) = x
Range: (-infinity to infinity)
Note: - The output of the function is not confined to any range.
Non-linear activation functions are the most used activation functions.
They make it easier for the model to generalize or adapt to a variety of data and to differentiate between outputs.
The non-linear activation functions are mainly divided on the basis of their range or curves.
Non-linear activation functions allow backpropagation because their derivative is related to the input, making it possible to go back and understand which weights in the input neurons can provide a better prediction.
Non-linear activation functions allow the stacking of multiple layers of neurons, since the output is then a non-linear combination of the input passed through multiple layers.
Any output can be represented as a functional computation in a neural network.
Sigmoid
Tanh
ReLU
Leaky ReLU
Parametric ReLU
ELU
Softmax
Swish
GELU
SELU
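For reference, most of the functions listed above are one-liners. Below is a minimal NumPy sketch of several of them; Swish, GELU, and SELU follow the same pattern and are omitted for brevity:

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x >= 0, x, a * x)
def parametric_relu(x, a): return np.where(x >= 0, x, a * x)  # slope 'a' is learned
def elu(x, a=1.0):         return np.where(x >= 0, x, a * (np.exp(x) - 1))

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(x))  # small negative slope instead of a hard zero
print(softmax(x))     # entries are positive and sum to 1
```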
Both the tanh and logistic sigmoid activation functions are used in feed-forward networks.
The output of tanh is centered around zero, meaning the mean of the data it passes forward is zero.
Training of a neural network converges faster if the inputs to the neurons in each layer have a mean of zero and a variance of one and are decorrelated.
Since the input to each layer comes from the previous layer, it is important that the outputs of the previous layers (the inputs to the next layers) are centered around zero.
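A quick numeric check of this zero-centering claim, assuming a zero-mean input (an illustrative experiment, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)             # zero-mean input to a layer

print(np.tanh(x).mean())                 # ~0:   tanh keeps activations centered
print((1 / (1 + np.exp(-x))).mean())     # ~0.5: sigmoid shifts them off-center
```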
ReLU is the most used activation function, since it appears in almost all convolutional neural networks.
ReLU is half rectified (from the bottom): R(z) is zero when z is less than zero, and R(z) is equal to z when z is greater than or equal to zero.
Range: [0 to infinity)
Any negative input given to the ReLU activation function turns into zero immediately, which affects the resulting graph by not mapping the negative values appropriately.
Sigmoid functions and their combinations generally work better in the case of
classifiers.
Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient
problem.
ReLU is a general-purpose activation function and is used in most cases. ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations and activates only a few neurons.
If we encounter dead neurons in our network, the Leaky ReLU function is the best choice.
Always keep in mind that the ReLU function should only be used in the hidden layers. At present, ReLU works most of the time as a general approximator.
Variants of ReLU: Leaky ReLU, Parametric ReLU, and the Exponential Linear Unit (ELU).
5. Loss Functions: -
A loss function is a mathematical function that measures the difference between the
predicted output of a model and the actual target values.
It provides a measure of how well a model is performing, guiding the optimization
process by adjusting model parameters to minimize this difference.
1. Mean Squared Error (MSE) Loss: -
The Mean Squared Error (MSE) Loss is one of the most widely used loss functions for regression tasks.
It calculates the average of the squared differences between the predicted values and the actual values.
MSE = (1/n) ∑ (yᵢ − ȳᵢ)²
Where:
n is the number of samples in the dataset
yᵢ is the predicted value for the i-th sample
ȳᵢ is the target value for the i-th sample
Advantage: -
Simple to compute and understand.
Differentiable, making it suitable for gradient-based optimization algorithms.
Disadvantage: -
Sensitive to outliers because the errors are squared, which can disproportionately
affect the loss.
2. Mean Absolute Error (MAE) Loss: -
The Mean Absolute Error (MAE) Loss calculates the average of the absolute differences between the predicted values and the actual values.
MAE = (1/n) ∑ |yᵢ − ȳᵢ|
Where:
n is the number of samples in the dataset
yᵢ is the predicted value for the i-th sample
ȳᵢ is the target value for the i-th sample
Advantage: -
Less sensitive to outliers compared to MSE.
Simple to compute and interpret.
Disadvantage: -
Not differentiable at zero, which can pose issues for some optimization algorithms.
3. Huber Loss: -
Huber Loss combines the advantages of MSE and MAE. It is less sensitive to outliers than MSE.
Huber(y, ŷ) = (1/2)(y − ŷ)²      if |y − ŷ| ≤ δ
Huber(y, ŷ) = δ|y − ŷ| − (1/2)δ²  otherwise
Where: -
n: The number of data points.
y: The actual value (true value) of the data point.
ŷ: The predicted value returned by the model.
δ: Defines the point where the Huber loss transitions from quadratic to linear.
Advantage: -
Robust to outliers, providing a balance between MSE and MAE.
Differentiable, facilitating gradient-based optimization.
Disadvantage: -
Requires tuning of the parameter δ.
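A minimal NumPy sketch of the three regression losses discussed above, using a small made-up dataset with one outlier to show how differently they react:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quad = 0.5 * err ** 2                         # used while |err| <= delta
    lin  = delta * np.abs(err) - 0.5 * delta**2   # used once |err| > delta
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

y_true = np.array([3.0, 5.0, 2.5, 100.0])   # the last point is an outlier
y_pred = np.array([2.5, 5.0, 3.0, 7.0])
print(mse(y_true, y_pred))    # blown up by the squared outlier
print(mae(y_true, y_pred))    # grows only linearly with the outlier
print(huber(y_true, y_pred))  # quadratic near zero, linear for large errors
```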
4. Binary Cross-Entropy Loss: -
BCE = −(1/n) ∑ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]
where n is the number of data points, yᵢ is the actual binary label (0 or 1), and ŷᵢ is the predicted probability.
Advantage: -
Suitable for binary classification.
Differentiable, making it useful for gradient-based optimization.
Disadvantage: -
It can be sensitive to imbalanced datasets.
5. Categorical Cross-Entropy Loss: -
CCE = −(1/n) ∑ᵢ ∑ⱼ yᵢⱼ log(ŷᵢⱼ)
where n is the number of data points, k is the number of classes, yᵢⱼ is the binary indicator (0 or 1) of whether class label j is the correct classification for data point i, and ŷᵢⱼ is the predicted probability for class j.
Advantage: -
Suitable for multiclass classification.
Differentiable and widely used in neural networks.
Disadvantage: -
Not suitable for sparse targets.
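A minimal NumPy sketch of the two cross-entropy losses, clipping the probabilities to avoid log(0); the example labels and probabilities are made up for illustration:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    p = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    # y_true: one-hot matrix (n, k); y_prob: predicted probabilities (n, k)
    p = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(
    np.array([[1, 0, 0], [0, 1, 0]]),
    np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])))
```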
6. Gradient Descent: -
Gradient Descent is an optimization algorithm used to minimize the loss function
in deep learning by updating the model's parameters iteratively.
This entire procedure is known as gradient descent, which is also called steepest descent. The main objective of the gradient descent algorithm is to minimize the cost function through iteration.
The goal of the gradient descent algorithm is to minimize the given function. To achieve this goal, it performs two steps iteratively:
Compute the first-order derivative of the function to obtain the gradient or slope of that function.
Move away from the direction of the gradient, i.e., opposite to the direction in which the slope increases, stepping from the current point by alpha times the gradient, where alpha is the learning rate. It is a tuning parameter in the optimization process that helps decide the length of the steps.
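A minimal sketch of these two iterative steps on a toy one-parameter cost function; the function, starting point, and learning rate are illustrative choices:

```python
import numpy as np

def f(w):    return (w - 3.0) ** 2 + 1.0   # toy cost function, minimum at w = 3
def grad(w): return 2.0 * (w - 3.0)        # step 1: its first-order derivative

w, alpha = 0.0, 0.1                        # starting point and learning rate
for step in range(50):
    w = w - alpha * grad(w)                # step 2: move opposite to the gradient

print(w, f(w))   # w has converged close to 3, the minimizer of f
```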
Feedforward Neural Networks: -
A feedforward neural network is one of the simplest types of artificial neural networks, in which the connections between the nodes do not form cycles.
The network consists of an input layer, one or more hidden layers, and an output layer.
In this network, information moves in only one direction: forward, from the input nodes, through the hidden nodes, to the output nodes.
The architecture of a feedforward neural network consists of three types of layers: the
input layer, hidden layers, and the output layer. Each layer is made up of units known
as neurons, and the layers are interconnected by weights.
Input Layer: This layer consists of neurons that receive inputs and pass them on to
the next layer. The number of neurons in the input layer is determined by the
dimensions of the input data.
Hidden Layers: These layers are not exposed to the input or output and can be
considered as the computational engine of the neural network. Each hidden layer's
neurons take the weighted sum of the outputs from the previous layer, apply
an activation function, and pass the result to the next layer. The network can have zero
or more hidden layers.
Output Layer: The final layer that produces the output for the given inputs. The
number of neurons in the output layer depends on the number of possible outputs the
network is designed to produce.
Applications of feedforward neural networks include: -
Pattern recognition
Classification tasks
Regression analysis
Image recognition
Time series prediction
Network Structure:
Input Layer: 2 neurons
Hidden Layer: 2 neurons (ReLU activation)
Output Layer: 1 neuron (Sigmoid activation)
Weights and Biases:
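The specific weight and bias values appear in a figure that is not reproduced in this text, so the sketch below substitutes hypothetical values just to show the forward computation through this 2-2-1 network:

```python
import numpy as np

def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values standing in for the figure's weights and biases.
W1 = np.array([[0.5, -0.6],
               [0.3,  0.8]])      # input (2 neurons) -> hidden (2 neurons)
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.7], [-0.4]])    # hidden (2 neurons) -> output (1 neuron)
b2 = np.array([0.2])

x = np.array([1.0, 0.5])          # one input sample with two features
h = relu(x @ W1 + b1)             # hidden layer with ReLU activation
y = sigmoid(h @ W2 + b2)          # output layer with sigmoid activation
print(h, y)
```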
9. Hyperparameters: -
Dropout: -
Deep learning neural networks are likely to quickly overfit a training dataset with few examples.
A larger/deeper network is also likely to overfit and hence generalize poorly.
Dropout is a regularization method used to prevent model overfitting.
It simulates a large number of different network architectures from a single model by randomly dropping out a few neurons from each layer during each training iteration.
It is a very computationally cheap and remarkably effective regularization method to
reduce overfitting and improve generalization error in deep neural networks of all
kinds.
It can be used with most types of layers, such as dense fully connected layers,
convolutional layers, and recurrent layers such as the long short-term memory network
layer.
Dropout may be implemented on any or all hidden layers in the network as well as the
visible or input layer. It is not used on the output layer.
The term “dropout” refers to dropping out units (hidden and visible) in a neural
network.
Dropout is not used after training, when making predictions with the fitted network.
The dropout hyperparameter specifies the probability at which outputs of the layer are dropped out (inversely, the probability at which inputs to the layer are retained).
A small dropout value, dropping 20%-50% of neurons, is generally used.
A common value is a probability of 0.5 for retaining the output of each node in a hidden layer (a dropout of 0.5) and a value close to 1.0, such as 0.8, for retaining inputs from the visible layer (a dropout of 0.2).
Because of dropout, the weights of the network will be larger than normal. Hence, the weights are scaled down using the chosen dropout rate.
The network can then be used as normal to make predictions.
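A minimal sketch of dropout applied to a layer's outputs. It uses the "inverted dropout" convention common in modern implementations, which rescales at training time so that no weight scaling is needed afterwards (a variation on the post-training scaling described above); the retention logic and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training=True):
    """Inverted dropout: randomly zero a fraction `rate` of the units.

    Scaling the survivors by 1/(1 - rate) during training means the network
    can later be used as-is, with no rescaling, to make predictions.
    """
    if not training or rate == 0.0:
        return activations                 # dropout is switched off at inference
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

h = np.ones((1, 10))                       # stand-in for hidden-layer outputs
print(dropout(h, rate=0.5))                # roughly half the units zeroed, rest scaled up
```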
Causes of Overfitting: -
1. Insufficient Training Data: -
When the dataset is small, the model might memorize specific patterns instead of learning general features.
2. Excessive Model Complexity: -
As noted above, a larger or deeper network can memorize the training data rather than learn general patterns.
3. Lack of Regularization: -
Without regularization techniques, the model becomes prone to overfitting.
What is a gradient?
A gradient is a measure of how much the output variable changes for a small change in the input.
The gradient is used to update/learn the model parameters: the weights and biases.
The parameter update rule is:
Wnew = Wold − α · (∂L/∂W)
If the derivative term in the above equation is too small, there will be a very small change in W.
Hence the new and old weights are almost the same: no learning takes place.
The weights of the initial layers would then continue to remain unchanged (or only change by a negligible amount), no matter how many epochs you run with the backpropagation algorithm.
As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.
Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1.
Therefore, a large change in the input of the sigmoid function causes only a small change in the output. Hence, the derivative becomes small.
When the inputs of the sigmoid function become larger or smaller (when |x| becomes bigger), the derivative becomes close to zero. This is the vanishing gradient problem.
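A quick numeric illustration of why this happens: the sigmoid derivative peaks at 0.25, and backpropagation multiplies roughly one such factor per layer, so the gradient reaching the early layers shrinks geometrically with depth (the numbers below are illustrative):

```python
import numpy as np

def sigmoid(x):       return 1.0 / (1.0 + np.exp(-x))
def sigmoid_deriv(x): return sigmoid(x) * (1.0 - sigmoid(x))

print(sigmoid_deriv(0.0))   # 0.25   -- the derivative's maximum value
print(sigmoid_deriv(6.0))   # ~0.002 -- nearly zero once |x| is large

# The gradient reaching the early layers shrinks roughly like 0.25 ** depth:
for depth in (2, 5, 10):
    print(depth, 0.25 ** depth)
```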
Ways to detect whether your deep network is suffering from the vanishing
gradient problem: -
The model will improve very slowly during the training phase and it is also
possible that training stops very early, meaning that any further training does
not improve the model.
The weights closer to the output layer of the model would witness more of a
change whereas the layers that occur closer to the input layer would not change
much (if at all).
Model weights shrink exponentially and become very small when training the
model.
The model weights become 0 in the training phase.