Deep Learning Module-01 Notes

Module 1
Introduction to Deep Learning

1. Fundamentals of Deep Learning: -

1.1 Definition of Deep Learning: -


 Deep Learning is a subset of Machine Learning that uses artificial neural
networks with multiple hidden layers to automatically learn hierarchical
representations of data.
 It is inspired by the structure and functioning of the human brain's neural
networks.
 These networks consist of layers of interconnected nodes that process
information.
 The more layers, the "deeper" the network, allowing it to learn more complex
features and perform more sophisticated tasks.


1.2 Why Deep learning?


 When the volume of data increases, machine learning techniques become
inefficient in terms of performance and accuracy no matter how well they are
optimized, whereas deep learning performs much better in such cases.

1.3 How Deep Learning Differs from Traditional Machine Learning?


 While machine learning has significantly advanced the field of artificial
intelligence, deep learning takes it even further by automating many tasks that
traditionally required human intervention, especially feature extraction.
 Deep learning is a specialized branch of machine learning that uses neural
networks with three or more layers, often called deep neural networks.
 These networks are designed to mimic the structure and functioning of the
human brain—although they are still far from replicating its full complexity.
 Deep learning’s strength lies in its ability to automatically discover complex
patterns and features directly from raw data, making it especially powerful
when working with large-scale datasets like images, speech, and text.

1.4 How Deep Learning Works?


 Deep learning uses feature extraction to recognize similar features of the same
label and then uses decision boundaries to determine which features
accurately represent each label.


 For example, in the cats and dogs classification, a deep learning model will
extract information such as the eyes, face, and body shape of the animals and
divide them into two classes.
 A deep learning model consists of deep neural networks. A simple neural
network consists of an input layer, a hidden layer, and an output layer.
 Deep learning models consist of multiple hidden layers, and each additional
layer refines the features the model learns, improving its accuracy.

 In a deep learning model, the input layer receives the raw data — such as image
pixels or text. This data is passed through multiple hidden layers, where each
layer gradually learns more refined and abstract features.
 The early layers detect basic patterns, like edges or shapes, while deeper layers
focus on more complex patterns, like object parts.
 Finally, the output layer uses the information extracted by the hidden layers to
make a prediction, such as classifying an image as a dog instead of a cat.

1.5 Deep Neural Network (DNN):-


 A Deep Neural Network (DNN) is a type of artificial neural network (ANN) that
consists of multiple layers between the input and output layers.
 It is designed to model complex patterns and relationships in data by learning
hierarchical representations.


1.5.1 Structure of a Deep Neural Network: -


A typical DNN consists of the following layers: -
 Input Layer: It accepts raw data (e.g., images, text, numerical values).
 Hidden Layers: These layers perform computations and transformations using
artificial neurons, also called nodes. The more hidden layers, the
"deeper" the network.
 Output Layer: It produces the final result (e.g., classification, regression).
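
To make this layer structure concrete, here is a minimal NumPy sketch of a forward pass through such a network; the layer sizes, random weights, and the choice of ReLU hidden activations with a sigmoid output are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Input layer: 4 raw features; two hidden layers; one output neuron.
x = rng.normal(size=4)

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # hidden layer 1
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)   # hidden layer 2
W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)   # output layer

h1 = relu(W1 @ x + b1)        # early layer: basic patterns
h2 = relu(W2 @ h1 + b2)       # deeper layer: more abstract features
y = sigmoid(W3 @ h2 + b3)     # output: probability of the positive class
print(y)
```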

1.6 Deep Learning Applications: -


Deep learning can be used in a wide variety of applications, including:-

1. Image recognition: To identify objects and features in images, such as people,
animals, places, etc.
2. Natural language processing: To help understand the meaning of text, such as
in customer service chatbots and spam filters.
3. Finance: To help analyze financial data and make predictions about market
trends.
4. Text to image: To convert text descriptions into images, as in text-to-image
generators such as DALL·E or Stable Diffusion.
5. Fraud detection: Deep learning algorithms can identify security issues to help
protect against fraud.


6. Customer Service: Deep learning powers the chatbots and virtual assistants you
interact with online and on your smartphone, and it allows these systems to
learn over time and improve their responses.
7. Facial Recognition: An area of deep learning known as computer vision allows
deep learning algorithms to recognize specific features in pictures and videos.
8. Self-driving vehicles: Autonomous vehicles use deep learning to learn how to
operate and handle different situations while driving, and it allows vehicles to
detect traffic lights, recognize signs, and avoid pedestrians.
9. Health Care: Deep learning in the health care industry serves multiple purposes,
such as medical image analysis and helping doctors diagnose patients by
detecting cancer cells.
10. Predictive analytics: - Deep learning models can analyze large amounts of
historical information to make accurate predictions about the future.
Predictive analytics helps businesses in several aspects, including forecasting
revenue, product development, decision-making, and manufacturing.

1.7 Architecture of Deep Learning: -

1.7.1 Shallow neural network: - The Shallow neural network has only one hidden layer
between the input and output as shown in the below figure.


1.7.2 Deep Neural Networks: - A deep neural network incorporates a certain level of
complexity, meaning several hidden layers are encompassed between the input and
output layers. Such networks are highly proficient at modeling and processing
non-linear associations.

1.8 Machine Learning Vs Deep Learning: -

 Machine Learning and Deep Learning are the two main concepts of Data
Science and the subsets of Artificial Intelligence.
 Many people think of machine learning, deep learning, and artificial
intelligence as the same buzzwords. But in actuality, all these terms are
different yet related to each other.


1.9 How Machine Learning Works?


The working of machine learning models can be understood by the example of
identifying the image of a cat or dog. To identify this, the ML model takes images of
both cat and dog as input, extracts the different features of images such as shape,
height, nose, eyes, etc., applies the classification algorithm, and predicts the output.
Machine Learning Process:
 Input → Feature Extraction → Classification → Output

1.10 How Deep Learning Works?


The working of deep learning models can be understood with the same
example of identifying cat vs. dog. The deep learning model takes the images as the
input and feeds them directly to the algorithms without requiring any manual feature
extraction step. The images pass through the different layers of the artificial neural
network and predict the final output.
Deep Learning Process:
 Input → Feature Extraction + Classification → Output


Machine Learning V/s Deep Learning

1. Machine learning uses statistical algorithms to learn the hidden patterns and
relationships in the dataset; deep learning uses artificial neural network
architectures to learn them.
2. Machine learning can work on a smaller amount of data; deep learning requires a
larger volume of data.
3. Machine learning is better for low-label tasks; deep learning is better for complex
tasks like image processing, natural language processing, etc.
4. Machine learning takes less time to train a model; deep learning takes more time.
5. In machine learning, a model is created from relevant features that are manually
extracted from images to detect an object; in deep learning, relevant features are
extracted automatically in an end-to-end learning process.
6. Machine learning models are less complex and their results are easy to interpret;
deep learning models are more complex, work like a black box, and their results are
not easy to interpret.
7. Machine learning can work on a CPU or requires less computing power compared
to deep learning; deep learning requires a high-performance computer with a GPU.

1.11 Some DL Architectures: -


1.12 Designing a Neural Network: -


Designing a neural network involves several key steps, including selecting the
architecture, defining hyperparameters, training the model, and evaluating its
performance. Below is a structured approach to designing an effective neural network;
a code sketch of these steps follows the list.

 Define the Problem


 Choose the Neural Network Architecture
 Select Activation Functions
 Configure the Loss Function & Optimization Algorithm
 Set Hyperparameters
 Train the Neural Network
 Evaluate the Model
 Fine-Tune the Model
Note: - Movement of information in a Neural Network happens in two stages: -
1) Feedforward propagation
2) Backpropagation
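
As a rough illustration of these design steps, the following Keras sketch walks through them end to end; the synthetic dataset, layer sizes, optimizer, and every hyperparameter here are assumed values chosen only for demonstration.

```python
import numpy as np
from tensorflow import keras

# 1. Define the problem: binary classification on 20 input features (assumed).
X = np.random.rand(500, 20)
y = (X.sum(axis=1) > 10).astype(int)

# 2-3. Choose the architecture and the activation functions.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# 4-5. Configure the loss function, optimizer, and hyperparameters.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

# 6. Train (feedforward propagation and backpropagation happen inside fit()).
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)

# 7. Evaluate; 8. fine-tune by revisiting steps 2-5 based on these results.
loss, acc = model.evaluate(X, y, verbose=0)
print(f"loss={loss:.3f} accuracy={acc:.3f}")
```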


2. Perceptron: -
 The Perceptron is a basic building block of an Artificial neural network,
introduced by Frank Rosenblatt in 1957.
 A perceptron is a binary classifier that has one layer of input nodes and it directly
produces output.
 The inputs are multiplied by weights and the weighted sum is calculated.
 A bias term b is added to the weighted sum.
 The result is passed through an activation function commonly a step function or
a threshold function.
 The output is binary(0 or 1) based on the result of the activation.

Types of Perceptron: -
1. Single layer perceptron
2. Multi-layer perceptron

1. Single layer Perceptron: -


 A single-layered perceptron model consists of a feed-forward network and also
includes a threshold transfer function inside the model.
 The main objective of the single-layer perceptron model is to analyze
linearly separable objects with binary outcomes.

2. Multi-layer Perceptron: -
 A multi-layer perceptron model has the same basic structure but with a
greater number of hidden layers.
 The multi-layer perceptron is trained with the Backpropagation
algorithm, which executes in two stages as follows:
-> Forward Stage
-> Backward Stage

I. Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
II. Backward Stage: It refers to the backpropagation process, where the network
adjusts its weights based on the error from the output layer. This stage helps the
perceptron to learn and improve its predictions.


2.1 Basic Components of Perceptron: -


Mr. Frank Rosenblatt invented the perceptron model as a binary classifier
which contains three main components. These are as follows:

1. Input Layer: - This is the primary component of Perceptron which accepts the initial data
into the system for further processing. Each input node contains a real numerical value.

2. Weights and Bias: - The weight parameter represents the strength of the connection
between units and is one of the most important parameters of the perceptron. Bias can
be considered as the intercept in a linear equation.

3. Activation Function: - The activation function applies a non-linear transformation to
the input, making the model capable of learning and performing more complex tasks.

2.2 How Perceptron works: -

1. The perceptron is considered a single-layer neural network that consists of four main
parameters: input values (input nodes), weights and bias, net sum, and an activation
function.

2. The perceptron model begins with the multiplication of all input values and their weights,
then adds these values together to create the weighted sum.

3. The weighted sum is applied to the activation function 'f' to obtain the desired output.


4. This activation function is also known as the step function and is represented by 'f'.

5. This Activation function plays a vital role in ensuring that output is mapped between
required values (0,1) or (-1,1).

2.3 Perceptron model works in two important steps as follows:-

Step-1: - In the first step, multiply all input values with their corresponding weight
values and then add the products to determine the weighted sum. Mathematically, we
can calculate the weighted sum as follows:

∑wi*xi = x1*w1 + x2*w2 + … + xn*wn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.

∑wi*xi + b

Step-2: - In the second step, an activation function is applied with the above-mentioned
weighted sum, which gives us output either in binary form or a continuous value as follows:

Y = f (∑wi*xi + b)
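
A small numeric sketch of these two steps; all input, weight, bias, and threshold values below are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])      # input values x1..x3 (assumed)
w = np.array([0.5, -0.6, 0.4])     # weights w1..w3 (assumed)
b = 0.1                            # bias term (assumed)

# Step 1: weighted sum = sum(wi*xi) + b = 0.5 + 0.0 + 0.4 + 0.1 = 1.0
weighted_sum = np.dot(w, x) + b

# Step 2: threshold (step) activation function f
def step(z, threshold=0.0):
    return 1 if z > threshold else 0

y = step(weighted_sum)             # Y = f(sum(wi*xi) + b) -> 1
print(weighted_sum, y)
```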


3. Multilayer Perceptron: -
1. Multi-Layer Perceptron (MLP) is an artificial neural network widely used for solving
classification and regression tasks.

2. A Multi-Layer Perceptron is a type of feed-forward neural network with multiple
neurons arranged in layers.

3. MLP consists of fully connected dense layers that transform input data from one dimension
to another. It is called “multi-layer” because it contains an input layer, one or more hidden
layers, and an output layer.

4. The purpose of an MLP is to model complex relationships between inputs and outputs,
making it a powerful tool for various machine learning tasks.

5. In a pictorial representation of an MLP, every connection in the diagram reflects the
fully connected nature of the network.

6. This means that every node in one layer connects to every node in the next layer. The
data moves through the network, with each layer transforming it, until the final output
is generated in the output layer.

3.1 Key Components of Multi-Layer Perceptron (MLP):-

 Input Layer: Each neuron or node in this layer corresponds to an input feature. For
instance, if you have three input features, the input layer will have three neurons.


 Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received from
the input layer.

 Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.

3.2 Multi-Layer Perceptron Learning Algorithm: -
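
The algorithm itself is given as figures in the original notes; as a hedged stand-in, the NumPy sketch below runs the standard MLP learning loop (forward stage, backward stage, weight updates) on the XOR problem. The layer sizes, random seed, and learning rate are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR data: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # hidden layer
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))   # output layer
eta = 0.5                                            # learning rate (assumed)

for epoch in range(10000):
    # Forward stage: input layer -> hidden layer -> output layer.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward stage: propagate the output error back through the layers.
    d_out = (y_hat - y) * y_hat * (1 - y_hat)        # output-layer delta
    d_hid = (d_out @ W2.T) * h * (1 - h)             # hidden-layer delta

    # Update weights and biases against the gradient.
    W2 -= eta * h.T @ d_out
    b2 -= eta * d_out.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ d_hid
    b1 -= eta * d_hid.sum(axis=0, keepdims=True)

print(np.round(y_hat).ravel())   # typically [0. 1. 1. 0.] after training
```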


Solved - Example Problem-01:-


Solved - Example Problem-02:-


4. Optimizing Perceptrons using Activation Functions: -

4.1 Activation Function:-

 An Activation function decides whether a neuron should be activated or not.


 It decides whether the input that the neuron is receiving is relevant to the given
prediction or whether it should be ignored.
 The input to the activation function is the weighted sum of the neuron's inputs
plus the bias: ∑wi*xi + b.

 The activation function is the non-linear transformation that we do over the input
signals of hidden neurons.
 This transformed output is then sent to the next layer of neurons as input.


 A neural network without an activation function is essentially just a linear


regression model.
 The activation function does the non-linear transformation to the input, making
the network capable of learning and performing more complex tasks. This is
applied to the hidden neurons.

4.2 Need for Activation Function: -


4.3 Types of Activation function with Neural Network: -

The Activation Functions can be basically divided into the following types: -


1. Binary Step Function
2. Linear Activation Function
3. Non-Linear Activation Function

4.3.1. Binary Step Function:-

 A binary step function is a threshold-based activation function.


 It uses a threshold to decide whether a neuron should be activated or not
 If the input to the activation function (Y) is above (or below) a certain threshold,
the neuron is activated and sends exactly the same signal to the next layer.
 Otherwise, the neuron is not activated. I.e., signal is not passed to the next layer.

Activation function: f(Y) = "activated" if Y > threshold, else "not activated".

Alternatively: f(Y) = 1 if Y > threshold, 0 otherwise.

 Disadvantages of Binary Step Functions: -


 They don't provide multi-value outputs, so they are not suitable for multi-class
classification.
 The gradient of the step function is zero, which introduces problems in
the backpropagation process.


4.3.2. Linear Activation Function:-

 The linear activation function is also known as 'no activation' or the identity
function.
 In the linear activation function, the dependent variable has a direct, proportional
relationship with the independent variable: the output is proportional to the
input.
 It doesn't help the network model the complexity of the typical data that is
fed to neural networks.

Equation : f(x) = x
Range : (-infinity to infinity)

Note: - The output of the functions will not be confined between any range.

 Disadvantages of Linear Activation Function:-

 The gradient of the function doesn't involve the input (x).
 Hence it is difficult during backpropagation to identify the neurons
whose weights have to be adjusted.
 The neuron passes the signal as it is to the next layer.
 The last layer will be a linear function of the first layer.
 This linear activation function is generally used by the neurons in the
input layer of a NN.


4.3.3. Non -Linear Activation Function: -

 The Nonlinear Activation Functions are the most used activation functions.
 It makes it easy for the model to generalize or adapt to a variety of data and to
differentiate between the outputs.
 The Non-linear Activation Functions are mainly divided on the basis of their range or
curves.
 Non-linear Activation Functions allow backpropagation because the derivative
function would be related to the input, and it is possible to go back and understand
which weights in the input neurons can provide a better prediction.
 Non-linear Activation Functions allow the stacking of multiple layers of neurons as
the output would be a non-linear combination of input passed through multiple layers.
 Any output can be represented as a functional computation in a neural network.

Non- linear Neural Networks Activation Functions:-

 Sigmod
 Tanh
 ReLU
 Leaky ReLU
 Parametric ReLU


 ELU
 Softmax
 Swish
 GELU
 SELU

 Advantages of Non-linear Activation Functions: -

 The gradient of the function involves the input 'x'.
 Hence it is easy to understand which weights of the input neurons have
to be adjusted during backpropagation to give a better prediction.

4.3.3.1 Sigmoid or Logistic Activation Function: -

Input : a real number
Output : a number between 0 and 1

The main reason we use the sigmoid function is that its output exists between (0 to 1).
Therefore, it is especially used for models where we have to predict a probability as the
output. Since the probability of anything exists only between the range of 0 and 1,
sigmoid is the right choice.
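
A minimal sketch of sigmoid turning raw output scores into probabilities; the score values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

scores = np.array([-2.0, 0.0, 3.5])   # raw weighted sums from an output neuron
probs = sigmoid(scores)
print(probs)                           # ~[0.119, 0.5, 0.971], each in (0, 1)
```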


 Disadvantages of Sigmoid Activation Function: -

 The gradient of the function has a significant value only for inputs between -3 and 3.
 For inputs outside this range, the gradient is small and eventually becomes zero.
 The network stops learning and suffers from the vanishing gradient problem.

4.3.3.2 Tanh or hyperbolic tangent Activation Function: -

 The output range of the tanh function is (-1 to 1). tanh is also sigmoidal (s-shaped).
 Tanh is zero centered.
 Negative inputs are mapped strongly negative, positive inputs are mapped strongly
positive, and zero inputs are mapped near zero.
(Fig: tanh v/s Logistic Sigmoid)
 Both tanh and logistic sigmoid activation functions are used in feed-forward networks.


 Disadvantages of Tanh Activation Function: -

• The gradient is very steep but eventually becomes zero.
• The network stops learning and suffers from the vanishing gradient problem.
• But tanh is zero centered and the gradients move in all directions; hence the tanh
non-linearity is preferred over sigmoid.

 Comparison of Sigmoid and Tanh Activation Functions:-

For integers between -6 to +6 (reconstructed numerically in the sketch after this list):


 Data is centered around zero for tanh, meaning the mean of the input data is zero.
 Training of the neural network converges faster if the inputs to the neurons in each layer
have a mean of zero and a variance of one, and are decorrelated.
 Since the input to each layer comes from the previous layer, it is important that the outputs
of the previous layer (inputs to the next layer) are centered around zero.
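
The comparison table referenced above appears as a figure in the original notes; the following sketch reconstructs it numerically for integers between -6 and +6, showing that tanh outputs are centered around zero while sigmoid outputs are not.

```python
import numpy as np

xs = np.arange(-6, 7)
sig = 1 / (1 + np.exp(-xs))
tanh = np.tanh(xs)

for x, s, t in zip(xs, sig, tanh):
    print(f"x={x:+d}  sigmoid={s:.4f}  tanh={t:+.4f}")

# tanh values over this symmetric range average to ~0; sigmoid averages to ~0.5.
print("sigmoid mean:", sig.mean().round(3), " tanh mean:", tanh.mean().round(3))
```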

4.3.3.3 ReLU (Rectified Linear Unit) Activation Function: -


 The ReLU is the most used activation function, since it is used in almost all
convolutional neural networks.
 The ReLU is half rectified (from the bottom): R(z) is zero when z is less than zero, and
R(z) is equal to z when z is above or equal to zero.
 Range: [0 to infinity)
 Any negative input given to the ReLU activation function turns into zero
immediately, which affects the resulting model by not mapping negative values
appropriately.

 Disadvantages of ReLU Activation Function:-

 For negative inputs, the gradient is zero.


 Hence during backpropagation, the weights and bias of some neurons are not
updated.
 This creates dead neurons, which never get activated.
 This is known as "Dying ReLU problem"
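
A short sketch of ReLU and its gradient, showing why negative inputs produce the dead-neuron effect; the input values are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.  0.  0.  1.5]
print(relu_grad(z))   # [0. 0. 0. 1.] -> no updates flow through negative inputs
```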

4.3.3.4 Leaky ReLU/Parametric ReLu Activation Function:-

 It is an attempt to solve the dying ReLU problem.


 The gradient has a slope for negative inputs.


 The leak helps to increase the range of the ReLU function.
 Usually, the value of α is 0.01 (Leaky ReLU).
 When α is not 0.01, it is called Randomized/Parametric ReLU: f(x) = max(αx, x).
 Therefore the range of the Leaky ReLU is (-infinity to infinity).
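
A minimal Leaky ReLU sketch; α = 0.01 here is the conventional leak value, assumed for illustration.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Small non-zero slope for negative inputs avoids dead neurons.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 1.5])
print(leaky_relu(z))   # [-0.02  -0.005  1.5]
```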

 Advantages and Disadvantages of Leaky ReLU Activation Function:-


 For negative inputs, the gradient is a non-zero value.
 Hence during backpropagation, the weights and bias of all neurons are updated. No
dead neurons.
 The predictions made for negative inputs are not consistent.
 Since the gradient is a very small value for negative inputs, learning of model
parameters is time consuming.

 Sigmoid functions and their combinations generally work better in the case of
classifiers.
 Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient
problem.


 ReLU function is a general activation function and is used in most cases. ReLU
is less computationally expensive than tanh and sigmoid because it involves
simpler mathematical operations and activates only a few neurons.
 If we encounter a case of dead neurons in our networks, the leaky ReLU function is
the best choice.
 Always keep in mind that the ReLU function should only be used in the hidden
layers. At the current time, ReLU works most of the time as a general approximator.
 Variants of ReLU - Leaky ReLU, Parametric ReLU, and Exponential Linear Unit (ELU).

4.3.3.5 SoftMax Activation Function:-

 Softmax is an activation function that scales numbers/logits into probabilities.
 The output of a Softmax is a vector (say v) with the probabilities of each possible
outcome.
 The probabilities in vector v sum to one over all possible outcomes or classes.
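
A minimal softmax sketch; the max-subtraction is a standard numerical-stability trick, and the logits are illustrative.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

v = softmax(np.array([2.0, 1.0, 0.1]))   # illustrative logits
print(v, v.sum())                         # ~[0.659 0.242 0.099], sums to 1.0
```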


5. Loss Functions: -
 A loss function is a mathematical function that measures the difference between the
predicted output of a model and the actual target values.
 It provides a measure of how well a model is performing, guiding the optimization
process by adjusting model parameters to minimize this difference.

 Types of Loss Functions: -

1. Regression Loss Functions: -


 Mean Squared Error (MSE)
 Mean Absolute Error (MAE)
 Huber Loss

2. Classification Loss Functions: -


 Binary Cross-Entropy Loss (Log Loss)
 Categorical Cross-Entropy Loss

5.1 Why are Loss Functions Important?


Loss functions play a crucial role in deep learning, serving as the foundation for
training and optimizing neural networks. They are important because they: -
 Guide Model Training
 Enable Learning Through Backpropagation
 Help in Model Performance Evaluation
 Adapt to Different Problem Types
 Control Sensitivity to Errors
 Impact on Model Convergence

 Regression Loss Functions: -


1. Mean Squared Error (MSE) /Squared loss/ L2 loss:-

 The Mean Squared Error (MSE) Loss is one of the most widely used loss
functions for regression tasks.
 It calculates the average of the squared differences between the predicted
values and the actual values.


 The mathematical equation for Mean Squared Error (MSE): -

MSE = (1/n) * ∑ᵢ (yᵢ - ŷᵢ)²

Where:
 n is the number of samples in the dataset
 yᵢ is the actual (target) value for the i-th sample
 ŷᵢ is the predicted value for the i-th sample

 Advantage: -
 Simple to compute and understand.
 Differentiable, making it suitable for gradient-based optimization algorithms.

 Disadvantage: -
 Sensitive to outliers because the errors are squared, which can disproportionately
affect the loss.

2. Mean Absolute Error/ L1 loss Functions: -


 The Mean Absolute Error (MAE) Loss is another commonly used loss function for
regression.
 It calculates the average of the absolute differences between the predicted values
and the actual values.
 The mathematical equation for Mean Absolute Error (MAE): -

MAE = (1/n) * ∑ᵢ |yᵢ - ŷᵢ|

Where:
 n is the number of samples in the dataset
 yᵢ is the actual (target) value for the i-th sample
 ŷᵢ is the predicted value for the i-th sample

 Advantage: -
 Less sensitive to outliers compared to MSE.
 Simple to compute and interpret.

 Disadvantage: -
 Not differentiable at zero, which can pose issues for some optimization algorithms.


3. Huber Loss: -
 Huber Loss combines the advantages of MSE and MAE. It is less sensitive to outliers
than MSE.

 The mathematical equation for Huber Loss: -

L_δ(y, ŷ) = (1/2)(y - ŷ)², if |y - ŷ| ≤ δ
L_δ(y, ŷ) = δ * (|y - ŷ| - δ/2), otherwise

Where: -
 n: The number of data points.
 y: The actual value (true value) of the data point.
 ŷ: The predicted value returned by the model.
 δ: Defines the point where the Huber loss transitions from quadratic to linear.

 Advantage: -
 Robust to outliers, providing a balance between MSE and MAE.
 Differentiable, facilitating gradient-based optimization.

 Disadvantage: -
 Requires tuning of the parameter δ.

 Classification Loss Functions: -


1. Binary Cross Entropy/log loss Functions: -
 Binary Cross-Entropy Loss, also known as Log Loss, is used for binary
classification problems.
 It measures the performance of a classification model whose output is a
probability value between 0 and 1.


 The mathematical equation for Binary Cross-Entropy: -

BCE = -(1/n) * ∑ᵢ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]

where n is the number of data points, yᵢ is the actual binary label (0 or 1), and ŷᵢ is the
predicted probability.
 Advantage: -
 Suitable for binary classification.
 Differentiable, making it useful for gradient-based optimization.

 Disadvantage: -
 It can be sensitive to imbalanced datasets.

2. Categorical Cross-Entropy Loss: -


 Categorical Cross-Entropy Loss is used for multiclass classification problems.
 It measures the performance of a classification model whose output is a probability
distribution over multiple classes.
 The mathematical equation for Categorical Cross-Entropy Loss: -

CCE = -(1/n) * ∑ᵢ ∑ⱼ yᵢⱼ log(ŷᵢⱼ), with j running over the k classes

 where n is the number of data points, k is the number of classes, yᵢⱼ is the binary
indicator (0 or 1) of whether class label j is the correct classification for data point i,
and ŷᵢⱼ is the predicted probability for class j.

 Advantage: -
 Suitable for multiclass classification.
 Differentiable and widely used in neural networks.

 Disadvantage: -
 Not suitable for sparse targets.
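
Hedged NumPy sketches of the loss functions above, with illustrative inputs; the epsilon clipping in binary cross-entropy is a common safeguard against log(0), not something from the original notes.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    return np.mean(np.where(err <= delta,
                            0.5 * err ** 2,
                            delta * (err - 0.5 * delta)))

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(mse(y_true, y_pred), mae(y_true, y_pred),
      huber(y_true, y_pred), binary_cross_entropy(y_true, y_pred))
```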

6. Gradient Descent: -
 Gradient Descent is an optimization algorithm used to minimize the loss function
in deep learning by updating the model's parameters iteratively.


 It plays a crucial role in training neural networks by adjusting weights to reduce


the error between predicted and actual outputs.
 It helps in finding the local minimum of a function.
 The behaviour of gradient descent near a local minimum or maximum can be
described as follows: -
 If we move towards the negative gradient, i.e., away from the gradient of the
function at the current point, we approach the local minimum of that function.

 Whenever we move towards the positive gradient, i.e., towards the gradient of
the function at the current point, we approach the local maximum of that
function.

 This entire procedure is known as Gradient descent, which is also known as steepest
descent. The main objective of using a gradient descent algorithm is to minimize
the cost function using iteration.
 The goal of the gradient descent algorithm is to minimize the given function. To
achieve this goal, it performs two steps iteratively (see the sketch after this list):
 Compute the first-order derivative of the function to obtain the gradient
or slope at the current point.
 Move in the direction opposite to the gradient by alpha times its
magnitude, where alpha is defined as the Learning Rate. It is a tuning
parameter in the optimization process which helps to decide the length
of the steps.
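
A minimal sketch of these two steps on the one-dimensional function f(w) = (w - 3)², whose minimum is at w = 3; the starting point and learning rate are illustrative.

```python
# Gradient descent on f(w) = (w - 3)^2.
w = 0.0          # initial parameter (assumed)
alpha = 0.1      # learning rate (assumed)

for step in range(100):
    grad = 2 * (w - 3)    # first-order derivative f'(w)
    w = w - alpha * grad  # move against the gradient

print(w)  # ~3.0, the local (here also global) minimum
```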


6.1 Types of Gradient Descent: -


Gradient Descent learning algorithm can be divided into 3 types:-
 Batch gradient descent
 Stochastic gradient descent
 Mini-batch gradient descent

6.1.1 Batch gradient descent: -


 Batch gradient descent (BGD) is used to find the error for each point in the training set
and update the model after evaluating all training examples. This procedure is known
as the training epoch.
 In simple words, it is a greedy approach where we have to sum over all examples for
each update.

 Advantages of Batch gradient descent: -


 It produces less noise in comparison to other gradient descent.
 It produces stable gradient descent convergence.
 It is computationally efficient, as all resources are used for all training samples.

6.1.2 Stochastic gradient descent: -


 Stochastic gradient descent (SGD) is a type of gradient descent that runs one training
example per iteration.
 As it requires only one training example at a time, hence it is easier to store in allocated
memory.
 It shows some computational efficiency losses in comparison to batch gradient
systems.
 It can be helpful in finding the global minimum and also escaping the local minimum.

 Advantages of Stochastic gradient descent: -


 It is easier to allocate in desired memory.
 It is relatively fast to compute than batch gradient descent.
 It is more efficient for large datasets.


6.1.3 Mini-batch gradient descent: -


 Mini Batch gradient descent is the combination of both batch gradient descent and
stochastic gradient descent.
 It divides the training datasets into small batch sizes then performs the updates on those
batches separately.
 Splitting training datasets into smaller batches strikes a balance between the
computational efficiency of batch gradient descent and the speed of stochastic
gradient descent.
 It achieves a special type of gradient descent with higher computational efficiency
and a less noisy gradient.

 Advantages of Mini-batch gradient descent: -


 It is easier to fit in allocated memory.
 It is computationally efficient.
 It produces stable gradient descent convergence.
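
A hedged sketch contrasting the three update schedules on a simple linear-regression objective (y = w*x with squared loss); the data, batch size, and learning rate are assumed for illustration, and the comments note how the batch and stochastic variants differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 4.0 * X + rng.normal(scale=0.1, size=200)   # true slope ~4 (assumed data)

def grad(w, xb, yb):
    # Gradient of the mean squared error with respect to w.
    return np.mean(2 * (w * xb - yb) * xb)

w, alpha, batch = 0.0, 0.05, 32
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):        # mini-batch gradient descent
        b = idx[start:start + batch]
        w -= alpha * grad(w, X[b], y[b])
    # Batch GD would call grad(w, X, y) once per epoch on all samples;
    # stochastic GD would update one sample at a time (batch size 1).

print(w)  # ~4.0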

7. Feedforward Neural Network: -

 A feedforward neural network is one of the simplest types of artificial neural networks
where connections between the nodes do not form cycles.
 The network consists of an input layer, one or more hidden layers, and an output layer.
 In this network, the information moves in only one direction, forward, from the
input nodes, through the hidden nodes, to the output nodes.


7.1 Architecture of Feedforward Neural Networks: -

The architecture of a feedforward neural network consists of three types of layers: the
input layer, hidden layers, and the output layer. Each layer is made up of units known
as neurons, and the layers are interconnected by weights.
 Input Layer: This layer consists of neurons that receive inputs and pass them on to
the next layer. The number of neurons in the input layer is determined by the
dimensions of the input data.
 Hidden Layers: These layers are not exposed to the input or output and can be
considered as the computational engine of the neural network. Each hidden layer's
neurons take the weighted sum of the outputs from the previous layer, apply
an activation function, and pass the result to the next layer. The network can have zero
or more hidden layers.
 Output Layer: The final layer that produces the output for the given inputs. The
number of neurons in the output layer depends on the number of possible outputs the
network is designed to produce.

 Applications of Feedforward Neural Networks: -


Feedforward neural networks are used in a variety of deep learning tasks including:

 Pattern recognition
 Classification tasks
 Regression analysis
 Image recognition
 Time series prediction

1. Problem on Feedforward Neural Networks: -

Network Structure:
 Input Layer: 2 neurons
 Hidden Layer: 2 neurons (ReLU activation)
 Output Layer: 1 neuron (Sigmoid activation)
Weights and Biases:
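The actual weight and bias values are given as figures in the original notes; the sketch below wires up the same 2-2-1 structure with assumed values so the forward computation can be followed.

```python
import numpy as np

# Assumed weights and biases (the originals appear in the notes' figures).
W1 = np.array([[0.5, -0.2],
               [0.3,  0.8]])      # hidden layer: 2 neurons x 2 inputs
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.7, -0.4]])      # output layer: 1 neuron x 2 hidden
b2 = np.array([0.2])

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0])          # illustrative input
h = relu(W1 @ x + b1)             # hidden layer (ReLU)
y = sigmoid(W2 @ h + b2)          # output layer (Sigmoid)
print(h, y)
```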


8. Training Neural Network with Back-propagation: -


 Backpropagation is a technique used in deep learning to train artificial neural networks
particularly feed-forward networks.
 It works iteratively to adjust weights and bias to minimize the cost function.
 Backpropagation algorithm calculates the gradient of the error function.
 Backpropagation can be written as a function of the neural network.
 Backpropagation algorithms are a set of methods used to efficiently train artificial
neural networks following a gradient descent approach which exploits the chain rule.
8.1 Features of Backpropagation: -
 The main features of backpropagation are that it is an iterative, recursive, and
efficient method for calculating updated weights, improving the network until it
can perform the task for which it is being trained.
 Backpropagation requires the derivatives of the activation functions to be known
at network design time.

8.2 Steps Involved in Training a Neural Network Using Backpropagation: -


1. Initialization: -
 Randomly initialize the weights and biases of the network with small values (e.g.,
between −1 and 1).
 Set the learning rate η.
2. Forward Propagation: -
 In forward propagation, the network makes a prediction using the current weights
and biases.
 Input Layer to Hidden Layer: h = f(W1*x + b1)

 Hidden Layer to Output Layer: ŷ = g(W2*h + b2)


3. Compute the Loss: -
 The difference between the predicted and actual output is measured using a loss
function.
 For binary classification, the Binary Cross-Entropy Loss is commonly used:
 L = -(1/n) * ∑ᵢ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]

 For regression, Mean Squared Error (MSE) is used:
 L = (1/n) * ∑ᵢ (yᵢ - ŷᵢ)²
4. Backward Propagation (Error Calculation):-


In this step, the network adjusts its weights and biases by propagating the error backward.
Step 1: Compute Gradients for the Output Layer.
Step 2: Compute Gradients for the Hidden Layer.
5. Update Weights and Biases: -
 Each parameter is moved against its gradient: W = W - η * (∂L/∂W) and
b = b - η * (∂L/∂b).


9. Hyper parameters: -


 DropOut:-
 Deep learning neural networks are likely to quickly overfit a training dataset with few
examples.
 A larger/deeper NN is also likely to overfit and hence generalize poorly.
 Dropout is a regularization method used to prevent model overfitting.
 It simulates a large number of different network architectures from a single model by
randomly dropping out few neurons from each layer during each training iteration.
 It is a very computationally cheap and remarkably effective regularization method to
reduce overfitting and improve generalization error in deep neural networks of all
kinds.

 It can be used with most types of layers, such as dense fully connected layers,
convolutional layers, and recurrent layers such as the long short-term memory network
layer.
 Dropout may be implemented on any or all hidden layers in the network as well as the
visible or input layer. It is not used on the output layer.
 The term “dropout” refers to dropping out units (hidden and visible) in a neural
network.
 Dropout is not used after training when making a prediction with the fit network.


 The dropout hyperparameter specifies the probability at which outputs of the layer are
dropped out (inversely, the probability at which inputs to the layer are retained).
 A small dropout value of 20%-50% of neurons is generally used.
 A common value is a probability of 0.5 for retaining the output of each node in a hidden
layer (dropout rate 0.5) and a value close to 1.0, such as 0.8, for retaining inputs from the
visible layer (dropout rate 0.2).
 The weights of the network will be larger than normal because of dropout.
 Hence the weights are scaled down using the chosen dropout rate before the network
is used.
 The network can then be used as normal to make predictions.
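
A hedged sketch of dropout applied to a layer's activations. This uses the "inverted dropout" variant, which scales the surviving activations up during training so that no rescaling is needed at prediction time; it is a common alternative to the scale-down-after-training scheme described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    if not training:
        return activations                       # dropout is off at prediction
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep  # randomly drop units
    return activations * mask / keep             # scale up the survivors

h = np.array([0.2, 1.5, 0.7, 0.9])
print(dropout(h, rate=0.5, training=True))
print(dropout(h, training=False))                # unchanged at inference
```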

 The Problem of Overfitting in Deep Learning:-


 Overfitting in deep learning occurs when a model learns the training data too
well, including noise and irrelevant details, leading to poor generalization
performance on new, unseen data.
 This means the model performs well on the training data but fails to make
accurate predictions on real-world data.

 Causes of Overfitting: -
1. Insufficient Training Data: -
 When the dataset is small, the model might memorize specific patterns
instead of learning general features.

2. Excessive Model Complexity: -


 Using too many layers, neurons, or parameters can make the model overly
flexible, capturing noise and random fluctuations.

3. Lack of Regularization: -
 Without regularization techniques, the model becomes prone to overfitting.

4. Longer Training Duration: -


 Prolonged training may lead the model to memorize the training data rather
than generalizing patterns.

5. Imbalanced Data Distribution: -


 If the dataset is not diverse, the model may overfit to dominant patterns.


 The Vanishing and Exploding Gradient Problems:-

 What is a gradient ?
 A gradient is a measure of how much the output variable changes for a
small change in the input.
 The gradient is used to update/learn the model parameters — weights and
biases.
 The parameter update rule is: Wnew = Wold - η * (∂L/∂Wold)
 If the derivative term in the above equation is too small, there will be a
very small change in W.
 Hence the new and old weights are almost the same, and no learning takes place.
 The weights of the initial layers would continue to remain unchanged (or
only change by a negligible amount), no matter how many epochs you run
with the backpropagation algorithm.

 Problem of Vanishing Gradient: -


 As more layers using certain activation functions are added to neural networks,
the gradients of the loss function approach zero, making the network hard to
train.
 Certain activation functions, like the sigmoid function, squash a large input
space into a small output space between 0 and 1.
 Therefore, a large change in the input of the sigmoid function will cause a small
change in the output. Hence, the derivative becomes small.
 When the inputs of the sigmoid function become larger or smaller (when |x|
becomes bigger), the derivative becomes close to zero. This is the vanishing
gradient problem.
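
A small sketch of why this happens: the sigmoid derivative never exceeds 0.25, so the chain rule multiplies such factors across layers and the gradient shrinks geometrically; the layer counts are illustrative.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

g = sigmoid_grad(0.0)   # 0.25, the maximum possible value
for layers in [1, 5, 10, 20]:
    print(layers, "layers -> gradient factor ~", g ** layers)
```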

 Ways to detect whether your deep network is suffering from the vanishing
gradient problem: -
 The model will improve very slowly during the training phase and it is also
possible that training stops very early, meaning that any further training does
not improve the model.
 The weights closer to the output layer of the model would witness more of a
change whereas the layers that occur closer to the input layer would not change
much (if at all).
 Model weights shrink exponentially and become very small when training the
model.
 The model weights become 0 in the training phase.

 Few Solutions for vanishing gradient problem:-


 Use other activation functions, such as ReLU, which doesn’t cause a small
derivative.
 Residual networks (ResNet):-
 Use bypass/skip connections to carry information past a few layers.
 Using these connections, information can be transferred from layer n to
layer n+t.
 To perform this, the activation of layer n is connected to the
activation of layer n+t.
 This causes the gradient to pass between the layers without any
modification in size.
 A residual connection directly adds the value at the beginning of the block,
x, to the end of the block (F(x)+x).


 This residual connection doesn't go through activation functions that
"squash" the derivatives, resulting in a higher overall derivative of the
block.
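
A minimal sketch of a residual block under these assumptions (random weights, ReLU inside the block); the skip path adds x unchanged, so the block's local derivative is F'(x) + 1 rather than a product of squashing derivatives.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(x, W1, W2):
    fx = W2 @ relu(W1 @ x)   # F(x): the block's transformation
    return relu(fx + x)      # skip connection adds x before the activation

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(residual_block(x, W1, W2))
```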

Mr. Tanveer Ahmed, Asst. Professor, Dept. of Computer Science & Engineering.
