Module 02: Introduction to Deep Learning
Contents:
Feedforward Neural Networks, Backpropagation, EBPTA, Convergence and local minima,
Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD,
AdaGrad, RMSProp
Feedforward Neural Networks (FNNs)
Definition:
A Feedforward Neural Network (FNN) is a type of artificial neural network where the flow
of information is strictly unidirectional—from input to output. It is one of the simplest and
most widely used architectures in deep learning. There are no loops or cycles in the network,
and each layer processes and passes information to the next.
Architectures of Feedforward Neural Networks
1. Single-Layer Perceptron (SLP):
o Consists of only one input layer and one output layer.
o Can solve linearly separable problems using an activation function (e.g., step
or sigmoid).
o Limitation: Cannot solve problems that are not linearly separable (e.g., the XOR
problem).
2. Multi-Layer Perceptron (MLP):
o Has an input layer, one or more hidden layers, and an output layer.
o Hidden layers use activation functions like ReLU, Sigmoid, Tanh to model
non-linearity.
o MLPs can solve both linear and non-linear problems.
3. Deep Neural Network (DNN):
o An extension of MLP with a large number of hidden layers.
o Typically used for complex tasks like image recognition, speech processing,
and text analysis.
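The single-layer perceptron's limitation can be checked with a small experiment. The sketch below assumes a step activation and the classic perceptron learning rule. The names train_perceptron and accuracy, the learning rate, and the epoch count are illustrative choices, not part of these notes.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Train a single-layer perceptron with a step activation."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # The update is zero when the prediction is already correct.
            w += lr * (target - pred) * xi
            b += lr * (target - pred)
    return w, b

def accuracy(X, y, w, b):
    """Fraction of samples classified correctly."""
    preds = (X @ w + b > 0).astype(int)
    return float(np.mean(preds == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])  # AND: linearly separable
y_xor = np.array([0, 1, 1, 0])  # XOR: not linearly separable

acc_and = accuracy(X, y_and, *train_perceptron(X, y_and))
acc_xor = accuracy(X, y_xor, *train_perceptron(X, y_xor))
print("AND accuracy:", acc_and)
print("XOR accuracy:", acc_xor)
```

The perceptron fits AND perfectly but cannot classify all four XOR points, because no single straight line separates the two XOR classes. Adding a hidden layer (an MLP) removes this restriction.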
Key Components of FNNs
1. Input Layer:
o Takes features (data points) as input.
o Example: For an image, the pixel values serve as input features.
2. Hidden Layers:
o Perform computations using weights, biases, and activation functions.
o Example: Transform input features into higher-dimensional spaces.
3. Output Layer:
o Produces the final predictions or decisions (e.g., classification or regression).
4. Weights and Biases:
o Weights determine the strength of the connection between neurons.
o Biases help shift the activation function, enabling the model to learn better.
5. Activation Functions:
o Introduce non-linearity to the network.
o Examples: Sigmoid, Tanh, ReLU, Softmax.
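The four activation functions named above can be written in a few lines of NumPy. This is a minimal sketch for illustration; the function names and the sample input z are assumptions made here, and production libraries provide their own tuned versions.

```python
import numpy as np

def sigmoid(z):
    """Squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes inputs into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Returns the input if it is positive and zero otherwise."""
    return np.maximum(0.0, z)

def softmax(z):
    """Converts a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))
print("relu:   ", relu(z))
print("softmax:", softmax(z))
```

Sigmoid, Tanh, and ReLU are applied element-wise and are typical in hidden layers, while Softmax acts on a whole vector and is typically used in the output layer for classification.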
Working of Feedforward Neural Networks
1. Forward Propagation:
o The input passes through each layer.
o Each layer applies weights, biases, and activation functions to transform the
data.
2. Output Calculation:
o Final predictions are generated based on the output layer's computations.
3. Error Calculation:
o Compares the predicted output to the actual output using a loss function (e.g.,
Mean Squared Error or Cross-Entropy Loss).
4. Training (using Backpropagation):
o Weights and biases are updated iteratively to minimize the error using
optimization algorithms (e.g., Gradient Descent).
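Steps 1 to 3 above (forward propagation, output calculation, error calculation) can be sketched as follows. The layer sizes, the random data, the ReLU hidden layer, and the names forward and mse are assumptions chosen for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, W1, b1, W2, b2):
    """Forward propagation through one hidden layer and a linear output layer."""
    hidden = relu(X @ W1 + b1)  # weights, bias, then activation
    return hidden @ W2 + b2     # output layer (regression, no activation)

def mse(y_hat, y):
    """Mean Squared Error between predictions and targets."""
    return float(np.mean((y_hat - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 input features
y = rng.normal(size=(4, 1))   # 4 target values

W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)  # input -> hidden (5 units)
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)  # hidden -> output

y_hat = forward(X, W1, b1, W2, b2)
loss = mse(y_hat, y)
print("predictions shape:", y_hat.shape)
print("loss:", loss)
```

Step 4 (training) would then adjust W1, b1, W2, and b2 to reduce this loss.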
Diagram of Feedforward Neural Networks
Structure:
• Input Layer → Hidden Layers → Output Layer.
• Arrows show data flow from one layer to another.
You can visualize:
• Circles for neurons (nodes).
• Arrows for connections between neurons with weights.
• Different layers separated vertically.
Applications of Feedforward Neural Networks
1. Image Recognition:
o Used in computer vision tasks like facial recognition and object detection.
o Example: Classifying handwritten digits in the MNIST dataset.
2. Speech Recognition:
o Convert audio signals into text.
o Example: Virtual assistants like Siri or Alexa.
3. Natural Language Processing (NLP):
o Sentiment analysis, machine translation, and text classification.
o Example: Spam detection in emails.
4. Regression Tasks:
o Predict continuous values like stock prices or weather conditions.
o Example: House price prediction based on features like area and location.
5. Pattern Recognition:
o Identify patterns in medical images (e.g., detecting tumors).
o Example: MRI and X-ray analysis.
6. Recommendation Systems:
o Suggest products or content based on user preferences.
o Example: Netflix recommending shows.
Examples of Feedforward Neural Networks
1. Single-Layer Perceptron:
o Task: Predict whether an email is spam or not.
o Input: Email features like word count, sender domain, etc.
o Output: Spam (1) or Not Spam (0).
2. Multi-Layer Perceptron:
o Task: Classify handwritten digits (0-9).
o Input: Pixel values of images.
o Hidden Layers: Extract features like edges, shapes, etc.
o Output: Predicted digit.
3. Deep Neural Network:
o Task: Detect faces in an image.
o Input: Pixel matrix of the image.
o Hidden Layers: Learn features like edges, facial structures, and patterns.
o Output: Location and identity of faces in the image.
Backpropagation
Backpropagation is an algorithm used to train artificial neural networks, particularly
feed-forward networks. It works iteratively, minimizing the cost function by adjusting
weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following the error
gradient. Backpropagation computes this gradient using the chain rule from calculus, working
backward through the layers of the network. An optimization algorithm such as gradient
descent or stochastic gradient descent then uses the gradient to update the parameters and
minimize the cost function.
Why is Backpropagation Important?
Backpropagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect
to each weight using the chain rule, making it possible to update weights efficiently.
2. Scalability: The backpropagation algorithm scales well to networks with multiple
layers and complex architectures, making deep learning feasible.
3. Automated Learning: With backpropagation, the learning process becomes
automated, and the model can adjust itself to optimize its performance.
Working of Backpropagation Algorithm
The Backpropagation algorithm involves two main steps: the Forward Pass and
the Backward Pass.
How Does the Forward Pass Work?
In the forward pass, the input data is fed into the input layer. These inputs, combined with
their respective weights, are passed to hidden layers.
For example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)), the output
from h1 serves as the input to h2. Before applying an activation function, a bias is added to
the weighted inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit), which
returns the input if it’s positive and zero otherwise. This adds non-linearity, allowing the
model to learn complex relationships in the data. Finally, the outputs from the last hidden
layer are passed to the output layer, where an activation function, such as softmax, converts
the weighted outputs into probabilities for classification.
The forward pass using weights and biases
How Does the Backward Pass Work?
In the backward pass, the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method
for error calculation is the Mean Squared Error (MSE), given by:
MSE = (Predicted Output − Actual Output)^2
Once the error is calculated, the network adjusts weights using gradients, which are
computed with the chain rule. These gradients indicate how much each weight and bias
should be adjusted to minimize the error in the next iteration. The backward pass continues
layer by layer, ensuring that the network learns and improves its performance. The activation
function, through its derivative, plays a crucial role in computing these gradients during
backpropagation.
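The forward and backward passes can be sketched for a network with one hidden layer. This is a minimal illustration under stated assumptions: a sigmoid hidden layer, a linear output, a half mean squared error loss, and XOR as toy data. None of these choices or names come from the notes. The sketch also compares one backpropagated gradient against a finite-difference estimate, a common sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass: sigmoid hidden layer, linear output layer."""
    a1 = sigmoid(X @ W1 + b1)
    y_hat = a1 @ W2 + b2
    return a1, y_hat

def loss_fn(y_hat, y):
    """Half mean squared error over the batch."""
    return 0.5 * float(np.mean(np.sum((y_hat - y) ** 2, axis=1)))

def backward(X, y, a1, y_hat, W2):
    """Backward pass: apply the chain rule layer by layer, output to input."""
    n = X.shape[0]
    d_yhat = (y_hat - y) / n        # dL/dy_hat
    dW2 = a1.T @ d_yhat             # dL/dW2
    db2 = d_yhat.sum(axis=0)        # dL/db2
    d_a1 = d_yhat @ W2.T            # error propagated back to the hidden layer
    d_z1 = d_a1 * a1 * (1.0 - a1)   # times sigmoid'(z1) = a1 * (1 - a1)
    dW1 = X.T @ d_z1                # dL/dW1
    db1 = d_z1.sum(axis=0)          # dL/db1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 4)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)) * 0.5, np.zeros(1)

# Gradient check on one weight using a central finite difference.
a1, y_hat = forward(X, W1, b1, W2, b2)
analytic = backward(X, y, a1, y_hat, W2)[0][0, 0]
eps = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(forward(X, W1_plus, b1, W2, b2)[1], y)
           - loss_fn(forward(X, W1_minus, b1, W2, b2)[1], y)) / (2 * eps)

# Gradient descent steps using the backpropagated gradients.
initial_loss = loss_fn(y_hat, y)
lr = 0.1
for _ in range(1000):
    a1, y_hat = forward(X, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward(X, y, a1, y_hat, W2)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
final_loss = loss_fn(forward(X, W1, b1, W2, b2)[1], y)
print(f"analytic={analytic:.6f} numeric={numeric:.6f}")
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The line computing d_z1 is where the derivative of the activation function enters the gradient, which is why the choice of activation affects how well errors flow backward.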
Error Backpropagation Through Activation (EBPTA)
Introduction
Error Backpropagation Through Activation (EBPTA) is an essential extension of the standard
Backpropagation Algorithm used to train artificial neural networks. It is specifically
designed to enhance how errors are propagated through different activation functions in
multi-layer perceptrons (MLPs).
In traditional Backpropagation, the error is propagated backward through the network using
partial derivatives of the loss function with respect to weights and biases. However, in
EBPTA, additional focus is given to how activation functions influence the error
propagation, ensuring an optimized weight update process.
Steps of EBPTA
Differences Between Standard Backpropagation and EBPTA
Feature              | Standard Backpropagation      | EBPTA
Error Propagation    | Uses simple derivatives       | Adjusts based on activation functions
Activation Functions | May cause vanishing gradients | Handles activation functions more effectively
Weight Updates       | Direct gradient-b             | Activation-a
Speed                | Slower for deep               | Faster due to better error flow
Applications of EBPTA
• Speech Recognition (e.g., Siri, Google Assistant)
• Image Classification (e.g., Face Recognition)
• Autonomous Vehicles (e.g., Tesla’s AI models)
• Medical Diagnosis (e.g., Disease Prediction)
Convergence and Local Minima in Deep Learning
In deep learning, convergence refers to the process where the optimization algorithm (e.g.,
Gradient Descent) minimizes the loss function, improving the model’s ability to make
accurate predictions. Local minima are points where the loss function has a lower value than
its immediate surroundings but is not necessarily the absolute lowest (global minimum).
Understanding the Loss Function and Optimization
• A loss function quantifies the error between predicted and actual values.
• Optimization algorithms aim to minimize this loss function by adjusting the model’s
parameters (weights and biases).
• The optimization process involves moving in the direction where the loss decreases,
ideally reaching the global minimum (the lowest possible loss).
Local Minima vs. Global Minima
• Local Minimum: A point where the loss function is lower than nearby points but not
necessarily the lowest across the entire function.
• Global Minimum: The absolute lowest value of the loss function across the entire
search space.
• Saddle Points: Points where the gradient is zero, but they are neither local minima
nor maxima.
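The difference between a local and a global minimum can be seen on a one-dimensional example. The function f(x) = x^4 - 3x^2 + x below is an arbitrary choice made for this sketch; it has a global minimum near x = -1.30 and a local minimum near x = 1.13. The learning rate and step count are likewise assumptions.

```python
def f(x):
    """Toy loss with one global and one local minimum."""
    return x**4 - 3 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=1000):
    """Plain gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

x_left = gradient_descent(-2.0)   # ends in the global minimum
x_right = gradient_descent(2.0)   # ends in the local minimum
print(f"start -2.0 -> x = {x_left:.4f}, f(x) = {f(x_left):.4f}")
print(f"start +2.0 -> x = {x_right:.4f}, f(x) = {f(x_right):.4f}")
```

Both runs stop at a point where the gradient is zero, but only the run started on the left reaches the lowest loss. Plain gradient descent has no way to tell the two cases apart.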
Challenges in Convergence Due to Local Minima
• Deep learning models often have highly complex loss landscapes with multiple local
minima and saddle points.
• If an optimization algorithm gets stuck in a local minimum, it might not find the best
possible model parameters.
• Some models might suffer from slow convergence, requiring advanced optimization
techniques.
Factors Affecting Convergence
• Learning Rate: If too high, the model may oscillate and never converge; if too low, it
may take too long to reach the minimum.
• Initialization of Weights: Poor initialization may cause slow convergence or getting
stuck in local minima.
• Choice of Activation Function: Some activation functions (e.g., Sigmoid) can cause
vanishing gradients, leading to slow learning.
• Optimization Algorithm Used: Different optimizers (SGD, Adam, RMSProp, etc.)
influence the rate and stability of convergence.
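The effect of the learning rate can be illustrated on the simplest possible loss, f(x) = x^2, whose gradient is 2x. The three learning rates and the step count below are assumptions picked to show each behaviour.

```python
def gradient_descent(lr, x0=1.0, steps=50):
    """Minimize f(x) = x^2 with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # gradient of x^2 is 2x
    return x

x_too_high = gradient_descent(lr=1.1)    # overshoots more on every step
x_good = gradient_descent(lr=0.1)        # converges quickly toward 0
x_too_low = gradient_descent(lr=0.001)   # barely moves in 50 steps

print("lr=1.1  :", x_too_high)
print("lr=0.1  :", x_good)
print("lr=0.001:", x_too_low)
```

With lr = 1.1 each update multiplies x by -1.2, so the iterate oscillates with growing magnitude. With lr = 0.001 each update multiplies x by 0.998, so progress is very slow.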
Techniques to Overcome Local Minima
1. Momentum-Based Gradient Descent: Uses past gradients to accelerate learning.
2. Nesterov Accelerated Gradient Descent (NAG): A variant of momentum that better
anticipates updates.
3. Adaptive Learning Rate Methods:
o AdaGrad: Scales each parameter's learning rate by the accumulated sum of its past
squared gradients.
o RMSProp: Divides the learning rate by the root of a moving average of past squared
gradients.
o Adam (Adaptive Moment Estimation): Combines momentum and RMSProp
for faster convergence.
4. Batch Normalization: Normalizes the inputs to each layer over a mini-batch to speed up
training.
5. Dropout Regularization: Randomly drops neurons during training. This mainly reduces
overfitting; the added noise may also help the optimizer move out of poor local minima.
6. Early Stopping: Prevents overfitting by stopping training when the validation loss no
longer decreases significantly.
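The three adaptive learning rate methods in item 3 differ only in how they scale the gradient. The sketch below applies each to the toy loss f(w) = 0.5 * (10 * w0^2 + w1^2). The loss, the hyperparameter values, and the function names are assumptions for illustration; the update rules follow the commonly published forms of AdaGrad, RMSProp, and Adam.

```python
import numpy as np

def loss(w):
    """Toy quadratic loss that is much steeper along w[0] than along w[1]."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([10.0, 1.0]) * w

def adagrad(w, lr=0.5, steps=200, eps=1e-8):
    g2_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        g2_sum += g ** 2                          # accumulate squared gradients
        w = w - lr * g / (np.sqrt(g2_sum) + eps)
    return w

def rmsprop(w, lr=0.01, beta=0.9, steps=200, eps=1e-8):
    g2_avg = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        g2_avg = beta * g2_avg + (1 - beta) * g ** 2  # moving average of g^2
        w = w - lr * g / (np.sqrt(g2_avg) + eps)
    return w

def adam(w, lr=0.05, beta1=0.9, beta2=0.999, steps=200, eps=1e-8):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g           # momentum term
        v = beta2 * v + (1 - beta2) * g ** 2      # RMSProp-style term
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
results = {name: loss(opt(w0.copy())) for name, opt in
           [("AdaGrad", adagrad), ("RMSProp", rmsprop), ("Adam", adam)]}
print("initial loss:", loss(w0))
for name, value in results.items():
    print(f"{name}: final loss = {value:.6f}")
```

Because each method divides by a per-parameter gradient scale, the steep direction w[0] and the shallow direction w[1] receive comparably sized steps even though their raw gradients differ by a factor of ten.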
(A graph of a loss function with multiple dips, showing global and local minima, and
saddle points.)
Gradient Descent (GD) and Its Types
Gradient Descent (GD) is an optimization algorithm used to minimize the loss function in
machine learning and deep learning models by adjusting the model’s parameters (weights and
biases). It works by computing the gradient (derivative) of the loss function and updating the
parameters in the direction that reduces the error.
Using these formulas, we compute new values for m and c iteratively.
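As an illustration of iterating on m and c, the sketch below assumes that they are the slope and intercept of a line y = m*x + c fitted with a Mean Squared Error loss. That reading, the toy data, the learning rate, and the name fit_line are assumptions made here, since the formulas themselves are not reproduced in this section.

```python
import numpy as np

def fit_line(x, y, lr=0.05, epochs=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + c
        # Partial derivatives of MSE = (1/n) * sum((y - y_pred)^2)
        dm = (-2.0 / n) * np.sum(x * (y - y_pred))
        dc = (-2.0 / n) * np.sum(y - y_pred)
        # Move against the gradient.
        m -= lr * dm
        c -= lr * dc
    return m, c

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0            # data generated from m = 2, c = 1

m, c = fit_line(x, y)
print(f"m = {m:.4f}, c = {c:.4f}")
```

Starting from m = 0 and c = 0, the repeated updates recover values close to the slope and intercept that generated the data.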
Real-World Applications of Gradient Descent
1. Stock Price Prediction
o Companies like Bloomberg and Goldman Sachs use GD to predict stock
market trends.
2. Self-Driving Cars
o Tesla and Waymo use GD in Neural Networks for lane detection and obstacle
recognition.
3. Medical Diagnosis
o AI models in hospitals use GD to detect cancer and diseases from MRI and CT
scans.
4. Chatbots & NLP
o Siri and Google Assistant use GD in training NLP models for voice
recognition.
Types of Gradient Descent
Advanced Variants of Gradient Descent / Accelerated GD
The two most used accelerated gradient descent methods are:
1. Momentum-Based GD
2. Nesterov Accelerated Gradient (NAG)
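The two methods differ in where the gradient is evaluated. The sketch below uses one common formulation of each update on the toy loss f(w) = 0.5 * (10 * w0^2 + w1^2). The loss, the learning rate, the momentum coefficient gamma, and the function names are assumptions for illustration.

```python
import numpy as np

def loss(w):
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([10.0, 1.0]) * w

def momentum_gd(w, lr=0.02, gamma=0.9, steps=200):
    """Momentum: the velocity accumulates a decaying sum of past gradients."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + lr * grad(w)
        w = w - v
    return w

def nesterov_gd(w, lr=0.02, gamma=0.9, steps=200):
    """NAG: evaluate the gradient at the look-ahead point w - gamma * v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - gamma * v
        v = gamma * v + lr * grad(lookahead)
        w = w - v
    return w

w0 = np.array([1.0, 1.0])
loss_momentum = loss(momentum_gd(w0.copy()))
loss_nesterov = loss(nesterov_gd(w0.copy()))
print("initial loss:", loss(w0))
print("momentum final loss:", loss_momentum)
print("nesterov final loss:", loss_nesterov)
```

Momentum computes the gradient at the current position and then adds the accumulated velocity. NAG first moves to where the velocity is about to carry the parameters and computes the gradient there, which lets it correct course earlier.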