Unit 1 Part 1

Introduction to Deep Learning

In the fast-evolving era of artificial intelligence, Deep Learning stands as a cornerstone technology,
revolutionizing how machines understand, learn, and interact with complex data. At its essence,
Deep Learning AI mimics the intricate neural networks of the human brain, enabling computers to
autonomously discover patterns and make decisions from vast amounts of unstructured data. This
transformative field has propelled breakthroughs across various domains, from computer vision and
natural language processing to healthcare diagnostics and autonomous driving.

As we dive into this introductory exploration of Deep Learning, we uncover its foundational
principles, applications, and the underlying mechanisms that empower machines to achieve human-
like cognitive abilities. This article serves as a gateway into understanding how Deep Learning is
reshaping industries, pushing the boundaries of what’s possible in AI, and paving the way for a future
where intelligent systems can perceive, comprehend, and innovate autonomously.

What is Deep Learning?

Deep learning is the branch of machine learning that is based on artificial neural network architectures. An artificial neural network (ANN) uses layers of interconnected nodes, called neurons, that work together to process and learn from the input data.

In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the
input layer. The output of one neuron becomes the input to other neurons in the next layer of the
network, and this process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of nonlinear transformations,
allowing the network to learn complex representations of the input data.
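The layer-by-layer flow described above can be sketched directly. The following is a minimal, illustrative example (assuming NumPy; the function names and the tiny 2→3→1 network are our own choices, not from any particular library):

```python
import numpy as np

def relu(z):
    # Nonlinear transformation applied at each layer
    return np.maximum(0.0, z)

def forward(x, layers):
    """Pass an input vector through a stack of fully connected layers.

    Each neuron computes a weighted sum of the previous layer's outputs
    plus a bias; the result is passed through a nonlinearity before
    becoming the input to the next layer.
    """
    a = x
    for W, b in layers:
        a = relu(W @ a + b)
    return a

# A tiny 2 -> 3 -> 1 network with randomly initialized weights.
rng = np.random.default_rng(0)
layers = [
    (rng.standard_normal((3, 2)), np.zeros(3)),  # input layer -> hidden layer
    (rng.standard_normal((1, 3)), np.zeros(1)),  # hidden layer -> output layer
]
y = forward(np.array([0.5, -1.0]), layers)
```

The output of each layer becomes the input to the next, exactly as in the text: the final `a` is the network's output.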
Scope of Deep Learning

Today deep learning has become one of the most popular and visible areas of machine learning, due to its success in a variety of applications such as computer vision, natural language processing, and reinforcement learning.

Deep learning can be used for supervised, unsupervised, as well as reinforcement machine learning, and it uses a different approach to process each of these.

 Supervised Machine Learning: Supervised machine learning is the technique in which the neural network learns to make predictions or to classify data based on labeled datasets. Here we supply both the input features and the target variables. The neural network learns to make predictions based on the cost or error that comes from the difference between the predicted and the actual target; the process of propagating this error backward to update the weights is known as backpropagation. Deep learning algorithms such as convolutional neural networks and recurrent neural networks are used for many supervised tasks like image classification and recognition, sentiment analysis, and language translation.

 Unsupervised Machine Learning: Unsupervised machine learning is the technique in which the neural network learns to discover patterns in, or to cluster, a dataset without labels. Here there are no target variables; the machine has to determine the hidden patterns or relationships within the data on its own. Deep learning algorithms such as autoencoders and generative models are used for unsupervised tasks like clustering, dimensionality reduction, and anomaly detection.

 Reinforcement Machine Learning: Reinforcement machine learning is the technique in which an agent learns to make decisions in an environment so as to maximize a reward signal. The agent interacts with the environment by taking actions and observing the resulting rewards. Deep learning can be used to learn policies, or sets of actions, that maximize the cumulative reward over time. Deep reinforcement learning algorithms such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) are used for tasks like robotics and game playing.
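To make the supervised case concrete, here is a minimal sketch (assuming NumPy; the single-neuron model and the tiny AND dataset are purely illustrative choices) of learning from labeled data by repeatedly shrinking the error between predictions and targets:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Labeled dataset: input features X and target variable t (logical AND).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])

w, b, lr = np.zeros(2), 0.0, 0.5

for _ in range(2000):
    y = sigmoid(X @ w + b)          # predict from current weights
    err = y - t                     # difference: predicted vs. actual target
    w -= lr * (X.T @ err) / len(t)  # adjust weights against the error
    b -= lr * err.mean()            # adjust bias against the error

pred = (sigmoid(X @ w + b) > 0.5).astype(float)
```

The gradient used here is the standard one for a sigmoid unit with cross-entropy loss; in a multi-layer network the same error signal would be propagated backward through every layer, which is what backpropagation does.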

Artificial neural networks

Artificial neural networks are built on the principles of the structure and operation of human neurons. They are also known as neural networks or neural nets. An artificial neural network’s input layer,
which is the first layer, receives input from external sources and passes it on to the hidden layer,
which is the second layer. Each neuron in the hidden layer gets information from the neurons in the
previous layer, computes the weighted total, and then transfers it to the neurons in the next layer.
These connections are weighted, which means that the impacts of the inputs from the preceding
layer are more or less optimized by giving each input a distinct weight. These weights are then
adjusted during the training process to enhance the performance of the model.
Fully Connected Artificial Neural Network

Artificial neurons, also known as units, are found in artificial neural networks. The whole artificial neural network is composed of these artificial neurons, which are arranged in a series of layers. Whether a layer has a dozen units or millions of units, the complexity of a neural network depends on the complexity of the underlying patterns in the dataset. Commonly, an artificial neural network has an input layer, an output layer, as well as hidden layers. The input layer receives data from the outside world which the neural network needs to analyze or learn about.

In a fully connected artificial neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the
input layer. The output of one neuron becomes the input to other neurons in the next layer of the
network, and this process continues until the final layer produces the output of the network. Then,
after passing through one or more hidden layers, this data is transformed into valuable data for the
output layer. Finally, the output layer provides an output in the form of an artificial neural network’s
response to the data that comes in.

Units are linked to one another from one layer to the next in most neural networks. Each of these links has a weight that controls how much one unit influences another. The neural network learns more and more about the data as it moves from one unit to another, ultimately producing an output from the output layer.

Difference between Machine Learning and Deep Learning:

Machine learning and deep learning are both subsets of artificial intelligence, but there are many similarities and differences between them.

 Approach: Machine learning applies statistical algorithms to learn the hidden patterns and relationships in the dataset. Deep learning uses artificial neural network architectures to learn those hidden patterns and relationships.

 Dataset size: Machine learning can work on a smaller amount of data. Deep learning requires a larger volume of data compared to machine learning.

 Task complexity: Machine learning is better for simpler, lower-complexity tasks. Deep learning is better for complex tasks like image processing, natural language processing, etc.

 Training time: Machine learning takes less time to train a model. Deep learning takes more time to train a model.

 Feature extraction: In machine learning, a model is created from relevant features that are manually extracted from images to detect an object in the image. In deep learning, relevant features are automatically extracted from images; it is an end-to-end learning process.

 Interpretability: Machine learning models are less complex, and it is easy to interpret their results. Deep learning models are more complex, work like a black box, and their results are not easy to interpret.

 Hardware: Machine learning can work on a CPU, or requires less computing power compared to deep learning. Deep learning requires a high-performance computer with a GPU.

Types of neural networks

Deep Learning models are able to automatically learn features from the data, which makes them
well-suited for tasks such as image recognition, speech recognition, and natural language processing.
The most widely used architectures in deep learning are feedforward neural networks, convolutional
neural networks (CNNs), and recurrent neural networks (RNNs).

1. Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of
information through the network. FNNs have been widely used for tasks such as image
classification, speech recognition, and natural language processing.

2. Convolutional Neural Networks (CNNs) are designed specifically for image and video recognition tasks. CNNs are able to automatically learn features from images, which makes them well-suited for tasks such as image classification, object detection, and image segmentation.
3. Recurrent Neural Networks (RNNs) are a type of neural network that is able to process
sequential data, such as time series and natural language. RNNs are able to maintain an
internal state that captures information about the previous inputs, which makes them well-
suited for tasks such as speech recognition, natural language processing, and language
translation.

Deep Learning Applications:

The main applications of deep learning AI can be divided into computer vision, natural language
processing (NLP), and reinforcement learning.

1. Computer vision

The first deep learning application area is computer vision. In computer vision, deep learning models enable machines to identify and understand visual data. Some of the main applications of deep learning in computer vision include:

 Object detection and recognition: Deep learning models can be used to identify and locate objects within images and videos, enabling applications such as self-driving cars, surveillance, and robotics.

 Image classification: Deep learning models can be used to classify images into categories
such as animals, plants, and buildings. This is used in applications such as medical imaging,
quality control, and image retrieval.

 Image segmentation: Deep learning models can be used to segment images into different regions, making it possible to identify specific features within them.

2. Natural language processing (NLP):

The second deep learning application area is natural language processing. In NLP, deep learning models enable machines to understand and generate human language. Some of the main applications of deep learning in NLP include:

 Automatic text generation: Deep learning models can learn from a corpus of text, and new text like summaries and essays can be automatically generated using these trained models.

 Language translation: Deep learning models can translate text from one language to
another, making it possible to communicate with people from different linguistic
backgrounds.

 Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text,
making it possible to determine whether the text is positive, negative, or neutral. This is used
in applications such as customer service, social media monitoring, and political analysis.

 Speech recognition: Deep learning models can recognize and transcribe spoken words,
making it possible to perform tasks such as speech-to-text conversion, voice search, and
voice-controlled devices.

3. Reinforcement learning:

In reinforcement learning, deep learning is used to train agents to take actions in an environment so as to maximize a reward. Some of the main applications of deep learning in reinforcement learning include:
 Game playing: Deep reinforcement learning models have been able to beat human experts
at games such as Go, Chess, and Atari.

 Robotics: Deep reinforcement learning models can be used to train robots to perform
complex tasks such as grasping objects, navigation, and manipulation.

 Control systems: Deep reinforcement learning models can be used to control complex
systems such as power grids, traffic management, and supply chain optimization.

Challenges in Deep Learning

Deep learning has made significant advancements in various fields, but there are still some
challenges that need to be addressed. Here are some of the main challenges in deep learning:

1. Data availability: Deep learning requires large amounts of data to learn from, and gathering enough data for training is a big concern.

2. Computational resources: Training a deep learning model is computationally expensive because it requires specialized hardware like GPUs and TPUs.

3. Time-consuming: Training on sequential data can take a very long time, even days or months, depending on the computational resources available.

4. Interpretability: Deep learning models are complex and work like a black box, so it is very difficult to interpret their results.

5. Overfitting: When a model is trained again and again on the same data, it becomes too specialized for the training data, leading to overfitting and poor performance on new data.

Advantages of Deep Learning:

1. High accuracy: Deep Learning algorithms can achieve state-of-the-art performance in various
tasks, such as image recognition and natural language processing.

2. Automated feature engineering: Deep Learning algorithms can automatically discover and
learn relevant features from data without the need for manual feature engineering.

3. Scalability: Deep Learning models can scale to handle large and complex datasets, and can
learn from massive amounts of data.

4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can handle
various types of data, such as images, text, and speech.

5. Continual improvement: Deep Learning models can continually improve their performance
as more data becomes available.

Disadvantages of Deep Learning:

1. High computational requirements: Deep Learning AI models require large amounts of data
and computational resources to train and optimize.

2. Requires large amounts of labeled data: Deep learning models often require a large amount of labeled data for training, which can be expensive and time-consuming to acquire.

3. Interpretability: Deep learning models can be challenging to interpret, making it difficult to understand how they make decisions.

4. Overfitting: Deep learning models can sometimes overfit to the training data, resulting in poor performance on new and unseen data.

5. Black-box nature: Deep learning models are often treated as black boxes, making it difficult to understand how they work and how they arrived at their predictions.

Conclusion

In conclusion, the field of Deep Learning represents a transformative leap in artificial intelligence. By
mimicking the human brain’s neural networks, Deep Learning AI algorithms have revolutionized
industries ranging from healthcare to finance, from autonomous vehicles to natural language
processing. As we continue to push the boundaries of computational power and dataset sizes, the
potential applications of Deep Learning are limitless. However, challenges such as interpretability and
ethical considerations remain significant. Yet, with ongoing research and innovation, Deep Learning
promises to reshape our future, ushering in a new era where machines can learn, adapt, and solve
complex problems at a scale and speed previously unimaginable.

Common architectural principles of deep networks:-


Deep learning architectures are guided by a set of fundamental principles that help design
effective neural networks for a variety of tasks. Here are some of the common architectural
principles of deep networks:

1. Layer Stacking and Depth


 Concept: Deep networks rely on multiple layers to extract hierarchical
representations of data.
o Shallow layers capture low-level features (e.g., edges in images).
o Deeper layers capture higher-level, abstract features.
 Principle: Increasing depth can improve model expressiveness, but excessive depth
can lead to issues like vanishing gradients or overfitting.

2. Nonlinear Transformations
 Purpose: Nonlinear activation functions (e.g., ReLU, Sigmoid, Tanh) enable networks
to model complex, non-linear relationships.
 Principle: Use activation functions that balance computational efficiency and
gradient propagation (ReLU is often preferred).

3. Parameter Sharing
 Concept: Techniques like convolution in Convolutional Neural Networks (CNNs) share
parameters across spatial dimensions.
 Principle: Parameter sharing reduces the number of learnable parameters and helps
encode inductive biases like translational invariance.

4. Regularization
 Purpose: Prevent overfitting and improve generalization.
o Techniques include L1/L2 regularization, dropout, and batch normalization.
 Principle: Incorporate regularization to balance training accuracy with test
performance.

5. Optimization Efficiency
 Concept: Deep networks rely on optimization algorithms (e.g., SGD, Adam) to update
weights.
 Principle: Use gradient-based optimization methods with adaptive learning rates for
efficient convergence.

6. Modularity
 Design Philosophy: Architectures are built as modular blocks (e.g., residual blocks in
ResNets, transformers in NLP).
 Principle: Modularity aids in scalability, debugging, and transferability.

7. Skip Connections
 Purpose: Mitigate vanishing gradient problems and improve gradient flow by
allowing direct information paths.
o E.g., Residual Networks (ResNets) use skip connections.
 Principle: Use skip or residual connections in very deep networks.

8. Attention Mechanisms
 Use: Assign different importance to different parts of the input.
o E.g., Self-attention in Transformers.
 Principle: Incorporate attention for tasks requiring a focus on specific data regions
(e.g., NLP, vision).
9. Scalability
 Design: Architectures should scale well with increasing data, layers, and hardware
(e.g., efficient models like MobileNets for low-power devices).
 Principle: Adapt architectures for both large-scale and resource-constrained
environments.

10. Data Augmentation


 Importance: Enhance training by artificially increasing the dataset size and diversity.
 Principle: Use augmentation techniques like cropping, flipping, and rotation to
improve generalization.

11. Early Stopping


 Purpose: Stop training when validation performance stops improving to prevent
overfitting.
 Principle: Monitor validation loss to decide when to halt training.

12. Ensemble Learning


 Concept: Combine predictions from multiple models to enhance robustness and
accuracy.
 Principle: Use ensemble methods like bagging or averaging when feasible.
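Several of these principles can be seen together in a residual (skip-connection) block. A hedged sketch in NumPy (`residual_block`, the weight scale, and the dimensions are our own illustrative choices):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection adds the input back onto the
    transformed signal, giving gradients a direct path through the block."""
    return x + W2 @ relu(W1 @ x)

rng = np.random.default_rng(1)
d = 4
W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))
x = rng.standard_normal(d)

y = residual_block(x, W1, W2)

# With all-zero weights the block reduces exactly to the identity map,
# which is why very deep stacks of such blocks remain trainable.
identity_out = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```

This illustrates modularity (the block can be stacked), nonlinearity (the ReLU inside), and skip connections (the `x +` term) in one place.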

Parameters and layers in deep learning:-

In deep learning, parameters and layers are crucial components that define the architecture and functionality of neural networks. Here's an overview:

1. Parameters in Deep Learning

Parameters are the learnable variables of the model, typically consisting of weights and biases. These are adjusted during training to minimize the loss function and improve the model’s performance.

 Weights: These are the coefficients that represent the strength of the connections between neurons in different layers. They are multiplied by input data to propagate information forward through the network.

 Biases: Bias terms are added to the weighted sum of inputs to adjust the output of the neuron independently of the inputs. They help the network learn patterns that don’t pass through the origin (zero point).

Key properties:
 Parameters are optimized using algorithms like gradient descent during training.
 The number of parameters depends on the architecture of the network and the number of neurons in each layer.

2. Layers in Deep Learning

Layers are the building blocks of neural networks, and each layer performs specific transformations on the input data.

Types of layers:

1. Input Layer: The first layer that receives the input data. It doesn't perform any computation; it simply passes the input to the next layer.

2. Hidden Layers: These layers perform the bulk of the computation and learning in the network. They consist of multiple neurons that apply an activation function to the weighted inputs.
 Fully Connected (Dense) Layer: Each neuron is connected to all neurons in the previous and next layers.
 Convolutional Layer (Conv Layer): Used in Convolutional Neural Networks (CNNs), it applies filters to the input data to capture spatial hierarchies in images.
 Recurrent Layer (RNN Layer): Used in Recurrent Neural Networks (RNNs) for sequential data, where the output of one step is fed into the next.
 Pooling Layer: Often used in CNNs, it reduces the dimensionality of the input by downsampling (e.g., max pooling, average pooling).

3. Output Layer: The final layer that produces the output of the network. In classification tasks, the output layer typically uses a softmax or sigmoid activation function to produce probabilities.

Activation functions:
 ReLU (Rectified Linear Unit): Popular due to its simplicity and effectiveness, it introduces non-linearity by outputting the input if positive and zero otherwise.
 Sigmoid: Used for binary classification, it squashes values between 0 and 1.
 Tanh: Similar to sigmoid but squashes values between -1 and 1.

3. How Parameters and Layers Interact

 Each connection between neurons in different layers has a weight parameter.
 The product of the inputs and weights (plus biases) is passed through the activation function.
 During training, the network adjusts the weights and biases by minimizing the error between predicted and actual outputs using backpropagation.

Example architecture:

In a simple feedforward neural network (e.g., a 3-layer network):
 Input Layer: 3 neurons (for 3 input features).
 Hidden Layer: 4 neurons with ReLU activation.
 Output Layer: 1 neuron with sigmoid activation for binary classification.

The total number of parameters is computed from the number of connections and biases between layers.
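For a 3–4–1 network like the example above, the count can be worked out with a small helper (our own, purely illustrative): each pair of adjacent layers contributes fan_in × fan_out weights plus fan_out biases.

```python
def count_parameters(layer_sizes):
    """Total number of weights and biases in a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weights + biases per layer pair
    return total

# Input (3) -> Hidden (4) -> Output (1):
# (3*4 + 4) + (4*1 + 1) = 16 + 5 = 21 parameters.
n_params = count_parameters([3, 4, 1])
```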

Activation functions in Neural Networks :-


It is recommended to understand Neural Networks before reading this article.

In the process of building a neural network, one of the choices you get to make is what Activation
Function to use in the hidden layer as well as at the output layer of the network. This article
discusses Activation functions in Neural Networks.

What is an Activation Function?

An activation function in the context of neural networks is a mathematical function applied to the
output of a neuron. The purpose of an activation function is to introduce non-linearity into the
model, allowing the network to learn and represent complex patterns in the data. Without non-
linearity, a neural network would essentially behave like a linear regression model, regardless of the
number of layers it has.
The activation function decides whether a neuron should be activated or not: the neuron calculates the weighted sum of its inputs, adds a bias to it, and then applies the activation function to introduce non-linearity into its output.

Explanation: The neurons in a neural network work in correspondence with weights, biases, and their respective activation functions. In a neural network, we update the weights and biases of the neurons on the basis of the error at the output. This process is known as back-propagation. Activation functions make back-propagation possible, since the gradients are supplied along with the error to update the weights and biases.

Elements of a Neural Network

Input Layer: This layer accepts input features. It provides information from the outside world to the network; no computation is performed at this layer, nodes here just pass on the information (features) to the hidden layer.

Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of the
abstraction provided by any neural network. The hidden layer performs all sorts of computation on
the features entered through the input layer and transfers the result to the output layer.

Output Layer: This layer brings the information learned by the network to the outer world.

Why do we need Non-linear activation function?

A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

Mathematical proof

Suppose we have a Neural net like this :-


Elements of the diagram are as follows:

Hidden layer i.e. layer 1:

z(1) = W(1)X + b(1)

a(1) = z(1)

Here,

 z(1) is the vectorized output of layer 1

 W(1) be the vectorized weights assigned to neurons of hidden layer i.e. w1, w2, w3 and w4

 X be the vectorized input features i.e. i1 and i2

 b is the vectorized bias assigned to neurons in hidden layer i.e. b1 and b2

 a(1) is the vectorized form of any linear function.

(Note: We are not considering activation function here)

Layer 2 i.e. output layer :-

Note : Input for layer 2 is output from layer 1

z(2) = W(2)a(1) + b(2)

a(2) = z(2)

Calculation at Output layer

z(2) = (W(2) * [W(1)X + b(1)]) + b(2)

z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]

Let,

[W(2) * W(1)] = W

[W(2)*b(1) + b(2)] = b

Final output : z(2) = W*X + b

which is again a linear function

This shows that the output is again a linear function even after applying a hidden layer. Hence we can conclude that no matter how many hidden layers we attach to the neural net, all layers will behave the same way, because the composition of two linear functions is itself a linear function. A neuron cannot learn with just a linear function attached to it; a non-linear activation function lets it learn according to the difference w.r.t. the error. Hence we need an activation function.
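The collapse of stacked linear layers can be checked numerically. A small sketch (assuming NumPy; the shapes are arbitrary illustrative choices) that builds two linear layers and the single equivalent layer W = W(2)·W(1), b = W(2)·b(1) + b(2):

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)
x = rng.standard_normal(2)

# Two linear layers applied one after the other (no activation function)...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...collapse into one linear layer with combined weights and bias.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b
```

The two results agree exactly, confirming that depth adds nothing without non-linear activations.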

Variants of Activation Function

Linear Function

 Equation : A linear function has an equation similar to that of a straight line, i.e. y = x.

 No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is nothing but a linear function of the input to the first layer.

 Range : -inf to +inf

 Uses : The linear activation function is used in just one place, i.e. the output layer.

 Issues : If we differentiate the linear function, the result no longer depends on the input “x” and becomes a constant, so it won’t introduce any new behavior to our algorithm.

For example : Calculation of price of a house is a regression problem. House price may have any
big/small value, so we can apply linear activation at output layer. Even in this case neural net must
have any non-linear function at hidden layers.

Sigmoid Function

 It is a function which is plotted as ‘S’ shaped graph.

 Equation : A = 1/(1 + e^-x)

 Nature : Non-linear. Notice that for X values between -2 and 2, the curve is very steep. This means small changes in x bring about large changes in the value of Y.

 Value Range : 0 to 1

 Uses : Usually used in output layer of a binary classification, where result is either 0 or 1, as
value for sigmoid function lies between 0 and 1 only so, result can be predicted easily to
be 1 if value is greater than 0.5 and 0 otherwise.
Tanh Function

 The activation that works almost always better than sigmoid function is Tanh function also
known as Tangent Hyperbolic function. It’s actually mathematically shifted version of the
sigmoid function. Both are similar and can be derived from each other.

 Equation :-
f(x) = tanh(x) = 2/(1 + e^-2x) – 1
OR
tanh(x) = 2 * sigmoid(2x) – 1

 Value Range :- -1 to +1

 Nature :- non-linear
 Uses :- Usually used in hidden layers of a neural network as it’s values lies between -1 to
1 hence the mean for the hidden layer comes out be 0 or very close to it, hence helps
in centering the data by bringing mean close to 0. This makes learning for the next layer
much easier.

RELU Function

 It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of neural networks.

 Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.

 Value Range :- [0, inf)

 Nature :- non-linear, which means we can easily backpropagate the errors and have multiple
layers of neurons being activated by the ReLU function.

 Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any one time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.

In simple words, ReLU learns much faster than the sigmoid and tanh functions.

Softmax Function
The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.

 Nature :- non-linear

 Uses :- Usually used when handling multiple classes. The softmax function is commonly found in the output layer of image classification problems. It squeezes the output for each class to between 0 and 1 and also divides by the sum of the outputs.

 Output:- The softmax function is ideally used in the output layer of the classifier where we
are actually trying to attain the probabilities to define the class of each input.

 The basic rule of thumb is: if you really don’t know what activation function to use, then simply use ReLU, as it is a general activation function for hidden layers and is used in most cases these days.

 If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.

 If your output is for multi-class classification, then softmax is very useful for predicting the probabilities of each class.
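The variants above can be written out in a few lines. A hedged NumPy sketch (the function names mirror the text; the last line checks the tanh–sigmoid identity quoted earlier):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # passes positives, zeroes negatives

def softmax(x):
    e = np.exp(x - np.max(x))         # subtract max for numerical stability
    return e / e.sum()                # outputs are positive and sum to 1

z = np.array([-2.0, 0.0, 3.0])

# tanh is a shifted/scaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
tanh_check = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```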

Why Do We Need Loss Functions in Deep Learning?


There are two main mathematical operations happening inside a neural network:

 Forward propagation

 Backpropagation with gradient descent

While forward propagation refers to the computational process of predicting an output for a given
input vector x, backpropagation and gradient descent describe the process of improving the weights
and biases of the network in order to make better predictions. Let’s look at this in practice.

For a given input vector x the neural network predicts an output, which is generally called a
prediction vector y.

(Figure: a feedforward neural network.)

The equations describing the mathematics happening during the prediction vector’s computation look like this:

a(1) = f(W(1)·x)   (forward propagation)

We must compute a dot-product between the input vector x and the weight matrix W(1) that connects the first layer with the second. After that, we apply a non-linear activation function f to the result of the dot-product.

What Are Loss Functions?

 A loss function measures how good a neural network model is at performing a certain task, which in most cases is regression or classification.

 We must minimize the value of the loss function during the backpropagation step in order to
make the neural network better.

 We only use the cross-entropy loss function in classification tasks when we want the neural
network to predict probabilities.

 For regression tasks, when we want the network to predict continuous numbers, we must
use the mean squared error loss function.

 We use mean absolute percentage error loss function during demand forecasting to keep an
eye on the performance of the network during training time.

The prediction vector can represent a number of things depending on the task we want the network
to do. For regression tasks, which are basically predictions of continuous variables (e.g. stock price,
expected demand for products, etc.), the output vector y contains continuous numbers.


On the other hand, for classification tasks, such as customer segmentation or image classification,
the output vector y represents probability scores between 0.0 and 1.0.

The value we want the neural network to predict is called a ground truth label, which is usually
represented as y_hat. A predicted value y closer to the label suggests a better performance of the
neural network.

Regardless of the task, we somehow have to measure how close our predictions are to the ground
truth label.

This is where the concept of a loss function comes into play.


Mathematically, we can measure the difference (or error) between the prediction vector y and the
label y_hat by defining a loss function whose value depends on this difference.

An example of a general loss function is the quadratic loss:

L(θ) = (y(θ) - y_hat)^2

Since the prediction vector y(θ) is a function of the neural network’s weights (which we abbreviate to θ), the loss is also a function of the weights.

Since the loss depends on weights, we must find a certain set of weights for which the value of the
loss function is as small as possible. We achieve this mathematically through a method
called gradient descent.

The value of this loss function depends on the difference between the label y_hat and y. A higher
difference means a higher loss value while (you guessed it) a smaller difference means a smaller loss
value. Minimizing the loss function directly leads to more accurate predictions of the neural network
as the difference between the prediction and the label decreases.

The neural network’s only objective is to minimize the loss function.

In fact, the neural network’s only objective is to minimize the loss function. This is because
minimizing the loss function automatically causes the neural network model to make better
predictions regardless of the exact characteristics of the task at hand.

A neural network solves tasks without being explicitly programmed with a task-specific rule. This is
possible because the goal of minimizing the loss function is universal and doesn’t depend on the task
or circumstances.

3 Key Types of Loss Functions in Neural Networks

That said, you still have to select the right loss function for the task at hand. Luckily there are only
three loss functions you need to know to solve almost any problem.

3 Key Loss Functions

1. Mean Squared Error Loss Function (regression)

2. Cross-Entropy Loss Function (classification)

3. Mean Absolute Percentage Error (demand forecasting)

1. Mean Squared Error Loss Function

The mean squared error (MSE) loss function is the sum of squared differences between the entries in the prediction vector y and the ground truth vector y_hat, divided by N:

MSE = (1/N) * sum_i (y_hat_i - y_i)^2

Here N corresponds to the length of the vectors. If the output y of your neural network is a vector with multiple entries, then N is the number of vector entries, with y_i being one particular entry in the output vector.

The mean squared error loss function is the perfect loss function if you're dealing with a regression
problem. That is, if you want your neural network to predict a continuous scalar value.

An example of a regression problem would be predictions of . . .

 the number of products needed in a supply chain.

 future real estate prices under certain market conditions.

 a stock value.

Here is a code snippet where I've calculated MSE loss in Python:

import numpy as np

# The prediction vector of the neural network
y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]

# The ground truth label
y_hat = [1, 1, 2, 2, 4]

# Mean squared error
MSE = np.sum(np.square(np.subtract(y_hat, y_pred))) / len(y_hat)

print(MSE)  # The result is 0.21606


2. Cross-Entropy Loss Function

Regression is only one of two areas where feedforward networks enjoy great popularity. The other
area is classification.
In classification tasks, we deal with predictions of probabilities, which means the output of a neural
network must be in a range between zero and one. A loss function that can measure the error
between a predicted probability and the label which represents the actual class is called the cross-
entropy loss function.

One important thing we need to discuss before continuing with the cross-entropy is what exactly the
ground truth vector looks like in the case of a classification problem.

One-hot-encoded vector (left) and prediction vector (right).

The label vector y_hat is one-hot encoded, which means the values in this vector can only take discrete values of either zero or one. The entries in this vector represent different classes. The values of these entries are zero, except for a single entry, which is one. This entry tells us the class into which we want to classify the input feature vector x.

The prediction y, however, can take continuous values between zero and one.

Given the prediction vector y and the ground truth vector y_hat, you can compute the cross-entropy loss between those two vectors as follows:

L = - sum_i y_hat_i * log(y_i)

First, we sum up the products between the entries of the label vector y_hat and the logarithms of the entries of the prediction vector y. Then we negate the sum to get a positive value of the loss function.
One interesting thing to consider is the plot of the cross-entropy loss function. In the following graph,
you can see the value of the loss function (y-axis) vs. the predicted probability y_i. Here y_i takes
values between zero and one.

Cross-entropy function depending on prediction value.

We can see clearly that the cross-entropy loss grows without bound as the predicted probability y_i approaches zero. For y_i = 0 the function diverges to infinity, while for y_i = 1 the neural network makes an accurate probability prediction and the loss value goes to zero.

Here’s another code snippet in Python where I’ve calculated the cross-entropy loss function:

import numpy as np

# The probabilities predicted by the neural network
y_pred = [0.1, 0.3, 0.4, 0.2]

# One-hot-encoded ground truth label
y_hat = [0, 1, 0, 0]

cross_entropy = -np.sum(np.log(y_pred) * y_hat)

print(cross_entropy)  # The result is 1.20


3. Mean Absolute Percentage Error

Finally, we come to the Mean Absolute Percentage Error (MAPE) loss function. This loss function
doesn’t get much attention in deep learning. For the most part, we use it to measure the
performance of a neural network during demand forecasting tasks.

First things first: what is demand forecasting?

Demand forecasting is the area of predictive analytics dedicated to predicting the expected demand
for a good or service in the near future. For example:

 In retail, we can use demand forecasting models to determine the amount of a particular
product that should be available and at what price.

 In industrial manufacturing, we can predict how much of each product should be produced,
the amount of stock that should be available at various points in time, and when
maintenance should be performed.

 In the travel and tourism industry, we can use demand forecasting models to assess, in light of available capacity, what prices should be assigned to flights and hotels, which destinations should be spotlighted, and what types of packages should be advertised.

Although demand forecasting is also a regression task and minimizing the MSE loss function is an adequate training goal, MSE isn’t suitable for measuring the model’s performance during training in demand forecasting.

Why is that?

Well, imagine the MSE loss function gives you a value of 100. Can you tell if this is generally a good
result? No, because it depends on the situation. If the prediction y of the model is 1000 and the
actual ground truth label y_hat is 1010, then the MSE loss of 100 would be in fact a very small error
and the performance of the model would be quite good.

However in the case where the prediction would be five and the label is 15, you would have the
same loss value of 100 but the relative deviation to the ground-truth value would be much higher
than in the previous case.

This example shows the shortcoming of the mean squared error function as the loss function for the
demand forecasting models. For this reason, I strongly recommend using mean absolute percentage
error (MAPE).

The mean absolute percentage error, also known as mean absolute percentage deviation (MAPD), usually expresses accuracy as a percentage. We define it with the following equation:

MAPE = (100% / N) * sum_i |y_hat_i - y_i| / |y_hat_i|

In this equation, y_i is the predicted value and y_hat_i is the label. We divide the absolute difference between y_i and y_hat_i by the actual value y_hat_i. Finally, multiplying by 100 percent gives us the percentage error.

Applying this equation to the example above gives you a more meaningful understanding of the model’s performance. In the first case, the deviation from the ground truth label would be only about one percent, while in the second case the deviation would be about 66 percent.

We see that the performance of these two models is very different. Meanwhile, the MSE loss
function would indicate that the performance of both models is the same.
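
The comparison above can be checked numerically with a minimal NumPy sketch of MAPE (the function name and the one-element vectors are illustrative only):

```python
import numpy as np

def mape(y_hat, y_pred):
    # Mean absolute percentage error: average relative deviation, in percent
    y_hat = np.asarray(y_hat, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_hat - y_pred) / y_hat)) * 100.0

# Both cases below have the same MSE of 100, yet very different relative errors
print(mape([1010], [1000]))  # about 1 percent
print(mape([15], [5]))       # about 67 percent
```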

What is an Optimizer?

In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s error or loss function, enhancing performance. Various optimization algorithms, known as optimizers, employ distinct strategies to converge efficiently towards optimal parameter values and improve predictions.

What are Optimizers in Deep Learning?

In deep learning, optimizers are crucial as algorithms that dynamically fine-tune a

model’s parameters throughout the training process, aiming to minimize a

predefined loss function. These specialized algorithms facilitate the learning


process of neural networks by iteratively refining the weights and biases based

on the feedback received from the data. Well-known optimizers in deep learning

encompass Stochastic Gradient Descent (SGD), Adam, and RMSprop, each

equipped with distinct update rules, learning rates, and momentum strategies, all

geared towards the overarching goal of discovering and converging upon optimal

model parameters, thereby enhancing overall performance.

Choosing the Right Optimizer

Optimizers are optimization methods that help improve a deep learning model’s performance. These optimization algorithms widely affect the accuracy and training speed of the deep learning model. But first of all, the question arises: what is an optimizer?

While training a deep learning model, we modify the weights in each epoch to minimize the loss function. An optimizer is a function or an algorithm that adjusts the attributes of the neural network, such as weights and learning rates. Thus, it helps in reducing the overall loss and improving accuracy. Choosing the right weights for the model is a daunting task, as a deep learning model generally consists of millions of parameters. This raises the need to choose a suitable optimization algorithm for your application. Hence, understanding these algorithms is necessary for data scientists before taking a deep dive into the field.

You can use different optimizers in a machine learning model to change the weights and learning rate. However, choosing the best optimizer depends upon the application. As a beginner, one tempting thought is to try all the possibilities and choose the one that shows the best results. This might be fine initially, but when dealing with hundreds of gigabytes of data, even a single epoch can take considerable time. Randomly choosing an algorithm is therefore no less than gambling with your precious time, as you will realize sooner or later in your journey.

This guide will cover various deep-learning optimizers, such as Gradient

Descent, Stochastic Gradient Descent, Stochastic Gradient descent with

momentum, Mini-Batch Gradient Descent, Adagrad, RMSProp, AdaDelta, and

Adam. By the end of the article, you can compare various optimizers and the

procedure they are based upon.

Important Deep Learning Terms

Before proceeding, there are a few terms that you should be familiar with.

 Epoch – The number of times the algorithm runs on the whole training dataset.

 Sample – A single row of a dataset.

 Batch – The number of samples to be used for updating the model parameters.

 Learning rate – A parameter that controls how much the model weights are updated at each step.

 Cost Function/Loss Function – A function used to calculate the cost, which is the difference between the predicted value and the actual value.

 Weights/Bias – The learnable parameters in a model that control the signal between two neurons.

Now let’s explore each optimizer.

Gradient Descent Deep Learning Optimizer

Gradient Descent can be considered the popular kid among the class of optimizers in deep learning. This optimization algorithm uses calculus to consistently modify the parameter values and reach a local minimum. Before moving ahead, you might question what a gradient is.

In simple terms, consider holding a ball resting at the top of a bowl. When you release the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl. The gradient points in the steepest direction, guiding the ball to the local minimum, which is the bottom of the bowl.

w_new = w_old - alpha * dL/dw

The above equation shows how the weights are updated at each step. Here alpha is the step size that represents how far to move against the gradient with each iteration.

Gradient descent works as follows:

1. Initialize Coefficients: Start with initial coefficients.

2. Evaluate Cost: Calculate the cost associated with these coefficients.

3. Search for Lower Cost: Look for a cost value lower than the current one.

4. Update Coefficients: Move towards the lower cost by updating the

coefficients’ values.
5. Repeat Process: Continue this process iteratively.

6. Reach Local Minimum: Stop when a local minimum is reached, where

further cost reduction is not possible.
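
The steps above can be sketched on a toy one-dimensional problem (an illustrative sketch; the objective f(w) = (w - 3)^2 and the step size are arbitrary choices of mine):

```python
# Gradient descent on the toy objective f(w) = (w - 3)^2, whose minimum is at w = 3.
# The derivative is f'(w) = 2 * (w - 3); alpha is the step size from the text.
def gradient_descent(w0, alpha=0.1, iterations=100):
    w = w0
    for _ in range(iterations):
        grad = 2.0 * (w - 3.0)  # evaluate the gradient at the current point
        w = w - alpha * grad    # move against the gradient to lower the cost
    return w

print(gradient_descent(0.0))  # converges to roughly 3.0
```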

Gradient descent works best for most purposes. However, it has some downsides too. It is expensive to calculate the gradients if the dataset is huge. And while gradient descent works well for convex functions, for nonconvex functions it can get stuck, as it doesn’t know how far to travel along the gradient.

Stochastic Gradient Descent Deep Learning Optimizer

At the end of the previous section, you learned why there might be better options

than using gradient descent on massive data. To tackle the challenges large

datasets pose, we have stochastic gradient descent, a popular approach among

optimizers in deep learning. The term stochastic denotes the element of

randomness upon which the algorithm relies. In stochastic gradient descent,

instead of processing the entire dataset during each iteration, we randomly select

batches of data. This implies that only a few samples from the dataset are

considered at a time, allowing for more efficient and computationally feasible

optimization in deep learning models.

The procedure is first to select the initial parameters w and the learning rate η. Then we randomly shuffle the data at each iteration to reach an approximate minimum.

Since we are not using the whole dataset but batches of it at each iteration, the path taken by the algorithm is much noisier than that of the gradient descent algorithm. Thus, SGD needs a higher number of iterations to reach the local minimum, which increases the overall computation time. But even with the increased number of iterations, the computation cost is still less than that of the gradient descent optimizer. So the conclusion is: if the data is enormous and computation time is an essential factor, stochastic gradient descent should be preferred over the batch gradient descent algorithm.
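
As a rough sketch of the idea, here is per-sample SGD fitting a single slope parameter on synthetic data (the data, learning rate, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

# Per-sample SGD fitting y = w * x on noisy synthetic data (true slope: 2.0).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + rng.normal(0.0, 0.01, size=200)

w, lr = 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):          # shuffle the data each epoch
        grad = 2.0 * (w * X[i] - y[i]) * X[i]  # squared-error gradient on one sample
        w -= lr * grad                         # noisy update from a single sample
print(w)  # close to the true slope of 2.0
```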

Stochastic Gradient Descent with Momentum Deep Learning Optimizer

As discussed in the earlier section, stochastic gradient descent takes a much noisier path than the gradient descent algorithm. Due to this, it requires a more significant number of iterations to reach the optimal minimum, and hence computation time is slow. To overcome this problem, we use stochastic gradient descent with momentum.

Momentum helps the loss function converge faster. Stochastic gradient descent oscillates across the direction of the gradient as it updates the weights; adding a fraction of the previous update to the current update dampens these oscillations and makes the process faster. One thing to remember while using this algorithm is that the learning rate should be decreased when using a high momentum term.

In the above image, the left part shows the convergence graph of the stochastic

gradient descent algorithm. At the same time, the right side shows SGD with
momentum. From the image, you can compare the path chosen by both

algorithms and realize that using momentum helps reach convergence in less

time. You might be thinking of using a large momentum and learning rate to

make the process even faster. But remember that while increasing the

momentum, the possibility of passing the optimal minimum also increases. This

might result in poor accuracy and even more oscillations.
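
The momentum update can be sketched on the same kind of toy quadratic (an illustrative sketch; the objective and the lr/beta values are arbitrary choices of mine):

```python
# SGD with momentum on the toy objective f(w) = (w - 3)^2.
# The velocity v carries a fraction (beta) of the previous update forward.
def sgd_momentum(w0, lr=0.05, beta=0.9, iterations=500):
    w, v = w0, 0.0
    for _ in range(iterations):
        grad = 2.0 * (w - 3.0)
        v = beta * v + lr * grad  # fraction of the previous update plus the current step
        w = w - v
    return w

print(sgd_momentum(0.0))  # converges to roughly 3.0
```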

Mini Batch Gradient Descent Deep Learning Optimizer

In this variant of gradient descent, instead of taking all the training data, only a

subset of the dataset is used for calculating the loss function. Since we are using

a batch of data instead of taking the whole dataset, fewer iterations are needed.

That is why the mini-batch gradient descent algorithm is faster than both

stochastic gradient descent and batch gradient descent algorithms. This

algorithm is more efficient and robust than the earlier variants of gradient

descent. As the algorithm uses batching, all the training data need not be loaded

in the memory, thus making the process more efficient to implement. Moreover,

the cost function in mini-batch gradient descent is noisier than the batch gradient

descent algorithm but smoother than that of the stochastic gradient descent

algorithm. Because of this, mini-batch gradient descent is ideal and provides a

good balance between speed and accuracy.

Despite all that, the mini-batch gradient descent algorithm has some downsides too. It introduces a hyperparameter, the mini-batch size, which needs to be tuned to achieve the required accuracy (although a batch size of 32 is considered appropriate for almost every case). Also, in some cases it results in poor final accuracy, which gave rise to the search for other alternatives.
Adagrad (Adaptive Gradient Descent) Deep Learning
Optimizer

The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms. This is because it uses a different learning rate for each iteration. The learning rate changes depending on how much the parameters change during training: the more a parameter gets updated, the smaller its learning rate becomes. This modification is highly beneficial because real-world datasets contain sparse as well as dense features, so it is unfair to use the same learning rate for all the features. The Adagrad algorithm uses the below formula to update the weights:

alpha(t) = n / sqrt(g_1^2 + g_2^2 + ... + g_t^2 + E),  w_t = w_{t-1} - alpha(t) * g_t

Here alpha(t) denotes the learning rate at iteration t, g_i the gradient at iteration i, n a constant, and E a small positive value to avoid division by zero.

The benefit of using Adagrad is that it abolishes the need to modify the learning

rate manually. It is more reliable than gradient descent algorithms and their

variants, and it reaches convergence at a higher speed.

One downside of the AdaGrad optimizer is that it decreases the learning rate

aggressively and monotonically. There might be a point when the learning rate

becomes extremely small. This is because the squared gradients in the

denominator keep accumulating, and thus the denominator part keeps on

increasing. Due to small learning rates, the model eventually becomes unable to

acquire more knowledge, and hence the accuracy of the model is compromised.
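
Adagrad’s accumulating denominator can be sketched on a toy quadratic (an illustrative sketch; the objective and the constant n are arbitrary choices of mine):

```python
import math

# Adagrad on the toy objective f(w) = (w - 3)^2. The squared gradients
# accumulate in `cache`, so the effective step n / sqrt(cache + E)
# shrinks as training goes on.
def adagrad(w0, n=1.0, E=1e-8, iterations=500):
    w, cache = w0, 0.0
    for _ in range(iterations):
        grad = 2.0 * (w - 3.0)
        cache += grad ** 2                    # accumulated squared gradients
        w -= n / math.sqrt(cache + E) * grad  # per-iteration learning rate
    return w

print(adagrad(0.0))  # converges to roughly 3.0
```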
RMS Prop (Root Mean Square) Deep Learning
Optimizer

RMSProp is one of the popular optimizers among deep learning enthusiasts. This may be because it was never formally published but is still very well known in the community. RMSProp is essentially an extension of Rprop (resilient propagation). It resolves the problem of varying gradients: some gradients are small while others may be huge, so defining a single learning rate might not be the best idea. Rprop uses the sign of the gradient, adapting the step size individually for each weight. In this algorithm, the signs of the two most recent gradients are first compared. If they have the same sign, we’re going in the right direction, so we increase the step size by a small fraction. If they have opposite signs, we must decrease the step size. Then we limit the step size and can proceed with the weight update.

The problem with Rprop is that it doesn’t work well with large datasets and when we want to perform mini-batch updates. Achieving the robustness of Rprop and the efficiency of mini-batches simultaneously was the main motivation behind RMSProp. RMSProp is an advancement over the AdaGrad optimizer, as it prevents the monotonically decreasing learning rate.

RMS Prop Formula

The algorithm mainly focuses on accelerating the optimization process by decreasing the number of function evaluations needed to reach the local minimum. It keeps a moving average of squared gradients for every weight and divides the gradient by the square root of this mean square:

E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2

where gamma is the forgetting factor. Weights are then updated by the formula:

w_t = w_{t-1} - (n / sqrt(E[g^2]_t + epsilon)) * g_t
In simpler terms, if there exists a parameter due to which the cost function

oscillates a lot, we want to penalize the update of this parameter. Suppose you

built a model to classify a variety of fishes. The model relies on the factor ‘color’

mainly to differentiate between the fishes. Due to this, it makes a lot of errors.

What RMS Prop does is, penalize the parameter ‘color’ so that it can rely on

other features too. This prevents the algorithm from adapting too quickly to

changes in the parameter ‘color’ compared to other parameters. This algorithm has several benefits compared to earlier versions of gradient descent: it converges quickly and requires less tuning than gradient descent algorithms and their variants.

The problem with RMS Prop is that the learning rate has to be defined manually,

and the suggested value doesn’t work for every application.
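
The leaky average and weight update above can be sketched on a toy quadratic (an illustrative sketch; the objective and the lr/gamma values are arbitrary choices of mine):

```python
import math

# RMSProp on the toy objective f(w) = (w - 3)^2. A leaky (moving) average
# of squared gradients with forgetting factor gamma replaces Adagrad's
# ever-growing sum, so the step size no longer decays monotonically.
def rmsprop(w0, lr=0.01, gamma=0.9, eps=1e-8, iterations=2000):
    w, avg = w0, 0.0
    for _ in range(iterations):
        grad = 2.0 * (w - 3.0)
        avg = gamma * avg + (1.0 - gamma) * grad ** 2  # leaky average of squared gradients
        w -= lr / (math.sqrt(avg) + eps) * grad        # divide by the root mean square
    return w

print(rmsprop(0.0))  # ends up close to 3.0
```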

AdaDelta Deep Learning Optimizer

AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is

based upon adaptive learning and is designed to deal with significant drawbacks

of AdaGrad and RMS prop optimizer. The main problem with the above two

optimizers is that the initial learning rate must be defined manually. One other

problem is the decaying learning rate which becomes infinitesimally small at

some point. Due to this, a certain number of iterations later, the model can no

longer learn new knowledge.


To deal with these problems, AdaDelta uses two state variables: S_t, a leaky average of the squared gradients, and delta_X_t, a leaky average of the squared parameter updates:

S_t = rho * S_{t-1} + (1 - rho) * g_t^2
g'_t = sqrt(delta_X_{t-1} + epsilon) / sqrt(S_t + epsilon) * g_t
delta_X_t = rho * delta_X_{t-1} + (1 - rho) * g'_t^2

Here S_t and delta_X_t denote the state variables, g'_t denotes the rescaled gradient, and epsilon represents a small positive value to handle division by zero.
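
These two leaky averages can be sketched on a toy quadratic (an illustrative sketch; the objective, rho, and iteration count are arbitrary choices of mine):

```python
import math

# AdaDelta on the toy objective f(w) = (w - 3)^2. Two leaky averages --
# s for squared gradients, dx for squared parameter updates -- remove the
# need for a manually chosen learning rate.
def adadelta(w0, rho=0.95, eps=1e-6, iterations=5000):
    w, s, dx = w0, 0.0, 0.0
    for _ in range(iterations):
        grad = 2.0 * (w - 3.0)
        s = rho * s + (1.0 - rho) * grad ** 2                     # leaky avg of squared gradients
        update = math.sqrt(dx + eps) / math.sqrt(s + eps) * grad  # rescaled gradient
        dx = rho * dx + (1.0 - rho) * update ** 2                 # leaky avg of squared updates
        w -= update
    return w

print(adadelta(0.0))  # slowly approaches 3.0
```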

Adam Optimizer in Deep Learning

Adam optimizer, short for Adaptive Moment Estimation optimizer, is an

optimization algorithm commonly used in deep learning. It is an extension of the

stochastic gradient descent (SGD) algorithm and is designed to update the

weights of a neural network during training.

The name “Adam” is derived from “adaptive moment estimation,” highlighting its

ability to adaptively adjust the learning rate for each network weight individually.

Unlike SGD, which maintains a single learning rate throughout training, Adam

optimizer dynamically computes individual learning rates based on the past

gradients and their second moments.

The creators of Adam optimizer incorporated the beneficial features of other

optimization algorithms such as AdaGrad and RMSProp. Similar to RMSProp,

Adam optimizer considers the second moment of the gradients, but unlike

RMSProp, it calculates the uncentered variance of the gradients (without

subtracting the mean).


By incorporating both the first moment (mean) and second moment (uncentered

variance) of the gradients, Adam optimizer achieves an adaptive learning rate

that can efficiently navigate the optimization landscape during training. This

adaptivity helps in faster convergence and improved performance of the neural

network.

In summary, Adam optimizer is an optimization algorithm that extends SGD by

dynamically adjusting learning rates based on individual weights. It combines the

features of AdaGrad and RMSProp to provide efficient and adaptive updates to

the network weights during deep learning training.

Adam Optimizer Formula

The adam optimizer has several benefits, due to which it is used widely. It is

adapted as a benchmark for deep learning papers and recommended as a

default optimization algorithm. Moreover, the algorithm is straightforward to

implement, has a faster running time, low memory requirements, and requires

less tuning than any other optimization algorithm.

m_t = B1 * m_{t-1} + (1 - B1) * g_t
v_t = B2 * v_{t-1} + (1 - B2) * g_t^2
m_hat_t = m_t / (1 - B1^t),  v_hat_t = v_t / (1 - B2^t)
w_t = w_{t-1} - n * m_hat_t / (sqrt(v_hat_t) + epsilon)

The above formulas represent the working of the Adam optimizer. Here B1 and B2 represent the decay rates of the moving averages of the gradients and the squared gradients.

If the adam optimizer uses the good properties of all the algorithms and is the

best available optimizer, then why shouldn’t you use Adam in every application?

And what was the need to learn about other algorithms in depth? This is because

even Adam has some downsides. It tends to focus on faster computation time, whereas algorithms like stochastic gradient descent focus on the data points. That’s why algorithms like SGD often generalize the data in a better manner, at the cost of slower computation. So, the optimization algorithm can be picked according to the requirements and the type of data.
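
The first- and second-moment updates with bias correction can be sketched on a toy quadratic (an illustrative sketch; the objective and hyperparameter values are arbitrary choices of mine):

```python
import math

# Adam on the toy objective f(w) = (w - 3)^2, combining a first-moment
# estimate m (mean of gradients) and a second-moment estimate v (uncentered
# variance), each with bias correction. beta1 and beta2 are the decay rates.
def adam(w0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, iterations=2000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, iterations + 1):
        grad = 2.0 * (w - 3.0)
        m = beta1 * m + (1.0 - beta1) * grad        # first moment (mean)
        v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment (uncentered variance)
        m_hat = m / (1.0 - beta1 ** t)              # bias-corrected estimates
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(adam(0.0))  # converges close to 3.0
```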


The above visualizations create a better picture in mind and help in comparing

the results of various optimization algorithms.

Hands-on Optimizers

We have learned enough theory, and now we need to do some practical

analysis. It’s time to try what we have learned and compare the results by

choosing different optimizers on a simple neural network. As we are talking about

keeping things simple, what’s better than the MNIST dataset? We will train a

simple model using some basic layers, keeping the batch size and epochs the

same but with different optimizers. For the sake of fairness, we will use the

default values with each optimizer.


The steps for building the network are given below:

Import Necessary Libraries


import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)

Load the Dataset


x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

Build the Model


batch_size = 64
num_classes = 10
epochs = 10

def build_model(optimizer):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=optimizer, metrics=['accuracy'])
    return model
Train the Model
optimizers = ['Adadelta', 'Adagrad', 'Adam', 'RMSprop', 'SGD']

for i in optimizers:
    model = build_model(i)
    hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(x_test, y_test))

We have run our model with a batch size of 64 for 10 epochs. After trying the

different optimizers, the results we get are pretty interesting. Before analyzing

the results, what do you think will be the best optimizer for this dataset?

Table Analysis

Optimizer          | Epoch 1 (val acc / val loss) | Epoch 5 (val acc / val loss) | Epoch 10 (val acc / val loss) | Total Time
Adadelta           | .4612 / 2.2474               | .7776 / 1.6943               | .8375 / 0.9026                | 8:02 min
Adagrad            | .8411 / .7804                | .9133 / .3194                | .9286 / .2519                 | 7:33 min
Adam               | .9772 / .0701                | .9884 / .0344                | .9908 / .0297                 | 7:20 min
RMSprop            | .9783 / .0712                | .9846 / .0484                | .9857 / .0501                 | 10:01 min
SGD with momentum  | .9168 / .2929                | .9585 / .1421                | .9697 / .1008                 | 7:04 min
SGD                | .9124 / .3157                | .9569 / .1451                | .9693 / .1040                 | 6:42 min

The above table shows the validation accuracy and loss at different epochs. It

also contains the total time that the model took to run on 10 epochs for each

optimizer. From the above table, we can make the following analysis.

 The adam optimizer shows the best accuracy in a satisfactory amount of

time.

 RMSprop shows similar accuracy to that of Adam but with a comparatively

much larger computation time.


 Surprisingly, the SGD algorithm took the least time to train and produced

good results as well. But to reach the accuracy of the Adam optimizer,

SGD will require more iterations, and hence the computation time will

increase.

 SGD with momentum shows similar accuracy to SGD with unexpectedly

larger computation time. This means the value of momentum taken needs

to be optimized.

 Adadelta shows poor results both with accuracy and computation time.

You can analyze the accuracy of each optimizer with each epoch from the below

graph.

We’ve now reached the end of this comprehensive guide. To refresh your memory, we will go through a summary of every optimization algorithm that we have covered in this guide.

Summary

SGD is a very basic algorithm and is hardly used in applications now due to its

slow computation speed. One more problem with that algorithm is the constant
learning rate for every epoch. Moreover, it is not able to handle saddle points

very well. Adagrad works better than stochastic gradient descent generally due

to frequent updates in the learning rate. It is best when used for dealing with

sparse data. RMSProp shows similar results to that of the gradient descent

algorithm with momentum, it just differs in the way by which the gradients are

calculated.

Lastly comes the Adam optimizer that inherits the good features of RMSProp and

other algorithms. The Adam optimizer generally produces better results than the other optimization algorithms, computes faster, and requires fewer parameters for tuning. Because of all that, Adam is recommended as the default optimizer for most applications. Choosing the Adam optimizer for your application gives you a strong chance of getting good results.

That said, even the Adam optimizer has some downsides, and there are cases when algorithms like SGD might be beneficial and perform better than Adam. So, it is of utmost importance to know your requirements and the type of data you are dealing with in order to choose the best optimization algorithm and achieve outstanding results.
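The update rules summarized above can be sketched in a few lines of NumPy. The snippet below is an illustrative toy, not the experiment from the table: it minimizes a one-parameter quadratic, and the learning rates and step counts are arbitrary choices picked so that each rule visibly converges.

```python
import numpy as np

# Gradient of the toy objective f(w) = (w - 3)**2, minimized at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

def sgd(w, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def sgd_momentum(w, lr=0.1, beta=0.9, steps=300):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)      # velocity accumulates past gradients
        w -= lr * v
    return w

def adagrad(w, lr=0.5, eps=1e-8, steps=300):
    s = 0.0
    for _ in range(steps):
        g = grad(w)
        s += g * g                  # accumulated squared gradients (a sum)
        w -= lr * g / (np.sqrt(s) + eps)
    return w

def rmsprop(w, lr=0.05, beta=0.9, eps=1e-8, steps=500):
    s = 0.0
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g * g   # decaying average, not a sum
        w -= lr * g / (np.sqrt(s) + eps)
    return w

def adam(w, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g           # first moment (momentum-like)
        v = b2 * v + (1 - b2) * g * g       # second moment (RMSProp-like)
        # bias-corrected moments drive the update
        w -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return w

for opt in (sgd, sgd_momentum, adagrad, rmsprop, adam):
    print(opt.__name__, round(opt(0.0), 2))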

Conclusion

This article examined how optimization algorithms play a crucial role in

shaping the performance of a deep learning model, influencing factors such as

accuracy, speed, and efficiency. Throughout the discussion, we delved into

various optimizers in deep learning, allowing for a comprehensive comparison

among them. By understanding the strengths and weaknesses of each algorithm,


readers gained valuable insights into the optimal scenarios for deploying specific

optimizers and the potential drawbacks associated with their use.

Key Takeaways

 Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient

Descent, Adagrad, RMSProp, AdaDelta, and Adam are all popular deep-learning optimizers.

 Each optimizer has its own strengths and weaknesses, and the choice of

optimizer will depend on the specific deep-learning task and the

characteristics of the data being used.

 The choice of optimizer can significantly impact the speed and quality of

convergence during training, as well as the final performance of the deep

learning model.

Hyperparameters in Machine Learning:-

Hyperparameters in Machine learning are those parameters that are

explicitly defined by the user to control the learning process. These

hyperparameters are used to improve the learning of the model, and their values

are set before starting the learning process of the model.

In this topic, we are going to discuss one of the most important concepts of

machine learning, i.e., Hyperparameters, their examples, hyperparameter tuning,

categories of hyperparameters, how hyperparameter is different from parameter


in Machine Learning. But before starting, let's first understand the

Hyperparameter.

What are hyperparameters?

In Machine Learning/Deep Learning, a model is represented by its parameters. In

contrast, a training process involves selecting the best/optimal hyperparameters

that are used by learning algorithms to provide the best result. So, what are

these hyperparameters? The answer is, "Hyperparameters are defined as the

parameters that are explicitly defined by the user to control the learning

process."

Here the prefix "hyper" suggests that the parameters are top-level parameters

that are used in controlling the learning process. The value of the

Hyperparameter is selected and set by the machine learning engineer before the

learning algorithm begins training the model. Hence, these are external to the

model, and their values cannot be changed during the training process.

Some examples of Hyperparameters in Machine Learning

o The k in kNN or K-Nearest Neighbour algorithm

o Learning rate for training a neural network

o Train-test split ratio

o Batch Size

o Number of Epochs

o Branches in Decision Tree


o Number of clusters in Clustering Algorithm

Difference between Parameter and Hyperparameter

There is always a big confusion between Parameters and hyperparameters or

model hyperparameters. So, in order to clear this confusion, let's understand the

difference between both of them and how they are related to each other.

Model Parameters:

Model parameters are configuration variables that are internal to the model, and

a model learns them on its own. Examples include the weights or coefficients of independent variables in a linear regression model or an SVM, the weights and biases of a neural network, and the cluster centroids in clustering. Some key points for model

parameters are as follows:

o They are used by the model for making predictions.

o They are learned by the model from the data itself

o These are usually not set manually.

o These are the part of the model and key to a machine learning Algorithm.

Model Hyperparameters:

Hyperparameters are those parameters that are explicitly defined by the user to

control the learning process. Some key points for model hyperparameters are as follows:
o These are usually defined manually by the machine learning engineer.

o One cannot know the exact best value for hyperparameters for the given

problem. The best value can be determined either by the rule of thumb or

by trial and error.

o Some examples of Hyperparameters are the learning rate for training a neural network, K in the KNN algorithm, and the batch size.

Categories of Hyperparameters

Broadly hyperparameters can be divided into two categories, which are given

below:

1. Hyperparameter for Optimization

2. Hyperparameter for Specific Models

Hyperparameter for Optimization

The process of selecting the best hyperparameters to use is known as

hyperparameter tuning, and the tuning process is also known as hyperparameter

optimization. Optimization parameters are used for optimizing the model.


Some of the popular optimization parameters are given below:

o Learning Rate: The learning rate is the hyperparameter in optimization

algorithms that controls how much the model needs to change in response

to the estimated error for each time when the model's weights are

updated. It is one of the crucial parameters while building a neural

network, and also it determines the frequency of cross-checking with

model parameters. Selecting an optimal learning rate is a challenging task: if the learning rate is too small, training may become very slow, while if it is too large, the optimization may overshoot the minimum and fail to converge properly.


Note: Learning rate is a crucial hyperparameter for optimizing the model, so if

there is a requirement of tuning only a single hyperparameter, it is suggested to

tune the learning rate.
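The trade-off described in the note can be seen on a toy objective. The three learning rates below are arbitrary illustrative choices, not recommendations:

```python
# Gradient descent on f(w) = w**2 with three learning rates, to show the
# "too small is slow, too large diverges" trade-off described above.
def descend(lr, steps=50):
    w = 5.0
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w**2 is 2w
    return abs(w)         # distance from the minimum at w = 0

small, good, large = descend(0.001), descend(0.1), descend(1.1)
print(small, good, large)
```

With lr = 0.001 the parameter barely moves in 50 steps, with lr = 0.1 it converges, and with lr = 1.1 it diverges.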

o Batch Size: To enhance the speed of the learning process, the training set is divided into different subsets, which are known as batches.

o Number of Epochs: An epoch is one complete pass through the entire training set. The appropriate number of epochs varies from model to model. To determine the right number, the validation error is taken into account: the number of epochs is increased as long as the validation error keeps decreasing. If the validation error shows no improvement over several consecutive epochs, that is the signal to stop increasing the number of epochs.
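The batch/epoch loop with validation-based stopping can be sketched as below. The data (a noisy one-parameter linear fit), the learning rate, the batch size, and the patience value are all hypothetical illustrations, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data: y = 2x + noise, split train/validation.
X = rng.normal(size=200)
y = 2.0 * X + 0.1 * rng.normal(size=200)
X_train, y_train = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]

w, lr, batch_size, patience = 0.0, 0.05, 32, 3
best_val, bad_epochs = float("inf"), 0

for epoch in range(100):                                   # cap on epochs
    idx = rng.permutation(len(X_train))
    for start in range(0, len(X_train), batch_size):       # one batch per step
        b = idx[start:start + batch_size]
        g = np.mean(2 * (w * X_train[b] - y_train[b]) * X_train[b])
        w -= lr * g
    val_err = np.mean((w * X_val - y_val) ** 2)            # validation error
    if val_err < best_val - 1e-6:
        best_val, bad_epochs = val_err, 0
    else:
        bad_epochs += 1                                    # no improvement
    if bad_epochs >= patience:                             # stop early
        break

print(round(w, 2))
```

The fitted slope lands near the true value of 2, and training stops once the validation error plateaus rather than running all 100 epochs.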

Hyperparameter for Specific Models

Hyperparameters that are involved in the structure of the model are known as

hyperparameters for specific models. These are given below:

o A number of Hidden Units: Hidden units are part of neural networks,

which refer to the components comprising the layers of processors

between input and output units in a neural network.

It is important to specify the number of hidden units hyperparameter for the neural network. A common rule of thumb is that it should lie between the size of the input layer and the size of the output layer; more specifically, roughly 2/3 of the size of the input layer plus the size of the output layer. For complex functions it may be necessary to use more hidden units, but too many can cause the model to overfit.

o Number of Layers: A neural network is made up of vertically arranged

components, which are called layers. There are mainly input layers,

hidden layers, and output layers. A 3-layered neural network often gives better performance than a 2-layered network. For a Convolutional Neural Network, a greater number of layers often makes a better model.
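The 2/3 rule of thumb above can be written as a tiny helper. It is only a heuristic starting point, and the layer sizes in the example are illustrative:

```python
# Heuristic from the text: hidden units ≈ 2/3 of the input layer size
# plus the output layer size. A starting point, not a guarantee.
def suggested_hidden_units(n_inputs, n_outputs):
    return round(2 / 3 * n_inputs + n_outputs)

print(suggested_hidden_units(784, 10))   # e.g. an MNIST-sized layer → 533
```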

Conclusion

Hyperparameters are the parameters that are explicitly defined to control the

learning process before applying a machine-learning algorithm to a dataset.

These are used to specify the learning capacity and complexity of the model.

Some of the hyperparameters are used for the optimization of the models, such

as Batch size, learning rate, etc., and some are specific to the models, such as

Number of Hidden layers, etc.

Frameworks to Deploy Deep Learning Networks:-

There are several popular frameworks to deploy deep learning networks, ranging

from end-to-end solutions that allow both model building and deployment to

specialized tools for deployment optimization. Here are the major frameworks:

1. TensorFlow Serving

 Description: TensorFlow Serving is a flexible, high-performance serving

system for machine learning models designed for production

environments.
 Key Features:

o Can serve multiple versions of a model simultaneously.

o Provides out-of-the-box integration with TensorFlow models, though

it also supports other models.

o Optimized for low-latency inference and can handle large-scale

deployments.

o Supports gRPC and REST APIs for serving models.

 Use Case: Web services, real-time prediction systems, scalable

deployment of TensorFlow models.

2. TensorFlow Lite

 Description: TensorFlow Lite is designed for deploying deep learning

models on mobile and embedded devices.

 Key Features:

o Optimizes models for performance and reduced size for use on

mobile phones, IoT devices, and other edge devices.

o Supports both Android and iOS.

o Provides acceleration through hardware like GPUs or dedicated AI

chips.

 Use Case: Mobile apps, edge computing, IoT applications.

3. ONNX (Open Neural Network Exchange)


 Description: ONNX is an open-source format for deep learning models,

allowing models to be used across various frameworks.

 Key Features:

o Supports conversion from many popular frameworks like

TensorFlow, PyTorch, and Keras to ONNX format.

o Once converted to ONNX, models can be deployed using ONNX

Runtime, which is optimized for high-performance inference.

o Cross-framework compatibility allows flexibility in choosing training

and deployment platforms.

 Use Case: Cross-framework deployment, model optimization for inference

speed, platform-agnostic deployment.

4. TorchServe

 Description: TorchServe is a flexible and easy-to-use tool for deploying

PyTorch models in production.

 Key Features:

o Provides APIs for model management (loading, scaling, updating).

o Offers metrics for monitoring models and has native support for

multiple ML/DL model formats.

o Scalable and optimized for batch inference.

 Use Case: Deploying PyTorch models, scaling PyTorch in production

environments.
5. Apache MXNet

 Description: MXNet is a deep learning framework known for its efficiency,

flexibility, and scalability.

 Key Features:

o Supports model deployment across different platforms like cloud,

edge devices, and mobile.

o Allows dynamic network definition and efficient resource usage.

o Offers built-in support for deployment optimization techniques such

as model quantization.

 Use Case: High-performance inference, deployment across a wide variety

of devices, dynamic model updates.

6. NVIDIA TensorRT

 Description: TensorRT is a high-performance deep learning inference

library from NVIDIA.

 Key Features:

o Optimizes deep learning models for NVIDIA GPUs, enabling faster

inference with lower latency.

o Supports model compression, quantization, and precision

calibration (e.g., FP16 and INT8).

o Often used to convert models trained in TensorFlow, PyTorch, or

ONNX to an optimized format for NVIDIA hardware.


 Use Case: Optimized GPU-based deployment for inference in applications

like autonomous driving, robotics, or any real-time systems.

7. AWS SageMaker

 Description: Amazon SageMaker is a fully managed service that allows

you to build, train, and deploy machine learning models.

 Key Features:

o Provides tools for deploying models in production with auto-scaling,

monitoring, and maintenance.

o Supports most popular frameworks like TensorFlow, PyTorch, and

MXNet, as well as custom containers.

o Includes built-in algorithms for optimization and handles deployment

across distributed cloud environments.

 Use Case: Cloud-based deployment with seamless integration into AWS

infrastructure, handling large-scale production ML models.

8. Microsoft Azure Machine Learning

 Description: A cloud-based machine learning platform that provides tools

for developing, training, and deploying deep learning models.

 Key Features:

o Supports multiple frameworks, including TensorFlow, PyTorch, and

ONNX.
o Provides tools for automating model management, version control,

and deployment.

o Offers autoscaling and accelerated inference for production

deployment.

 Use Case: Large-scale cloud-based machine learning solutions in

Microsoft Azure.

9. Google AI Platform (Vertex AI)

 Description: Google’s Vertex AI is a unified platform that brings together

everything needed to build, deploy, and scale machine learning models.

 Key Features:

o Provides seamless integration with TensorFlow, Keras, Scikit-learn,

and XGBoost models.

o Handles training, tuning, and deployment with automated workflows

and monitoring.

o Supports distributed model deployment in cloud environments and

offers managed services like auto-scaling and model monitoring.

 Use Case: Full-stack machine learning lifecycle management, cloud-

based model deployment.

Building Blocks
Building Blocks of Deep Neural Networks
Deep neural networks (DNNs) are composed of several key components, each
serving a specific function in learning, processing, and producing output. Here’s a
breakdown of the essential building blocks:
1. Neurons (Artificial Neurons)
 Purpose: The basic units of a neural network that take input, process it, and
pass an output.
 Components:
o Inputs: Features or data points.
o Weights (w): Coefficients that adjust the importance of each input.
o Bias (b): A trainable parameter that helps the model fit data better
by shifting the output.
o Activation Function (f(x)): Determines the output of a neuron
(e.g., ReLU, Sigmoid).
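The components above combine into a single artificial neuron: output = f(w · x + b). A minimal NumPy sketch, with ReLU as the activation and illustrative input, weight, and bias values:

```python
import numpy as np

# A single artificial neuron: weighted sum of inputs plus bias,
# passed through an activation function (here ReLU).
def neuron(x, w, b):
    z = np.dot(w, x) + b          # weighted sum plus bias
    return max(0.0, z)            # ReLU activation

x = np.array([1.0, 2.0, 3.0])     # inputs
w = np.array([0.5, -0.25, 0.1])   # weights
b = 0.05                          # bias
print(neuron(x, w, b))            # ≈ 0.35
```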

2. Layers
 Input Layer:
o Purpose: Receives the input data.
o The number of neurons depends on the dimensionality of the input
data (e.g., pixels in an image).
 Hidden Layers:
o Purpose: Perform computations and learn complex patterns in the
data.
o Consists of multiple neurons that transform inputs using weights,
biases, and activation functions.
o The depth of the network refers to the number of hidden layers.
 Output Layer:
o Purpose: Produces the final output of the network (e.g., class
probabilities in classification).
o The number of neurons depends on the output format (e.g., binary
classification has one neuron with a Sigmoid function).

3. Weights and Biases


 Weights (w): Determine the strength of the connection between input and
output neurons.
 Bias (b): Allows the network to adjust outputs independently of the input
data.
4. Activation Functions
 Transform the input signal and add non-linearity to the network.
Common Activation Functions:
 ReLU (Rectified Linear Unit): f(x) = max(0, x)
 Sigmoid: f(x) = 1 / (1 + e^(-x)) (used in binary classification).
 Tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
 Softmax: Often used in the output layer for multiclass classification.
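The four activation functions listed above can be written directly in NumPy:

```python
import numpy as np

# The common activation functions from the list above.
def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract max for numerical stability
    return e / e.sum()            # outputs form a probability distribution

x = np.array([-1.0, 0.0, 2.0])
print(relu(x), sigmoid(0.0), softmax(x).sum())
```

Note that softmax outputs always sum to 1, which is why it suits multiclass classification.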

5. Loss Function
 Purpose: Measures the difference between the predicted output and the
actual target.
 Common Loss Functions:
o Mean Squared Error (MSE) for regression problems.
o Cross-Entropy Loss for classification problems.
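Both loss functions are short enough to write out. The target and prediction vectors below are illustrative:

```python
import numpy as np

# Mean Squared Error for regression.
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Cross-entropy for classification: y_true is one-hot, y_pred holds
# predicted probabilities; eps guards against log(0).
def cross_entropy(y_true, y_pred, eps=1e-12):
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.1])
print(mse(y_true, y_pred), cross_entropy(y_true, y_pred))
```

Here the cross-entropy reduces to -log(0.8), the negative log-probability assigned to the correct class.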

6. Optimization Algorithms
 Purpose: Adjust the weights and biases to minimize the loss function.
Common Optimizers:
 Stochastic Gradient Descent (SGD)
 Adam (Adaptive Moment Estimation)
 RMSprop

7. Forward Propagation
 The process by which input data passes through the network, layer by layer,
to produce an output prediction.
 Involves multiplying inputs by weights, adding bias, and applying the
activation function.

8. Backward Propagation (Backpropagation)


 Purpose: Updates the network's weights and biases to minimize the loss.
 Uses the chain rule of calculus to calculate the gradient of the loss with
respect to each parameter (weights and biases).
 Gradients are then adjusted according to an optimizer (e.g., using stochastic
gradient descent).
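Forward and backward propagation can be sketched together on a tiny two-layer network. The network shape, data, and target below are arbitrary; the numerical difference at the end simply checks that the chain-rule gradients are correct:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: x -> hidden layer (tanh) -> scalar output, MSE loss.
x = rng.normal(size=3)                    # one input example
t = 1.0                                   # its target value
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0

def loss(W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)              # forward: hidden activations
    y = W2 @ h + b2                       # forward: output
    return (y - t) ** 2                   # MSE loss

# Backward pass via the chain rule.
h = np.tanh(W1 @ x + b1)
y = W2 @ h + b2
dy = 2 * (y - t)                          # dL/dy
dW2 = dy * h                              # dL/dW2
dh = dy * W2                              # dL/dh
dz = dh * (1 - h ** 2)                    # through the tanh derivative
dW1 = np.outer(dz, x)                     # dL/dW1

# Numerical check of one gradient entry confirms the chain-rule math.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (loss(W1p, b1, W2, b2) - loss(W1, b1, W2, b2)) / eps
print(abs(num - dW1[0, 0]) < 1e-3)
```

An optimizer would then subtract a learning-rate-scaled multiple of dW1 and dW2 from the weights, which is exactly the update step described above.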

9. Regularization Techniques
 Prevent overfitting by reducing model complexity.
Common Regularization Methods:
 L2 Regularization (Ridge Regression): Adds the squared magnitude of the
weights to the loss function.
 L1 Regularization (Lasso Regression): Adds the absolute magnitude of the
weights to the loss function.
 Dropout: Randomly disables neurons during training to improve
generalization.
 Batch Normalization: Normalizes input across a batch to stabilize learning
and speed up convergence.
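Two of these regularizers, the L2 penalty and dropout, can be sketched in NumPy. The weight vector, loss value, and rates below are placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# L2 regularization: add the squared magnitude of the weights to the loss.
w = rng.normal(size=100)
data_loss = 0.42                                # placeholder loss value
lam = 0.01                                      # regularization strength
total_loss = data_loss + lam * np.sum(w ** 2)

# Dropout: randomly zero activations during training; scaling by 1/(1-p)
# keeps the expected activation unchanged ("inverted dropout").
def dropout(a, p=0.5):
    mask = rng.random(a.shape) > p              # keep with probability 1-p
    return a * mask / (1 - p)

a = np.ones(1000)
a_drop = dropout(a, p=0.5)
print(total_loss > data_loss, round(a_drop.mean(), 2))
```

Roughly half of the activations are zeroed, but the mean activation stays near 1 because of the rescaling.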

10. Training and Evaluation


 Training: The process where the network learns by adjusting weights and
biases based on input data and the loss.
 Validation: Used to fine-tune hyperparameters and prevent overfitting.
 Testing: Ensures the model's ability to generalize to unseen data.

Summary
 Neurons: Basic computational units.
 Layers (Input, Hidden, Output): Organize the data flow through the network.
 Activation Functions: Introduce non-linearity.
 Loss Functions: Measure prediction accuracy.
 Optimization Algorithms: Minimize the loss function.
 Regularization Techniques: Prevent overfitting.
 Forward and Backward Propagation: Train and refine the network.
Restricted Boltzmann Machine:-
Introduction :
Restricted Boltzmann Machine (RBM) is a type of artificial neural network
that is used for unsupervised learning. It is a type of generative model
that is capable of learning a probability distribution over a set of input
data.
RBM was introduced in the mid-2000s by Hinton and Salakhutdinov as a
way to address the problem of unsupervised learning. It is a type of neural
network that consists of two layers of neurons – a visible layer and a
hidden layer. The visible layer represents the input data, while the hidden
layer represents a set of features that are learned by the network.
The RBM is called “restricted” because the connections between the
neurons in the same layer are not allowed. In other words, each neuron in
the visible layer is only connected to neurons in the hidden layer, and vice
versa. This allows the RBM to learn a compressed representation of the
input data by reducing the dimensionality of the input.
The RBM is trained using a process called contrastive divergence, which is
a variant of the stochastic gradient descent algorithm. During training, the
network adjusts the weights of the connections between the neurons in
order to maximize the likelihood of the training data. Once the RBM is
trained, it can be used to generate new samples from the learned
probability distribution.
RBM has found applications in a wide range of fields, including computer
vision, natural language processing, and speech recognition. It has also
been used in combination with other neural network architectures, such
as deep belief networks and deep neural networks, to improve their
performance.
What are Boltzmann Machines?
It is a network of neurons in which all the neurons are connected to each
other. In this machine, there are two layers named visible layer or input
layer and hidden layer. The visible layer is denoted as v and the hidden
layer is denoted as h. In a Boltzmann machine, there is no output layer.
Boltzmann machines are random and generative neural networks capable
of learning internal representations and are able to represent and (given
enough time) solve tough combinatoric problems.
The Boltzmann distribution (also known as the Gibbs distribution) is an integral part of statistical mechanics and explains the impact of parameters like entropy and temperature on quantum states in thermodynamics. Because the network is built on this distribution, Boltzmann machines are also known as Energy-Based Models (EBM). The Boltzmann machine was invented in 1985 by Geoffrey Hinton, then a Professor at Carnegie Mellon University, and Terry Sejnowski, then a Professor at Johns Hopkins University.

What are Restricted Boltzmann Machines (RBM)?


The term "restricted" refers to the fact that neurons within the same layer are not allowed to connect to each other. In other words, two neurons of the input layer, or two neurons of the hidden layer, cannot connect to each other, although the hidden layer and the visible layer can be connected to each other.
As this machine has no output layer, the question arises: how are we going to identify and adjust the weights, and how do we measure whether our prediction is accurate or not? All these questions are answered by the way the Restricted Boltzmann Machine is trained.
The RBM algorithm was proposed by Geoffrey Hinton (2007) and learns a probability distribution over its training data inputs. It has seen
learning such as feature learning, dimensionality reduction, classification,
collaborative filtering, and topic modeling.
Consider the example movie rating discussed in the recommender system
section.
Movies like Avengers, Avatar, and Interstellar have strong associations
with the latest fantasy and science fiction factor. Based on the user rating
RBM will discover latent factors that can explain the activation of movie
choices. In short, RBM describes variability among correlated variables of
input dataset in terms of a potentially lower number of unobserved
variables.
The energy function is given by
E(v, h) = - Σi ai vi - Σj bj hj - Σi,j vi wij hj
where ai and bj are the biases of the visible and hidden units, and wij is the weight of the connection between visible unit i and hidden unit j. Configurations with lower energy are assigned higher probability.
Applications of Restricted Boltzmann Machine
Restricted Boltzmann Machines (RBMs) have found numerous applications
in various fields, some of which are:
 Collaborative filtering: RBMs are widely used in collaborative
filtering for recommender systems. They learn to predict user
preferences based on their past behavior and recommend items
that are likely to be of interest to the user.
 Image and video processing: RBMs can be used for image and
video processing tasks such as object recognition, image denoising,
and image reconstruction. They can also be used for tasks such as
video segmentation and tracking.
 Natural language processing: RBMs can be used for natural
language processing tasks such as language modeling, text
classification, and sentiment analysis. They can also be used for
tasks such as speech recognition and speech synthesis.
 Bioinformatics: RBMs have found applications in bioinformatics for
tasks such as protein structure prediction, gene expression analysis,
and drug discovery.
 Financial modeling: RBMs can be used for financial modeling
tasks such as predicting stock prices, risk analysis, and portfolio
optimization.
 Anomaly detection: RBMs can be used for anomaly detection
tasks such as fraud detection in financial transactions, network
intrusion detection, and medical diagnosis.
 It is used in Filtering.
 It is used in Feature Learning.
 It is used in Classification.
 It is used in Risk Detection.
 It is used in Business and Economic analysis.

How do Restricted Boltzmann Machines work?


In RBM there are two phases through which the entire RBM works:
1st Phase: In this phase, we take the input layer and, using the weights and biases, activate the hidden layer. This process is called the Feed Forward Pass. In the Feed Forward Pass we identify the positive and negative associations.
Feed Forward Equation:
 Positive Association — When the association between the visible
unit and the hidden unit is positive.
 Negative Association — When the association between the visible
unit and the hidden unit is negative.
2nd Phase: Since there is no output layer, instead of calculating an output we reconstruct the input layer through the activated hidden state. This process is called the Feed Backward Pass. We are just
backtracking the input layer through the activated hidden neurons. After
performing this we have reconstructed Input through the activated hidden
state. So, we can calculate the error and adjust weight in this way:
Feed Backward Equation:
 Error = Reconstructed Input Layer - Actual Input Layer
 Adjusted Weight = Input × Error × Learning rate (e.g., 0.1)
After doing all the steps we get the pattern that is responsible to activate
the hidden neurons. To understand how it works:
Let us consider an example. Suppose visible unit V1 activates hidden units h1 and h2, and visible unit V2 activates hidden units h2 and h3. Now, when a new visible unit V5 comes into the machine and also activates h1 and h2, we can easily backtrace the hidden units and identify that the characteristics of the new V5 unit match those of V1, because V1 activated the same hidden units earlier.
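The two phases above correspond to one step of contrastive divergence (CD-1). The sketch below is a minimal illustration for a binary RBM: the layer sizes, learning rate, and random data are arbitrary, and the bias terms are omitted for brevity (a full implementation updates them too, and repeats this step over many training vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.normal(size=(n_visible, n_hidden))       # weight matrix
v0 = rng.integers(0, 2, size=n_visible).astype(float)   # one training vector

# 1st phase (feed forward): activate hidden units from the visible data.
ph0 = sigmoid(v0 @ W)                      # hidden activation probabilities
h0 = (rng.random(n_hidden) < ph0).astype(float)         # sampled hidden state

# 2nd phase (feed backward): reconstruct the input, re-activate hidden units.
pv1 = sigmoid(h0 @ W.T)                    # reconstructed input layer
ph1 = sigmoid(pv1 @ W)

# CD-1 weight update: positive association minus negative association.
W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
print(W.shape, np.isfinite(W).all())
```

Repeating this update over the training set nudges the weights so that reconstructions match the data, which is how the reconstruction error described above is driven down.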

Restricted Boltzmann Machines


Types of RBM :
There are mainly two types of Restricted Boltzmann Machine (RBM) based
on the types of variables they use:
1. Binary RBM: In a binary RBM, the input and hidden units are binary
variables. Binary RBMs are often used in modeling binary data such
as images or text.
2. Gaussian RBM: In a Gaussian RBM, the input and hidden units are
continuous variables that follow a Gaussian distribution. Gaussian
RBMs are often used in modeling continuous data such as audio
signals or sensor data.
Apart from these two types, there are also variations of RBMs such as:
1. Deep Belief Network (DBN): A DBN is a type of generative model
that consists of multiple layers of RBMs. DBNs are often used in
modeling high-dimensional data such as images or videos.
2. Convolutional RBM (CRBM): A CRBM is a type of RBM that is
designed specifically for processing images or other grid-like
structures. In a CRBM, the connections between the input and
hidden units are local and shared, which makes it possible to
capture spatial relationships between the input units.
3. Temporal RBM (TRBM): A TRBM is a type of RBM that is designed
for processing temporal data such as time series or video frames. In
a TRBM, the hidden units are connected across time steps, which
allows the network to model temporal dependencies in the data.
