Unit 1 Part 1
In the fast-evolving era of artificial intelligence, Deep Learning stands as a cornerstone technology,
revolutionizing how machines understand, learn, and interact with complex data. At its essence,
Deep Learning AI mimics the intricate neural networks of the human brain, enabling computers to
autonomously discover patterns and make decisions from vast amounts of unstructured data. This
transformative field has propelled breakthroughs across various domains, from computer vision and
natural language processing to healthcare diagnostics and autonomous driving.
As we dive into this introductory exploration of Deep Learning, we uncover its foundational
principles, applications, and the underlying mechanisms that empower machines to achieve human-
like cognitive abilities. This article serves as a gateway into understanding how Deep Learning is
reshaping industries, pushing the boundaries of what’s possible in AI, and paving the way for a future
where intelligent systems can perceive, comprehend, and innovate autonomously.
Deep learning is defined as the branch of machine learning that is based on artificial neural network architectures. An artificial neural network (ANN) uses layers of interconnected nodes, called neurons, that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the
input layer. The output of one neuron becomes the input to other neurons in the next layer of the
network, and this process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of nonlinear transformations,
allowing the network to learn complex representations of the input data.
Scope of Deep Learning
Today Deep learning AI has become one of the most popular and visible areas of machine learning,
due to its success in a variety of applications, such as computer vision, natural language processing,
and Reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as reinforcement machine learning, and it uses a variety of methods to process each of these.
Artificial neural networks are built on the principles of the structure and operation of human
neurons. It is also known as neural networks or neural nets. An artificial neural network’s input layer,
which is the first layer, receives input from external sources and passes it on to the hidden layer,
which is the second layer. Each neuron in the hidden layer gets information from the neurons in the
previous layer, computes the weighted total, and then transfers it to the neurons in the next layer.
These connections are weighted, which means that the influence of each input from the preceding layer is scaled by giving it a distinct weight. These weights are then adjusted during the training process to enhance the performance of the model.
Fully Connected Artificial Neural Network
Artificial neurons, also known as units, are found in artificial neural networks. The whole Artificial
Neural Network is composed of these artificial neurons, which are arranged in a series of layers. The
complexity of a neural network, such as whether a layer has a dozen units or millions of units, depends on the complexity of the underlying patterns in the dataset. Commonly, an Artificial Neural Network has an input layer, an output layer, as well as hidden layers. The input layer receives data from the
outside world which the neural network needs to analyze or learn about.
In a fully connected artificial neural network, the layers are connected one after the other and each neuron's output feeds the neurons of the next layer, as described above. Then,
after passing through one or more hidden layers, this data is transformed into valuable data for the
output layer. Finally, the output layer provides an output in the form of an artificial neural network’s
response to the data that comes in.
In most neural networks, units are linked to one another from one layer to the next. Each of
these links has weights that control how much one unit influences another. The neural network
learns more and more about the data as it moves from one unit to another, ultimately producing an
output from the output layer.
Machine Learning vs. Deep Learning:
Machine Learning: Takes less time to train the model. A model is created from relevant features which are manually extracted from images to detect an object in the image.
Deep Learning: Takes more time to train the model. Relevant features are automatically extracted from images; it is an end-to-end learning process.
Deep Learning models are able to automatically learn features from the data, which makes them
well-suited for tasks such as image recognition, speech recognition, and natural language processing.
The most widely used architectures in deep learning are feedforward neural networks, convolutional
neural networks (CNNs), and recurrent neural networks (RNNs).
1. Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of
information through the network. FNNs have been widely used for tasks such as image
classification, speech recognition, and natural language processing.
2. Convolutional Neural Networks (CNNs) are designed specifically for image and video recognition tasks.
CNNs are able to automatically learn features from the images, which makes them well-
suited for tasks such as image classification, object detection, and image segmentation.
3. Recurrent Neural Networks (RNNs) are a type of neural network that is able to process
sequential data, such as time series and natural language. RNNs are able to maintain an
internal state that captures information about the previous inputs, which makes them well-
suited for tasks such as speech recognition, natural language processing, and language
translation.
The main applications of deep learning AI can be divided into computer vision, natural language
processing (NLP), and reinforcement learning.
1. Computer vision
The first deep learning application is computer vision. In computer vision, deep learning models can enable machines to identify and understand visual data. Some of the main applications of deep
learning in computer vision include:
Object detection and recognition: Deep learning model can be used to identify and locate
objects within images and videos, making it possible for machines to perform tasks such as
self-driving cars, surveillance, and robotics.
Image classification: Deep learning models can be used to classify images into categories
such as animals, plants, and buildings. This is used in applications such as medical imaging,
quality control, and image retrieval.
Image segmentation: Deep learning models can be used for image segmentation into
different regions, making it possible to identify specific features within images.
2. Natural language processing (NLP)
The second deep learning application is NLP. In NLP, deep learning models can enable machines to understand and generate human language. Some of the main applications of deep learning in NLP include:
Automatic Text Generation – Deep learning model can learn the corpus of text and new text
like summaries, essays can be automatically generated using these trained models.
Language translation: Deep learning models can translate text from one language to
another, making it possible to communicate with people from different linguistic
backgrounds.
Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text,
making it possible to determine whether the text is positive, negative, or neutral. This is used
in applications such as customer service, social media monitoring, and political analysis.
Speech recognition: Deep learning models can recognize and transcribe spoken words,
making it possible to perform tasks such as speech-to-text conversion, voice search, and
voice-controlled devices.
3. Reinforcement learning:
In reinforcement learning, deep learning is used to train agents to take actions in an environment to maximize a reward. Some of the main applications of deep learning in reinforcement learning
include:
Game playing: Deep reinforcement learning models have been able to beat human experts
at games such as Go, Chess, and Atari.
Robotics: Deep reinforcement learning models can be used to train robots to perform
complex tasks such as grasping objects, navigation, and manipulation.
Control systems: Deep reinforcement learning models can be used to control complex
systems such as power grids, traffic management, and supply chain optimization.
Deep learning has made significant advancements in various fields, but there are still some
challenges that need to be addressed. Here are some of the main challenges in deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, so gathering enough training data is a major concern.
2. Interpretability: Deep learning models are complex and work like a black box, so it is very difficult to interpret their results.
3. Overfitting: When the model is trained again and again, it becomes too specialized for the training data, leading to overfitting and poor performance on new data.
Advantages of Deep Learning:
1. High accuracy: Deep Learning algorithms can achieve state-of-the-art performance in various
tasks, such as image recognition and natural language processing.
2. Automated feature engineering: Deep Learning algorithms can automatically discover and
learn relevant features from data without the need for manual feature engineering.
3. Scalability: Deep Learning models can scale to handle large and complex datasets, and can
learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can handle
various types of data, such as images, text, and speech.
5. Continual improvement: Deep Learning models can continually improve their performance
as more data becomes available.
Disadvantages of Deep Learning:
1. High computational requirements: Deep Learning AI models require large amounts of data and computational resources to train and optimize.
2. Requires large amounts of labeled data: Deep Learning models often require a large amount of labeled data for training, which can be expensive and time-consuming to acquire.
3. Black-box nature: Deep Learning models are often treated as black boxes, making it difficult to understand how they work and how they arrived at their predictions.
Conclusion
In conclusion, the field of Deep Learning represents a transformative leap in artificial intelligence. By
mimicking the human brain’s neural networks, Deep Learning AI algorithms have revolutionized
industries ranging from healthcare to finance, from autonomous vehicles to natural language
processing. As we continue to push the boundaries of computational power and dataset sizes, the
potential applications of Deep Learning are limitless. However, challenges such as interpretability and
ethical considerations remain significant. Yet, with ongoing research and innovation, Deep Learning
promises to reshape our future, ushering in a new era where machines can learn, adapt, and solve
complex problems at a scale and speed previously unimaginable.
2. Nonlinear Transformations
Purpose: Nonlinear activation functions (e.g., ReLU, Sigmoid, Tanh) enable networks
to model complex, non-linear relationships.
Principle: Use activation functions that balance computational efficiency and
gradient propagation (ReLU is often preferred).
3. Parameter Sharing
Concept: Techniques like convolution in Convolutional Neural Networks (CNNs) share
parameters across spatial dimensions.
Principle: Parameter sharing reduces the number of learnable parameters and helps
encode inductive biases like translational invariance.
4. Regularization
Purpose: Prevent overfitting and improve generalization.
o Techniques include L1/L2 regularization, dropout, and batch normalization.
Principle: Incorporate regularization to balance training accuracy with test
performance.
5. Optimization Efficiency
Concept: Deep networks rely on optimization algorithms (e.g., SGD, Adam) to update
weights.
Principle: Use gradient-based optimization methods with adaptive learning rates for
efficient convergence.
6. Modularity
Design Philosophy: Architectures are built as modular blocks (e.g., residual blocks in
ResNets, transformers in NLP).
Principle: Modularity aids in scalability, debugging, and transferability.
7. Skip Connections
Purpose: Mitigate vanishing gradient problems and improve gradient flow by
allowing direct information paths.
o E.g., Residual Networks (ResNets) use skip connections.
Principle: Use skip or residual connections in very deep networks.
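To make the idea concrete, here is a minimal sketch of a residual block, assuming the Keras functional API; the layer sizes and the helper name residual_block are illustrative, not taken from the text above.

from keras import layers

def residual_block(x, units=64):
    # identity "skip" path; assumes x already has `units` features so the shapes match
    shortcut = x
    h = layers.Dense(units, activation='relu')(x)   # transformation path
    h = layers.Dense(units)(h)
    h = layers.Add()([h, shortcut])                 # merge the skip path and the transformed path
    return layers.Activation('relu')(h)

The addition gives gradients a direct path back to earlier layers, which is what mitigates the vanishing-gradient problem mentioned above.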
8. Attention Mechanisms
Use: Assign different importance to different parts of the input.
o E.g., Self-attention in Transformers.
Principle: Incorporate attention for tasks requiring a focus on specific data regions
(e.g., NLP, vision).
9. Scalability
Design: Architectures should scale well with increasing data, layers, and hardware
(e.g., efficient models like MobileNets for low-power devices).
Principle: Adapt architectures for both large-scale and resource-constrained
environments.
- **Weights**: These are the coefficients that represent the strength of the connections
between neurons in different layers. They are multiplied by input data to propagate
information forward through the network.
- **Biases**: Bias terms are added to the weighted sum of inputs to adjust the output of the
neuron independently of the inputs. They help the network learn patterns that don’t pass
through the origin (zero point).
**Key Properties**:
- Parameters are optimized using algorithms like **gradient descent** during training.
- The number of parameters depends on the architecture of the network and the number of
neurons in each layer.
2. **Hidden Layers**: These layers perform the bulk of the computation and learning in the
network. They consist of multiple neurons that apply an activation function to the weighted
inputs.
- **Fully Connected (Dense) Layer**: Each neuron is connected to all neurons in the
previous and next layers.
- **Convolutional Layer (Conv Layer)**: Used in Convolutional Neural Networks (CNNs), it
applies filters to the input data to capture spatial hierarchies in images.
- **Recurrent Layer (RNN Layer)**: Used in Recurrent Neural Networks (RNNs) for
sequential data, where the output of one step is fed into the next.
- **Pooling Layer**: Often used in CNNs, it reduces the dimensionality of the input by
downsampling (e.g., max pooling, average pooling).
3. **Output Layer**: The final layer that produces the output of the network. In
classification tasks, the output layer typically uses a **softmax** or **sigmoid** activation
function to produce probabilities.
**Activation Functions**:
- **ReLU (Rectified Linear Unit)**: Popular due to its simplicity and effectiveness, it
introduces non-linearity by outputting the input if positive and zero otherwise.
- **Sigmoid**: Used for binary classification, it squashes values between 0 and 1.
- **Tanh**: Similar to sigmoid but squashes values between -1 and 1.
The total number of parameters is computed based on the number of connections and
biases between layers.
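As a small worked example (with hypothetical layer sizes): a fully connected layer with 784 inputs and 128 units has 784 weights per neuron plus one bias, so

inputs, units = 784, 128
n_params = inputs * units + units   # 784*128 + 128 = 100,480 trainable parameters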
In the process of building a neural network, one of the choices you get to make is what Activation
Function to use in the hidden layer as well as at the output layer of the network. This article
discusses Activation functions in Neural Networks.
An activation function in the context of neural networks is a mathematical function applied to the
output of a neuron. The purpose of an activation function is to introduce non-linearity into the
model, allowing the network to learn and represent complex patterns in the data. Without non-
linearity, a neural network would essentially behave like a linear regression model, regardless of the
number of layers it has.
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to introduce
non-linearity into the output of a neuron.
Explanation: We know, the neural network has neurons that work in correspondence with weight,
bias, and their respective activation function. In a neural network, we would update the weights and
biases of the neurons on the basis of the error at the output. This process is known as back-
propagation. Activation functions make the back-propagation possible since the gradients are
supplied along with the error to update the weights and biases.
Input Layer: This layer accepts input features. It provides information from the outside world to the
network, no computation is performed at this layer, nodes here just pass on the
information(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of the
abstraction provided by any neural network. The hidden layer performs all sorts of computation on
the features entered through the input layer and transfers the result to the output layer.
Output Layer: This layer brings the information learned by the network out to the outer world.
A neural network without an activation function is essentially just a linear regression model. The
activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
Mathematical proof
Suppose the hidden layer uses a linear (identity) activation. Here,
W(1) is the vectorized weight matrix assigned to the neurons of the hidden layer (i.e. w1, w2, w3 and w4) and b(1) is its bias vector, while W(2) and b(2) belong to the output layer. Then:
z(1) = W(1) * x + b(1)
a(1) = z(1)
z(2) = W(2) * a(1) + b(2) = W(2) * W(1) * x + W(2) * b(1) + b(2)
a(2) = z(2)
Let,
[W(2) * W(1)] = W
[W(2) * b(1) + b(2)] = b
so the network's output is simply a(2) = W * x + b. This result is again a linear function even after applying a hidden layer; hence we can conclude that no matter how many hidden layers we attach in a neural net, all layers will behave the same way, because the composition of two linear functions is a linear function itself. A neuron cannot learn anything richer with just a linear function attached to it. A non-linear activation function will let it learn according to the difference w.r.t. the error. Hence we need an activation function.
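A quick numerical check of this argument, using numpy and made-up matrix sizes: stacking two layers with a linear activation collapses to the single linear map W = W2*W1, b = W2*b1 + b2.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                  # input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)    # hidden layer (linear)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)    # output layer (linear)

two_layers = W2 @ (W1 @ x + b1) + b2                    # network with a linear "activation"
W, b = W2 @ W1, W2 @ b1 + b2                            # collapsed single linear layer
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))               # True: the composition is still linear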
Linear Function
Equation : A linear function has an equation similar to that of a straight line, i.e. y = x.
No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is nothing but a linear function of the input of the first layer.
Range : -inf to +inf
Uses : The linear activation function is used at just one place, i.e. the output layer.
Issues : If we differentiate a linear function, the result no longer depends on the input "x" and the gradient becomes a constant, so it won't introduce any ground-breaking behavior to our algorithm.
For example : Calculation of price of a house is a regression problem. House price may have any
big/small value, so we can apply linear activation at output layer. Even in this case neural net must
have any non-linear function at hidden layers.
Sigmoid Function
Nature : Non-linear. Notice that for X values between -2 and 2, the Y values are very steep. This means that small changes in x bring about large changes in the value of Y.
Value Range : 0 to 1
Uses : Usually used in the output layer of a binary classification, where the result is either 0 or 1. Since the value of the sigmoid function lies between 0 and 1 only, the result can easily be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function
The activation that works almost always better than the sigmoid function is the Tanh function, also known as the hyperbolic tangent function. It is actually a mathematically shifted version of the sigmoid function. Both are similar and can be derived from each other.
Equation :-
f(x) = tanh(x) = 2/(1 + e^(-2x)) - 1
OR
tanh(x) = 2 * sigmoid(2x) – 1
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in hidden layers of a neural network, as its values lie between -1 and 1, so the mean of the hidden layer's activations comes out to be 0 or very close to it. This helps in centering the data by bringing the mean close to 0, which makes learning for the next layer much easier.
RELU Function
It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of neural networks.
Nature :- non-linear, which means we can easily backpropagate the errors and have multiple
layers of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
In simple words, RELU learns much faster than sigmoid and Tanh function.
Softmax Function
The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.
Nature :- non-linear
Uses :- Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification problems. It squeezes the output for each class to between 0 and 1 and divides by the sum of the outputs.
Output:- The softmax function is ideally used in the output layer of the classifier where we
are actually trying to attain the probabilities to define the class of each input.
The basic rule of thumb is if you really don’t know what activation function to use, then
simply use RELU as it is a general activation function in hidden layers and is used in most
cases these days.
If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.
If your output is for multi-class classification, then Softmax is very useful for predicting the probabilities of each class.
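For reference, here are minimal numpy versions of the activation functions discussed above; the sample vector is illustrative.

import numpy as np

def sigmoid(x):                      # squashes values into (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):                         # squashes values into (-1, 1); tanh(x) = 2*sigmoid(2x) - 1
    return np.tanh(x)

def relu(x):                         # outputs the input if positive, zero otherwise
    return np.maximum(0, x)

def softmax(x):                      # squashes outputs to (0, 1) so that they sum to 1
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(relu(z), softmax(z))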
Forward propagation
While forward propagation refers to the computational process of predicting an output for a given
input vector x, backpropagation and gradient descent describe the process of improving the weights
and biases of the network in order to make better predictions. Let’s look at this in practice.
For a given input vector x the neural network predicts an output, which is generally called a
prediction vector y.
(Figure: a feedforward neural network.)
The equations describing the mathematics happening during the prediction vector's computation (forward propagation) look like this:
h = f(W1 · x)
y = f(W2 · h)
where f denotes the non-linear activation function.
We must compute a dot-product between the input vector x and the weight matrix W1 that connects
the first layer with the second. After that, we apply a non-linear activation function to the result of
the dot-product.
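The same computation in code, as a sketch with made-up shapes and sigmoid as the non-linear activation:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
x  = rng.normal(size=4)                  # input vector x
W1 = rng.normal(size=(5, 4))             # weights between the input and hidden layer
W2 = rng.normal(size=(3, 5))             # weights between the hidden and output layer

h = sigmoid(W1 @ x)                      # dot product, then the non-linear activation
y = sigmoid(W2 @ h)                      # prediction vector y
print(y)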
A loss function measures how good a neural network model is in performing a certain task,
which in most cases is regression or classification.
We must minimize the value of the loss function during the backpropagation step in order to
make the neural network better.
We only use the cross-entropy loss function in classification tasks when we want the neural
network to predict probabilities.
For regression tasks, when we want the network to predict continuous numbers, we must
use the mean squared error loss function.
We use mean absolute percentage error loss function during demand forecasting to keep an
eye on the performance of the network during training time.
The prediction vector can represent a number of things depending on the task we want the network
to do. For regression tasks, which are basically predictions of continuous variables (e.g. stock price,
expected demand for products, etc.), the output vector y contains continuous numbers.
On the other hand, for classification tasks, such as customer segmentation or image classification,
the output vector y represents probability scores between 0.0 and 1.0.
The value we want the neural network to predict is called a ground truth label, which is usually
represented as y_hat. A predicted value y closer to the label suggests a better performance of the
neural network.
Regardless of the task, we somehow have to measure how close our predictions are to the ground
truth label.
Since the prediction vector y(θ) is a function of the neural network’s weights (which we abbreviate to
θ), the loss is also a function of the weights.
Since the loss depends on weights, we must find a certain set of weights for which the value of the
loss function is as small as possible. We achieve this mathematically through a method
called gradient descent.
The value of this loss function depends on the difference between the label y_hat and y. A higher
difference means a higher loss value while (you guessed it) a smaller difference means a smaller loss
value. Minimizing the loss function directly leads to more accurate predictions of the neural network
as the difference between the prediction and the label decreases.
In fact, the neural network’s only objective is to minimize the loss function. This is because
minimizing the loss function automatically causes the neural network model to make better
predictions regardless of the exact characteristics of the task at hand.
A neural network solves tasks without being explicitly programmed with a task-specific rule. This is
possible because the goal of minimizing the loss function is universal and doesn’t depend on the task
or circumstances.
That said, you still have to select the right loss function for the task at hand. Luckily there are only
three loss functions you need to know to solve almost any problem.
Mean squared error (MSE) loss function is the sum of squared differences between the entries in the
prediction vector y and the ground truth vector y_hat.
MSE = (1/N) * Σ (y_hat_i - y_i)^2
You divide the sum of squared differences by N, which corresponds to the length of the vectors. If
the output y of your neural network is a vector with multiple entries then N is the number of the
vector entries with y_i being one particular entry in the output vector.
The mean squared error loss function is the perfect loss function if you're dealing with a regression problem, that is, if you want your neural network to predict a continuous scalar value such as a stock value.
import numpy as np
y_hat = [1, 1, 2, 2, 4]                 # ground truth values
y_pred = [0.7, 1.3, 2.1, 1.8, 3.6]      # predictions (illustrative values)
mse = np.mean((np.array(y_hat) - np.array(y_pred)) ** 2)
Regression is only one of two areas where feedforward networks enjoy great popularity. The other
area is classification.
In classification tasks, we deal with predictions of probabilities, which means the output of a neural
network must be in a range between zero and one. A loss function that can measure the error
between a predicted probability and the label which represents the actual class is called the cross-
entropy loss function.
One important thing we need to discuss before continuing with the cross-entropy is what exactly the
ground truth vector looks like in the case of a classification problem.
The label vector y_hat is one hot encoded which means the values in this vector can only take
discrete values of either zero or one. The entries in this vector represent different classes. The values
of these entries are zero, except for a single entry which is one. This entry tells us the class into which
we want to classify the input feature vector x.
The prediction y, however, can take continuous values between zero and one.
Given the prediction vector y and the ground truth vector y_hat you can compute the cross-entropy loss between those two vectors as follows:
cross-entropy = - Σ_i y_hat_i * log(y_i)
First, we need to sum up the products between the entries of the label vector y_hat and the logarithms of the entries of the predictions vector y. Then we must negate the sum to get a positive value of the loss function.
One interesting thing to consider is the plot of the cross-entropy loss function. In the following graph,
you can see the value of the loss function (y-axis) vs. the predicted probability y_i. Here y_i takes
values between zero and one.
We can see clearly that the cross-entropy loss function grows exponentially for lower values of the
predicted probability y_i. For y_i=0 the function becomes infinite, while for y_i=1 the neural network
makes an accurate probability prediction and the loss value goes to zero.
Here's another code snippet in Python where I've calculated the cross-entropy loss function:
import numpy as np
y_hat = [0, 1, 0, 0]                    # one-hot ground truth vector
y_pred = [0.1, 0.7, 0.1, 0.1]           # predicted probabilities (illustrative values)
cross_entropy = - np.sum(np.log(y_pred) * np.array(y_hat))
Finally, we come to the Mean Absolute Percentage Error (MAPE) loss function. This loss function
doesn’t get much attention in deep learning. For the most part, we use it to measure the
performance of a neural network during demand forecasting tasks.
Demand forecasting is the area of predictive analytics dedicated to predicting the expected demand
for a good or service in the near future. For example:
In retail, we can use demand forecasting models to determine the amount of a particular
product that should be available and at what price.
In industrial manufacturing, we can predict how much of each product should be produced,
the amount of stock that should be available at various points in time, and when
maintenance should be performed.
In the travel and tourism industry, we can use demand forecasting models to assess optimal
price points for flights and hotels, in light of available capacity, what price should be assigned
(for hotels, flights), which destinations should be spotlighted, or, what types of packages
should be advertised.
Although demand forecasting is also a regression task and minimizing the MSE loss function is an adequate training goal, the raw MSE value isn't a suitable measure of the model's performance during training for demand forecasting.
Why is that?
Well, imagine the MSE loss function gives you a value of 100. Can you tell if this is generally a good
result? No, because it depends on the situation. If the prediction y of the model is 1000 and the
actual ground truth label y_hat is 1010, then the MSE loss of 100 would be in fact a very small error
and the performance of the model would be quite good.
However in the case where the prediction would be five and the label is 15, you would have the
same loss value of 100 but the relative deviation to the ground-truth value would be much higher
than in the previous case.
This example shows the shortcoming of the mean squared error function as the loss function for the
demand forecasting models. For this reason, I strongly recommend using mean absolute percentage
error (MAPE).
The mean absolute percentage error, also known as mean absolute percentage deviation (MAPD)
usually expresses accuracy as a percentage. We define it with the following equation:
MAPE = (100% / N) * Σ |(y_hat_i - y_i) / y_hat_i|
In this equation, y_i is the predicted value and y_hat is the label. We divide the difference between
y_i and y_hat by the actual value y_hat again. Finally, multiplying by 100 percent gives us the
percentage error.
Applying this equation to the example above gives you a more meaningful understanding of the
model’s performance. In the first case, the deviation from the ground truth label would be only one
percent, while in the second case the deviation would be 66 percent.
We see that the performance of these two models is very different. Meanwhile, the MSE loss
function would indicate that the performance of both models is the same.
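The two scenarios above can be recomputed with both metrics to make the contrast explicit; this is a small sketch, with the helper names mse and mape chosen here for illustration.

import numpy as np

def mse(y_hat, y):          # y_hat: ground truth, y: prediction (naming as in the text)
    return np.mean((np.array(y_hat, dtype=float) - np.array(y, dtype=float)) ** 2)

def mape(y_hat, y):
    y_hat, y = np.array(y_hat, dtype=float), np.array(y, dtype=float)
    return np.mean(np.abs((y_hat - y) / y_hat)) * 100

print(mse([1010], [1000]), mape([1010], [1000]))   # 100.0 and ~0.99 %
print(mse([15],   [5]),    mape([15],   [5]))      # 100.0 and ~66.7 %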
What is an Optimizer?
In deep learning, an optimizer is an algorithm that dynamically fine-tunes the network's parameters during training. Its primary role is to minimize the model's loss function by iteratively refining the weights and biases based on the feedback received from the data. Well-known optimizers in deep learning include Stochastic Gradient Descent (SGD), RMSProp, and Adam, each equipped with distinct update rules, learning rates, and momentum strategies, all geared towards the overarching goal of discovering and converging upon optimal model parameters.
Optimizer algorithms are optimization methods that help improve a deep learning model's performance; they widely affect the accuracy and speed of training of the deep learning model. While training the model, optimizers modify each epoch's weights to minimize the loss function. An optimizer adjusts the attributes of the neural network, such as weights and learning rates. Thus, it helps in reducing the overall loss and improving accuracy. The problem of choosing the right weights for the model is a daunting task, as a deep learning model generally consists of millions of parameters.
You can use different optimizers in the machine learning model to change your weights and learning rate. However, choosing the best optimizer depends upon the application. As a beginner, one evil thought that comes to mind is that we try all the possibilities and choose the one that shows the best results. This might be fine initially, but when dealing with hundreds of gigabytes of data, even a single epoch can take a considerable amount of time, so randomly trying every algorithm is little more than gambling with your precious time, as you will realize sooner or later in your journey.
This section covers various deep learning optimizers, such as Gradient Descent, Stochastic Gradient Descent, SGD with momentum, Mini-Batch Gradient Descent, Adagrad, RMSProp, AdaDelta, and Adam. By the end, you can compare the various optimizers and the procedure they are based upon.
Before proceeding, there are a few terms that you should be familiar with.
Epoch – The number of times the algorithm runs on the whole training dataset.
Batch – The number of samples taken from the dataset for updating the model parameters.
Cost Function/Loss Function – Used to calculate the cost, which is the difference between the predicted value and the actual value.
Weights/Bias – The learnable parameters in a model that control the signal between two neurons.
Gradient Descent Deep Learning Optimizer
Gradient Descent can be considered the popular kid among the class of optimizers. This optimization algorithm uses calculus to consistently modify the values and reach the local minimum.
In simple terms, consider you are holding a ball resting at the top of a bowl. When you let go of the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl. The gradient pushes the ball in the steepest direction towards the local minimum, which is the bottom of the bowl.
w_new = w_old - α * ∂L/∂w_old
The above equation shows how the weights are updated against the gradient. Here alpha is the step size that represents how far to move against each gradient with each iteration.
The algorithm follows these steps:
1. Start with Initial Coefficients: Begin with initial values for the model's coefficients.
2. Calculate Cost: Evaluate the cost for the current coefficients.
3. Search for Lower Cost: Look for a cost value lower than the current one.
4. Update Coefficients: Move in the direction of lower cost and update the coefficients' values.
5. Repeat Process: Continue this process iteratively until reaching the minimum.
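A tiny one-dimensional sketch of this procedure, using a made-up convex cost L(w) = (w - 3)^2 so the update rule above can be watched converging:

# 1-D gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3)
w, alpha = 0.0, 0.1          # initial coefficient and step size (learning rate)
for _ in range(100):
    grad = 2 * (w - 3)       # gradient of the cost at the current w
    w = w - alpha * grad     # move against the gradient
print(round(w, 4))           # ~3.0, the minimum of the cost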
Gradient descent works best for most purposes. However, it has some downsides too. It is expensive to calculate the gradients if the size of the data is huge. Gradient descent works well for convex functions, but it doesn't know how far to travel along the gradient when the function is not convex.
Stochastic Gradient Descent Deep Learning Optimizer
At the end of the previous section, you learned why there might be better options than using gradient descent on massive data. To tackle the challenges large datasets pose, we have stochastic gradient descent (SGD): instead of processing the entire dataset during each iteration, we randomly select small batches of data. This implies that only a few samples from the dataset are considered at a time. The procedure is first to select the initial parameters w and learning rate n, then randomly shuffle the data at each iteration and update the parameters until an approximate minimum is reached.
Since we are not using the whole dataset but batches of it for each iteration, the path taken by the algorithm is full of noise as compared to the gradient descent algorithm. Thus, SGD uses a higher number of iterations to reach the minimum, and the overall computation time increases. But even after increasing the number of iterations, the computation cost is still less than that of the gradient descent optimizer. So if the data is enormous and computation time is an essential factor, SGD should be preferred over the batch gradient descent algorithm.
Stochastic Gradient Descent with Momentum
As discussed in the earlier section, stochastic gradient descent takes a much more noisy path than the gradient descent algorithm. Due to this, it requires a higher number of iterations to reach the optimal minimum, and hence, computation time is very slow. To overcome the problem, we use stochastic gradient descent with a momentum algorithm.
What the momentum does is help in faster convergence of the loss function. Stochastic gradient descent oscillates between either direction of the gradient and updates the weights accordingly; adding a fraction of the previous update to the current update makes the process a bit faster. One thing that should be remembered while using this algorithm is that the learning rate should be decreased when a high momentum term is used.
In the above image, the left part shows the convergence graph of the stochastic gradient descent algorithm, while the right side shows SGD with momentum. From the image, you can compare the path chosen by both algorithms and realize that using momentum helps reach convergence in less time. You might be thinking of using a large momentum and learning rate to make the process even faster. But remember that while increasing the momentum, the possibility of overshooting the optimal minimum also increases, which might result in poor accuracy and even more oscillations.
Mini-Batch Gradient Descent Deep Learning Optimizer
In this variant of gradient descent, instead of taking all the training data, only a subset of the dataset is used for calculating the loss function. Since we are using a batch of data instead of the whole dataset, fewer iterations are needed. That is why the mini-batch gradient descent algorithm is faster than both stochastic gradient descent and batch gradient descent, and it is more efficient and robust than the earlier variants of gradient descent. As the algorithm uses batching, all the training data need not be loaded into memory, thus making the process more efficient to implement. Moreover, the cost function in mini-batch gradient descent is noisier than in the batch gradient descent algorithm but smoother than in stochastic gradient descent.
Despite all that, the mini-batch gradient descent algorithm has some downsides too: the mini-batch size is an extra hyperparameter that has to be tuned, although a batch size of 32 is often considered to be appropriate for almost every case. Also, in some cases, it results in poor final accuracy. Due to this, there arises a need to look for other alternatives too.
Adagrad (Adaptive Gradient Descent) Deep Learning
Optimizer
The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms. This is because it uses different learning rates for each iteration. The change in learning rate depends upon the difference in the parameters during training. The more the parameters get changed, the more minor the learning rate changes. This modification is highly beneficial because real-world datasets contain sparse as well as dense features, so it is unfair to have the same value of learning rate for all the features. The Adagrad algorithm uses the below formula to update the weights:
w_t = w_(t-1) - η'_t * ∂L/∂w_(t-1),  where  η'_t = η / sqrt(α(t) + ε)
Here alpha(t) denotes the accumulated sum of squared gradients, which produces a different learning rate at each iteration, η is a constant learning rate, and epsilon is a small positive value added to avoid division by 0.
The benefit of using Adagrad is that it abolishes the need to modify the learning rate manually. It is more reliable than gradient descent algorithms and their variants, and it reaches convergence at a higher speed.
One downside of the AdaGrad optimizer is that it decreases the learning rate aggressively and monotonically. There might be a point when the learning rate becomes extremely small, because the accumulated squared gradients in the denominator keep on increasing. Due to small learning rates, the model eventually becomes unable to acquire more knowledge, and hence the accuracy of the model is compromised.
RMS Prop (Root Mean Square) Deep Learning
Optimizer
RMS prop is one of the popular optimizers among deep learning enthusiasts. This is maybe because it hasn't been formally published but is still very well-known in the community. RMS prop is ideally an extension of RPPROP, which was designed to resolve the problem of varying gradients. The problem with the gradients is that some of them were small while others may be huge. So, defining a single learning rate might not be the best idea. RPPROP uses the gradient sign, adapting the step size individually for each weight. In this algorithm, the two gradients are first compared for signs. If they have the same sign, we're going in the right direction and increase the step size by a small fraction. If they have opposite signs, we must decrease the step size. Then we limit the step size and can now go for the weight update.
The problem with RPPROP is that it doesn't work well with large datasets and mini-batch updates, which is what motivated RMS prop. RMS prop accelerates the optimization process by decreasing the number of function evaluations needed to reach the local minimum. The algorithm keeps a moving average of the squared gradients for every weight and divides the gradient by the square root of this mean square.
In simpler terms, if there exists a parameter due to which the cost function oscillates a lot, we want to penalize the update of this parameter. Suppose you built a model to classify a variety of fishes. The model relies mainly on the factor 'color' to differentiate between the fishes. Due to this, it makes a lot of errors. What RMS Prop does is penalize the parameter 'color' so that the model can rely on other features too. This prevents the algorithm from adapting too quickly to a single parameter compared with plain gradient descent algorithms. The algorithm converges quickly and requires less tuning than gradient descent and its variants.
The problem with RMS Prop is that the learning rate has to be defined manually, and the suggested value doesn't work for every application.
AdaDelta Deep Learning Optimizer
AdaDelta can be seen as a more robust version of AdaGrad. It is based upon adaptive learning and is designed to deal with significant drawbacks of the AdaGrad and RMS prop optimizers. The main problem with the above two optimizers is that the initial learning rate must be defined manually. One other problem is the decaying learning rate, which becomes infinitesimally small at some point. Due to this, a certain number of iterations later, the model can no longer acquire more knowledge.
To deal with these problems, AdaDelta keeps a leaky average of the second moment of the gradient and a leaky average of the second moment of the change in the parameters:
s_t = ρ * s_(t-1) + (1 - ρ) * g_t^2
g'_t = sqrt(Δx_(t-1) + ε) / sqrt(s_t + ε) * g_t
Δx_t = ρ * Δx_(t-1) + (1 - ρ) * (g'_t)^2
x_t = x_(t-1) - g'_t
Here St and delta Xt denote the state variables, g't denotes the rescaled gradient, delta Xt-1 denotes the leaky average of the squared rescaled gradients, and epsilon represents a small positive value added for numerical stability.
Adam Deep Learning Optimizer
The name "Adam" is derived from "adaptive moment estimation," highlighting its ability to adaptively adjust the learning rate for each network weight individually. Unlike SGD, which maintains a single learning rate throughout training, Adam computes individual learning rates for different parameters. Like RMSProp, the Adam optimizer considers the second moment of the gradients, but unlike RMSProp, it also keeps an exponentially decaying average of the first moment (the mean) of the gradients, creating adaptive learning rates that can efficiently navigate the optimization landscape during training. This adaptability helps the optimizer converge faster and more stably across the diverse layers of the network.
The adam optimizer has several benefits, due to which it is used widely. It is straightforward to implement, has a faster running time, low memory requirements, and requires less tuning than other optimization algorithms. Its update rules are:
m_t = B1 * m_(t-1) + (1 - B1) * g_t
v_t = B2 * v_(t-1) + (1 - B2) * g_t^2
m_hat_t = m_t / (1 - B1^t),   v_hat_t = v_t / (1 - B2^t)
w_t = w_(t-1) - η * m_hat_t / (sqrt(v_hat_t) + ε)
The above formulas represent the working of the adam optimizer. Here B1 and B2 represent the decay rates of the moving averages of the gradient and the squared gradient, and epsilon is a small positive constant.
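A single Adam update step written out in numpy, following the formulas above; the default hyperparameter values (lr = 0.001, b1 = 0.9, b2 = 0.999) are the commonly used ones, and the sample numbers are illustrative.

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment (mean of the gradients)
    v = b2 * v + (1 - b2) * grad ** 2       # second moment (mean of the squared gradients)
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([0.5]), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, grad=np.array([0.2]), m=m, v=v, t=1)
print(w)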
If the adam optimizer uses the good properties of all the algorithms and is the best available optimizer, then why shouldn't you use Adam in every application? And what was the need to learn about other algorithms in depth? This is because even Adam has some downsides. It tends to focus on faster computation time, whereas algorithms like stochastic gradient descent focus on the data points. That's why algorithms like SGD may generalize the data in a better manner, at the cost of low computation speed.
Hands-on Optimizers
So far, we have covered the optimizers and their theoretical analysis. It's time to try what we have learned and compare the results by training a model. For keeping things simple, what's better than the MNIST dataset? We will train a simple model using some basic layers, keeping the batch size and epochs the same but with different optimizers. For the sake of fairness, we will use the same architecture for every optimizer (the sketch below assumes Keras; the MNIST loading and preprocessing steps are omitted).
# Imports and the input shape added so the snippet runs (Keras; MNIST loading/preprocessing omitted)
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

num_classes = 10
epochs = 10
batch_size = 64
input_shape = (28, 28, 1)   # MNIST image shape

def build_model(optimizer):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
Train the Model
optimizers = ['Adadelta', 'Adagrad', 'Adam', 'RMSprop', 'SGD']
for i in optimizers:
    model = build_model(i)
    # assumes x_train, y_train, x_test, y_test hold the preprocessed MNIST data
    hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                     validation_data=(x_test, y_test), verbose=1)
We have run our model with a batch size of 64 for 10 epochs. After trying the
different optimizers, the results we get are pretty interesting. Before analyzing
the results, what do you think will be the best optimizer for this dataset?
Table Analysis
The above table shows the validation accuracy and loss at different epochs. It also contains the total time that the model took to run on 10 epochs for each optimizer. From the above table, we can make the following analysis:
The Adam optimizer reaches the best accuracy in a satisfactory amount of time.
SGD produces good results as well, but to reach the accuracy of the Adam optimizer, it will require more iterations, and hence the computation time will increase.
SGD with momentum shows an unexpectedly larger computation time here, which means the value of momentum taken needs to be optimized.
Adadelta shows poor results both with accuracy and computation time.
You can analyze the accuracy of each optimizer with each epoch from the below graph.
We've now reached the end of this comprehensive guide. To refresh your memory, the key points are summarized below.
Summary
SGD is a very basic algorithm and is hardly used in applications now due to its
slow computation speed. One more problem with that algorithm is the constant
learning rate for every epoch. Moreover, it is not able to handle saddle points
very well. Adagrad works better than stochastic gradient descent generally due
to frequent updates in the learning rate. It is best when used for dealing with
sparse data. RMSProp shows results similar to those of the gradient descent algorithm with momentum; it just differs in the way the gradients are calculated.
Lastly comes the Adam optimizer, which inherits the good features of RMSProp and other algorithms. The results of the Adam optimizer are generally better than those of every other optimization algorithm; it has a faster computation time and requires fewer parameters for tuning. Because of all that, Adam is recommended as the default optimizer for most applications. Choosing the Adam optimizer for your application might give you the best probability of getting the best results. But by the end, we learned that even the Adam optimizer has some downsides. Also, there are cases when algorithms like SGD might be beneficial and perform better, so you should consider your application and the type of data you are dealing with when choosing the best optimization algorithm.
Conclusion
Key Takeaways
Gradient Descent and its variants, Adagrad, RMS Prop, AdaDelta, and Adam are all popular deep-learning optimizers.
Each optimizer has its own strengths and weaknesses, and the choice of optimizer depends on the specific task and the characteristics of the data.
The choice of optimizer can significantly impact the speed and quality of training, as well as the final performance of the deep learning model.
Hyperparameters in Machine Learning
Hyperparameters are used to improve the learning of the model, and their values are set before starting the learning process of the model. In this topic, we are going to discuss one of the most important concepts of machine learning: hyperparameters.
Hyperparameters are the settings that are used by learning algorithms to provide the best result. So, what are these hyperparameters? They are defined as "the parameters that are explicitly defined by the user to control the learning process."
Here the prefix "hyper" suggests that the parameters are top-level parameters
that are used in controlling the learning process. The value of the
Hyperparameter is selected and set by the machine learning engineer before the
learning algorithm begins training the model. Hence, these are external to the
model, and their values cannot be changed during the training process.
Some examples of hyperparameters are:
o Batch Size
o Number of Epochs
Sometimes people get confused between model parameters and model hyperparameters. So, in order to clear this confusion, let's understand the difference between both of them and how they are related to each other.
Model Parameters:
Model parameters are configuration variables that are internal to the model, and the model learns them on its own, for example the weights and biases of a neural network, or the cluster centroids in clustering. Some key points for model parameters are as follows:
o These are part of the model and key to a machine learning algorithm.
Model Hyperparameters:
Hyperparameters are those parameters that are explicitly defined by the user to
control the learning process. Some key points for model parameters are as
follows:
o These are usually defined manually by the machine learning engineer.
o One cannot know the exact best value of a hyperparameter for the given problem. The best value can be determined either by a rule of thumb or by trial and error.
Categories of Hyperparameters
Broadly, hyperparameters can be divided into two categories, which are given below:
1. Hyperparameters for Optimization
o Learning Rate: The learning rate is the hyperparameter in optimization algorithms that controls how much the model needs to change in response to the estimated error each time the model's weights are updated. Selecting an optimized learning rate is a challenging task, because if the learning rate is very small, then it may slow down the training process. On the other hand, if the learning rate is too large, then it may not optimize the model properly.
o Batch Size: To enhance the speed of the learning process, the training set is divided into different subsets, which are known as batches.
o Number of Epochs: An epoch can be defined as one complete cycle of training the model on the whole dataset. The number of epochs varies from model to model, and various models are created with more than one epoch. To determine the right number of epochs, the validation error is taken into account: the number of epochs is increased until the validation error starts to increase.
2. Hyperparameters for Specific Models
Hyperparameters that are involved in the structure of the model are known as hyperparameters for specific models. These include:
o Number of Hidden Units: Hidden units belong to the hidden layers of a neural network. The number of hidden units should be between the size of the input layer and the size of the output layer; more specifically, a common rule of thumb is 2/3 of the size of the input layer, plus the size of the output layer. For complex functions, it is necessary to specify the number of hidden units, but the network should not overfit the data. See the worked example after this list.
o Number of Layers: A neural network is made up of vertically arranged components, which are called layers. There are mainly input layers, hidden layers, and output layers.
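As a quick illustration of the hidden-units rule of thumb above: for a hypothetical network with 12 input features and 3 output classes, it suggests roughly 2/3 × 12 + 3 = 11 hidden units; in practice the value is still tuned empirically.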
Conclusion
Hyperparameters are the parameters that are explicitly defined to control the learning process. They are used to specify the learning capacity and complexity of the model. Some hyperparameters are used for the optimization of the model, such as batch size and learning rate, and some are specific to the model, such as the number of hidden layers.
There are several popular frameworks to deploy deep learning networks, ranging
from end-to-end solutions that allow both model building and deployment to
specialized tools for deployment optimization. Here are the major frameworks:
1. TensorFlow Serving
Key Features:
o A high-performance system for serving TensorFlow models in production environments.
o Supports versioned, scalable deployments.
2. TensorFlow Lite
Key Features:
o A lightweight runtime for running TensorFlow models on mobile and embedded devices, including edge chips.
3. TorchServe
Key Features:
o Serves PyTorch models in production environments.
o Offers metrics for monitoring models and has native support for common deployment environments.
4. Apache MXNet
Key Features:
o Supports deployment optimizations such as model quantization.
5. NVIDIA TensorRT
Key Features:
o A high-performance inference optimizer and runtime for deploying models on NVIDIA GPUs.
6. AWS SageMaker
Key Features:
o A fully managed cloud service for building, training, and deploying machine learning models.
7. Model management and interchange tooling
Key Features:
o Supports models exchanged through the open ONNX format.
o Provides tools for automating model management, version control, and deployment.
8. Microsoft Azure (Azure Machine Learning)
Key Features:
o A managed cloud service for model training, deployment, and monitoring.
Building Blocks of Deep Neural Networks
Deep neural networks (DNNs) are composed of several key components, each
serving a specific function in learning, processing, and producing output. Here’s a
breakdown of the essential building blocks:
1. Neurons (Artificial Neurons)
Purpose: The basic units of a neural network that take input, process it, and
pass an output.
Components:
o Inputs: Features or data points.
o Weights (w): Coefficients that adjust the importance of each input.
o Bias (b): A trainable parameter that helps the model fit data better by shifting the output.
o Activation Function (f(x)): Determines the output of a neuron (e.g., ReLU, Sigmoid).
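Putting these components together, a single artificial neuron can be sketched in a few lines of numpy; the numbers and the ReLU choice are illustrative.

import numpy as np

x = np.array([0.5, -1.0, 2.0])     # inputs (features)
w = np.array([0.8, 0.1, -0.4])     # weights: importance of each input
b = 0.2                            # bias: shifts the output

z = np.dot(w, x) + b               # weighted sum plus bias
output = max(0.0, z)               # activation function (ReLU)
print(z, output)                   # -0.3  0.0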
2. Layers
Input Layer:
o Purpose: Receives the input data.
o The number of neurons depends on the dimensionality of the input
data (e.g., pixels in an image).
Hidden Layers:
o Purpose: Perform computations and learn complex patterns in the
data.
o Consists of multiple neurons that transform inputs using weights,
biases, and activation functions.
o The depth of the network refers to the number of hidden layers.
Output Layer:
o Purpose: Produces the final output of the network (e.g., class
probabilities in classification).
o The number of neurons depends on the output format (e.g., binary
classification has one neuron with a Sigmoid function).
5. Loss Function
Purpose: Measures the difference between the predicted output and the
actual target.
Common Loss Functions:
o Mean Squared Error (MSE) for regression problems.
o Cross-Entropy Loss for classification problems.
6. Optimization Algorithms
Purpose: Adjust the weights and biases to minimize the loss function.
Common Optimizers:
Stochastic Gradient Descent (SGD)
Adam (Adaptive Moment Estimation)
RMSprop
7. Forward Propagation
The process by which input data passes through the network, layer by layer,
to produce an output prediction.
Involves multiplying inputs by weights, adding bias, and applying the
activation function.
9. Regularization Techniques
Prevent overfitting by reducing model complexity.
Common Regularization Methods:
L2 Regularization (Ridge Regression): Adds the squared magnitude of the
weights to the loss function.
L1 Regularization (Lasso Regression): Adds the absolute magnitude of the
weights to the loss function.
Dropout: Randomly disables neurons during training to improve
generalization.
Batch Normalization: Normalizes input across a batch to stabilize learning
and speed up convergence.
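A minimal sketch, assuming Keras, of how the regularization methods listed above are typically attached to a layer stack; the layer sizes are made up.

from keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(20,)),                               # made-up input dimension
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),  # L2 penalty on the weights
    layers.BatchNormalization(),                             # normalize activations across the batch
    layers.Dropout(0.5),                                     # randomly disable neurons during training
    layers.Dense(1, activation='sigmoid'),
])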
Summary
Neurons: Basic computational units.
Layers (Input, Hidden, Output): Organize the data flow through the network.
Activation Functions: Introduce non-linearity.
Loss Functions: Measure prediction accuracy.
Optimization Algorithms: Minimize the loss function.
Regularization Techniques: Prevent overfitting.
Forward and Backward Propagation: Train and refine the network.
Restricted Boltzmann Machine:-
Introduction :
Restricted Boltzmann Machine (RBM) is a type of artificial neural network
that is used for unsupervised learning. It is a type of generative model
that is capable of learning a probability distribution over a set of input
data.
RBMs were popularized in the mid-2000s by Hinton and Salakhutdinov as a way to address the problem of unsupervised learning. It is a type of neural
network that consists of two layers of neurons – a visible layer and a
hidden layer. The visible layer represents the input data, while the hidden
layer represents a set of features that are learned by the network.
The RBM is called “restricted” because the connections between the
neurons in the same layer are not allowed. In other words, each neuron in
the visible layer is only connected to neurons in the hidden layer, and vice
versa. This allows the RBM to learn a compressed representation of the
input data by reducing the dimensionality of the input.
The RBM is trained using a process called contrastive divergence, which is
a variant of the stochastic gradient descent algorithm. During training, the
network adjusts the weights of the connections between the neurons in
order to maximize the likelihood of the training data. Once the RBM is
trained, it can be used to generate new samples from the learned
probability distribution.
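A compact numpy sketch of one contrastive-divergence (CD-1) training step for a binary RBM, following the description above; the layer sizes, learning rate, and variable names are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))   # visible-to-hidden weights (no intra-layer links)
b_v = np.zeros(n_visible)                               # visible biases
b_h = np.zeros(n_hidden)                                # hidden biases

def cd1_step(v0):
    """One contrastive-divergence (CD-1) update for a single binary training vector v0."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)    # sample hidden states
    # Negative phase: reconstruct the visible layer, then re-infer the hidden layer
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # Move the weights toward a higher likelihood of the training data
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)

v_example = np.array([1, 0, 1, 1, 0, 0], dtype=float)   # an illustrative training vector
cd1_step(v_example)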
RBM has found applications in a wide range of fields, including computer
vision, natural language processing, and speech recognition. It has also
been used in combination with other neural network architectures, such
as deep belief networks and deep neural networks, to improve their
performance.
What are Boltzmann Machines?
A Boltzmann machine is a network of neurons in which all the neurons are connected to each other. In this machine, there are two layers, named the visible (or input) layer and the hidden layer. The visible layer is denoted as v and the hidden layer is denoted as h. In a Boltzmann machine, there is no output layer.
Boltzmann machines are random and generative neural networks capable
of learning internal representations and are able to represent and (given
enough time) solve tough combinatoric problems.
The Boltzmann machine is based on the Boltzmann distribution (also known as the Gibbs distribution), which is an integral part of statistical mechanics and explains the impact of parameters like entropy and temperature on quantum states in thermodynamics. Because of this, Boltzmann machines are also known as Energy-Based Models (EBM). The Boltzmann machine was invented in 1985 by Geoffrey Hinton, then a Professor at Carnegie Mellon University, and Terry Sejnowski, then a Professor at Johns Hopkins University.