DL Unit1 Notes
1. What is Deep Learning? Explain its applications. (5 Marks)
Deep Learning (DL) is a subset of Machine Learning (ML) that allows us to train a model using a set
of inputs and then predict the output.
It is essentially a neural network with three or more layers.
These neural networks attempt to simulate the behavior of the human brain allowing it to “learn” from
large amounts of data.
Deep learning drives many artificial intelligence (AI) applications and services that improve
automation, performing analytical and physical tasks without human intervention. Deep learning
technology lies behind everyday products and services (such as digital assistants, voice-enabled TV
remotes, and credit card fraud detection) as well as emerging technologies (such as self-driving cars).
Applications:
1. Self-driving Cars: Self-driving cars, one of the most fascinating technologies, are designed using
deep neural networks at a high level. Their machine learning algorithms detect objects around the
car, estimate the distance between the car and other vehicles, locate the footpath, identify traffic
signals, determine the driver's condition, etc. For example, Tesla is one of the best-known brands
bringing automated, self-driving cars to the market.
2. Social Media: Twitter deploys deep learning algorithms to enhance its product. Deep neural
networks access and analyse large amounts of data to learn user preferences over time. Instagram
uses deep learning to curb cyberbullying by removing abusive comments. Facebook uses deep
learning to recommend pages, friends, products, etc. Moreover, Facebook uses an ANN algorithm
for facial recognition that makes accurate tagging possible.
3. Image classification/machine vision: Facebook's suggestions for auto-tagging different persons in
a picture are a perfect example of machine vision. Deep nets take pictures from different angles
and then label each picture with the correct names. These deep learning models are now so
advanced that they can recognize different objects in a picture and even predict the occasion shown
in it. For example, a picture taken in a restaurant contains features such as tables, chairs, different
food items, knife, fork, glass, beer (and its brand), the mood of the people in the picture, etc. By
looking at the images posted by a person, such a model can detect that person's likings and
recommend similar things to buy or places to visit.
4. Speech Recognition: Speech is the most common method of communication in human society.
Just as a human recognizes speech, understands it, and responds accordingly, deep learning models
are enhancing the capabilities of computers so that they can understand and react to human speech.
In day-to-day life we have live examples like Apple's Siri, Amazon's Alexa, Google Home Mini,
etc. In speech there are many factors that need to be considered, such as language, accent, age,
gender, sound quality, etc. The goal is to recognize and respond to an unknown speaker from the
input of his/her sound signals.
5. Market Prediction: Deep learning models can predict buy and sell calls for traders. Depending on
the dataset the model has been trained on, it can be useful both for short-term trading and for
long-term investment based on the available features.
Difference between Machine Learning and Deep Learning:
• Machine Learning is a superset of Deep Learning, whereas Deep Learning is a subset of Machine Learning.
• A machine learning model generally takes less time to train, whereas a deep learning model takes more time to train.
Neuron
A neuron takes a group of weighted inputs, applies an activation function, and returns an output.
Inputs to a neuron can either be features from a training set or outputs from a previous layer’s
neurons. Weights are applied to the inputs as they travel along synapses to reach the neuron. The
neuron then applies an activation function to the “sum of weighted inputs” from each incoming
synapse and passes the result on to all the neurons in the next layer.
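A minimal sketch of this computation in Python (NumPy), assuming a sigmoid activation; the input, weight, and bias values are illustrative only:

    import numpy as np

    def sigmoid(z):
        # squashes the weighted sum into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def neuron_output(inputs, weights, bias):
        # "sum of weighted inputs" plus bias, passed through the activation function
        z = np.dot(weights, inputs) + bias
        return sigmoid(z)

    x = np.array([0.5, -1.2, 3.0])   # features or previous-layer outputs
    w = np.array([0.4, 0.1, -0.6])   # one weight per incoming synapse
    b = 0.2                          # bias term
    print(neuron_output(x, w, b))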
3. Define the basic terminologies used in Neural Network. (5 Marks)
Neural Network:
A neural network is a very powerful machine learning mechanism which basically mimics how a human
brain learns. The brain receives the stimulus from the outside world, does the processing on the input,
and then generates the output.
Terminologies:
1. Inputs: Inputs are the set of values for which we need to predict an output value. They can be
viewed as features or attributes in a dataset.
2. Weights: Weights are real values attached to each input/feature; they convey the importance of
that feature in predicting the final output (discussed in more detail later).
3. Bias: Bias is used for shifting the activation function towards the left or right; it can be compared
to the y-intercept in the equation of a line (discussed in more detail later).
4. Summation Function: The work of the summation function is to bind the weights and inputs
together and calculate their sum.
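A tiny Python illustration of how the summation function binds weights and inputs, and how the bias acts like the y-intercept of a line; all numbers are arbitrary:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])    # inputs (features)
    w = np.array([0.2, -0.5, 0.1])   # weights: importance of each feature
    b = 0.7                          # bias: shifts the result, like the y-intercept in y = mx + c

    z = np.sum(w * x) + b            # summation function: weighted sum plus bias
    print(z)                         # 0.2*1 - 0.5*2 + 0.1*3 + 0.7 = 0.2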
Single-layer perceptron – also called a Single-layer Feed Forward Neural Network
This is the simplest feedforward neural network and does not contain any hidden layer. It can only learn
linearly separable patterns. Input nodes are fully connected to a node or multiple nodes in the succeeding
layer, and nodes in the next layer take a weighted sum of their inputs. A single-layer perceptron is not
able to capture the nonlinearity or complexity of the data.
Multi-layer perceptron – also called a Multi-layer Feed Forward Neural Network
Multi-layer perceptrons have one or more hidden layers and use sophisticated training algorithms like
backpropagation. A multilayer perceptron (MLP) is a feed forward artificial neural network that
generates a set of outputs from a set of inputs; it consists of input, output, and hidden layers. Each hidden
layer is made up of numerous perceptrons known as hidden units. An MLP is characterized by several
layers of nodes connected as a directed graph between the input and output layers, and it uses
backpropagation for training the network. MLP is a deep learning method. Multiple hidden layers are
used to capture the nonlinearity of the data. This structure is also called a feed-forward network.
6. How a perceptron model works.
A perceptron is a neural network unit (an artificial neuron). Perceptron is a single layer neural network.
Perceptron is a linear classifier (binary). Also, it is used in supervised learning. It helps to classify the
given input data. Perceptron is usually used to classify the data into two parts. Therefore, it is also
known as a Linear Binary Classifier.
The perceptron consists of 4 parts: Input values (one input layer), Weights and Bias, Net sum and
Activation function. It works as follows:
a. Multiply each input value by its corresponding weight.
b. Add all the multiplied values and call them the Weighted Sum.
c. Apply the Weighted Sum to the Activation function.
The activation function is a mathematical function that normalizes the weighted sum into the output.
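A minimal sketch of these steps in Python, assuming a unit step activation function; the sample, weights, and bias are illustrative:

    import numpy as np

    def step(z):
        # unit step activation: 1 if the weighted sum crosses 0, else 0
        return 1 if z > 0 else 0

    def perceptron(inputs, weights, bias):
        # a. multiply inputs with weights, b. sum them up (Weighted Sum), c. apply activation
        weighted_sum = np.dot(weights, inputs) + bias
        return step(weighted_sum)

    x = np.array([1.0, 0.0])     # one input sample with two features
    w = np.array([0.6, 0.6])     # weights
    b = -0.5                     # bias
    print(perceptron(x, w, b))   # classifies the sample as 0 or 1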
The term "Artificial Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one another,
artificial neural networks also have neurons that are interconnected to one another in various layers of
the networks. These neurons are known as nodes.
A typical neural network contains a large number of artificial neurons which are termed units arranged
in a series of layers. Different kinds of layers available in an artificial neural network:
Fig. Architecture of ANN
• Input layer:
The Input layer contains those artificial neurons (termed units) which receive input from the outside
world. This is how the network obtains the data it will learn from, recognize, or otherwise process.
• Output layer:
The Output layer contains units that respond to the information fed into the system and indicate
whether the network has learned the given task or not.
• Hidden layer:
The hidden layers lie in between the input layer and the output layer. They perform all the
calculations needed to find hidden features and patterns.
Working of ANN:
• The Artificial Neural Network receives the input signal from the external world in the form of a
pattern or image, represented as a vector.
• Each of the inputs is then multiplied by its corresponding weight.
• All the weighted inputs are summed up inside the computing unit and bias is added to make the
output non-zero.
• And then the sum of weighted inputs is passed through the activation function.
• The activation function, in general, is a transfer function used to obtain the desired output. There
are various types of activation functions, mainly linear or non-linear. Some of the most commonly
used activation functions are the binary step, sigmoidal, and tan hyperbolic (tanh) activation
functions.
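A minimal sketch of this forward pass in Python (NumPy), assuming one hidden layer with sigmoid activations; the layer sizes and input values are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # input pattern/image represented as a vector
    x = np.array([0.8, 0.2, -0.5])

    # randomly initialized weights and biases (3 inputs -> 4 hidden units -> 1 output)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

    # weighted sum plus bias at each layer, passed through the activation function
    hidden = sigmoid(W1 @ x + b1)
    output = sigmoid(W2 @ hidden + b2)
    print(output)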
8. Explain the concept of Feed forward neural network and Backpropagation with its applications.
(8 Marks)
A feedforward neural network is a network in which information flows only in the forward direction,
from the input layer through any hidden layers to the output layer, with no cycles or loops. Feedforward
Artificial Neural Networks (ANNs) are widely used in machine learning and artificial intelligence. They
can be applied to many problems, including:
1. Pattern recognition: FFNNs are effective for recognizing patterns in data, such as images
or sound. They can be used for image recognition, speech recognition, or even music
genre classification.
2. Prediction: FFNNs can be used to predict future values based on past observations. This
can be applied to financial forecasting, weather prediction, or predicting customer
behaviour.
3. Control: FFNNs can be used for controlling processes, such as optimizing production
processes or controlling traffic signals.
4. Robotics: FFNNs can be used in robotics to control movement and navigation, or to
recognize objects and environments.
5. Natural Language Processing: FFNNs can be used for tasks such as sentiment analysis,
text classification, and language translation.
6. Time Series Analysis: FFNNs can be used to analyse time series data, such as financial
time series or sensor data from Internet of Things (IoT) devices.
7. Anomaly Detection: FFNNs can be used to detect anomalies in data, such as fraud
detection in financial transactions or intrusion detection in computer networks.
Backpropagation ANN
Fig. Backpropagation
• Backpropagation Artificial Neural Network (ANN) is a supervised learning algorithm used to train
a feedforward neural network. It is the most popular and widely used algorithm for training
artificial neural networks.
• Backpropagation is a learning algorithm that adjusts the weights and biases of the neural network
based on the error between the predicted output and the actual output.
• The algorithm works by propagating the error backwards through the network, from the output
layer to the input layer, and adjusting the weights and biases of each neuron along the way.
• The backpropagation algorithm consists of two phases: forward propagation and backward
propagation. During forward propagation, the input is fed into the neural network, and the network
calculates the output. During backward propagation, the error between the predicted output and
the actual output is calculated, and the weights and biases of each neuron are adjusted to reduce
the error.
• The backpropagation algorithm uses the gradient descent optimization method to update the
weights and biases of the network. The gradient descent method calculates the gradient of the error
function with respect to the weights and biases, and then updates the weights and biases in the
direction of the negative gradient, which reduces the error.
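A minimal sketch of forward propagation, backward propagation, and the gradient descent update in Python (NumPy), using a single sigmoid neuron on a made-up dataset; the data, learning rate, and number of epochs are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # toy dataset: 4 samples with 2 features each, and their target outputs
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 0., 0., 1.])

    w, b, lr = np.zeros(2), 0.0, 0.5           # weights, bias, learning rate

    for epoch in range(1000):
        y_hat = sigmoid(X @ w + b)             # forward propagation: predicted output
        error = y_hat - y                      # error between predicted and actual output
        grad_z = error * y_hat * (1 - y_hat)   # backward propagation through the sigmoid
        grad_w = X.T @ grad_z / len(y)         # gradient of the error w.r.t. the weights
        grad_b = grad_z.mean()                 # gradient of the error w.r.t. the bias
        w -= lr * grad_w                       # gradient descent: step along the negative gradient
        b -= lr * grad_b

    print(np.round(sigmoid(X @ w + b), 2))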
Backpropagation ANN can be used for a variety of applications, including image classification, speech
recognition, natural language processing, and time series prediction.
Backpropagation Artificial Neural Networks (ANNs) have been used in a variety of applications. Here
are some common use cases of backpropagation ANNs:
• Image Recognition: Backpropagation ANNs have been used for image recognition tasks, such as
recognizing handwritten digits, classifying images, and detecting objects in images.
• Speech Recognition: Backpropagation ANNs have been used for speech recognition tasks, such
as transcribing speech to text and identifying spoken words.
• Natural Language Processing (NLP): Backpropagation ANNs have been used for NLP tasks,
such as sentiment analysis, machine translation, and text classification.
• Time Series Analysis: Backpropagation ANNs have been used for time series analysis tasks, such
as stock price prediction, weather forecasting, and predicting trends in financial data.
• Robotics: Backpropagation ANNs have been used in robotics to control movement and navigation,
or to recognize objects and environments.
• Anomaly Detection: Backpropagation ANNs have been used to detect anomalies in data, such as
fraud detection in financial transactions or intrusion detection in computer networks.
Activation Functions:
1. Sigmoid:
The main problem we face is saturated gradients: since the function's output ranges between 0 and 1,
the output values can remain almost constant for large positive or negative inputs, so the gradients
become very small and there is almost no change during gradient descent.
2. Tanh:
When the inputs to a layer are always positive (as with sigmoid outputs), the gradients we obtain for
that layer's weights will either be all positive or all negative, which can make updates unstable; tanh,
being zero-centered, avoids this, so the usage of tanh can be a good thing. But it still faces the problem
of saturated gradients.
3. ReLU:
ReLU is the most commonly used activation function because of its simplicity during backpropagation
and because it is not computationally expensive. It has the following properties:
(a.) It doesn't saturate (for positive inputs).
(b.) It converges faster than some other activation functions.
But we can face the issue of a dead ReLU. For example, if w > 0 and x < 0, then ReLU(w*x) = 0
always, so no gradient flows through that unit.
4. Leaky ReLU:
Leaky ReLU is a variation of ReLU that, instead of returning 0 for x < 0, returns a small negative
slope (for example 0.01*x). This keeps a small gradient flowing for negative inputs, so units do not die.
5. ELU (Exponential Linear Unit):
ELU is also a variation of ReLU, with a better (smooth, exponential) value for x < 0. It also has the
same properties as ReLU along with:
(a.) No Dead ReLU Situation.
(b.) Closer to Zero mean Outputs than Leaky ReLU.
(c.) More Computation because of Exponential Function.
6. Maxout:
Maxout was introduced in 2013. It has the property of linearity in it, so it never saturates or dies, but
it is expensive as it doubles the number of parameters.
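A short sketch of these activation functions in Python (NumPy); the slope used for Leaky ReLU and the alpha used for ELU are common defaults, not values fixed by these notes:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))               # saturates towards 0 and 1

    def tanh(x):
        return np.tanh(x)                              # zero-centered, but still saturates

    def relu(x):
        return np.maximum(0.0, x)                      # 0 for x < 0 (possible dead units)

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)           # small slope keeps the gradient alive

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # smooth curve for x < 0

    xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for f in (sigmoid, tanh, relu, leaky_relu, elu):
        print(f.__name__, np.round(f(xs), 3))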
10. What is optimization and optimizers? List various optimization algorithms. (4 Marks)
Optimization: In deep learning, we have the concept of loss, which tells us how poorly the model is
performing at that current instant. Now we need to use this loss to train our network such that it
performs better. Essentially what we need to do is to take the loss and try to minimize it, because a
lower loss means our model is going to perform better. The process of minimizing (or maximizing) any
mathematical expression is called optimization.
Optimizers: Optimizers are algorithms or methods used to change the attributes of the neural network
such as weights and learning rate to reduce the losses. Optimizers are used to solve optimization
problems by minimizing the function.
Optimization algorithms:
Following are the different types of optimizers and how they work to minimize the loss function:
• Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• SGD with Momentum
• Adagrad (Adaptive Gradient)
• AdaDelta
• RMSprop
• Adam
Gradient Descent:
Gradient descent is the most basic optimizer: it calculates the gradient of the loss function over the
whole training dataset and updates the weights in the direction of the negative gradient,
w = w - η (∂L/∂w), where η is the learning rate.
Advantages:
• Easy computation.
• Easy to implement.
• Easy to understand.
Disadvantages:
• May trap at local minima.
• Weights are changed after calculating the gradient on the whole dataset.
• So, if the dataset is too large then this may take years to converge to the minima.
• Requires large memory to calculate the gradient on the whole dataset.
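A minimal sketch of the gradient descent update in Python, minimizing the made-up loss L(w) = (w - 3)^2; the learning rate and iteration count are illustrative:

    def grad(w):
        # derivative of L(w) = (w - 3)^2, whose minimum is at w = 3
        return 2.0 * (w - 3.0)

    w, lr = 0.0, 0.1                 # initial weight and learning rate

    for step in range(100):
        w = w - lr * grad(w)         # move in the direction of the negative gradient

    print(w)                         # converges close to 3.0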
12. What is the problem of vanishing gradient? Describe various solutions to this problem.
( 8 Marks)
• The sigmoid function is one of the most popular activation functions used for developing deep
neural networks. However, its use restricted the training of deep neural networks because it caused
the vanishing gradient problem.
• Sigmoid functions are used frequently in neural networks to activate neurons. The sigmoid is a
logistic function with a characteristic S shape. The output value of the function is between 0 and 1.
The sigmoid function is used for activating the output layer in binary classification problems. It is
calculated as follows:
σ(x) = 1 / (1 + e^(-x))
• Comparing the sigmoid function with its derivative: the range of the sigmoid function is 0 to 1,
while its first derivative, σ'(x) = σ(x)(1 - σ(x)), is a bell-shaped curve with values ranging from
0 to 0.25.
• Now, how neural networks perform forward and backpropagation is essential to understanding
the vanishing gradient problem.
• Forward Propagation: The basic structure of a neural network is an input layer, one or more
hidden layers, and a single output layer. The weights of the network are randomly initialized
during forward propagation. The input features are multiplied by the corresponding weights at
each node of the hidden layer, and a bias is added to the net sum at each node. This value is
then transformed into the output of the node using an activation function. To generate the output
of the neural network, the hidden layer output is multiplied by the weights plus bias values, and
the total is transformed using another activation function. This will be the predicted value of
the neural network for a given input value.
• Backward Propagation: As the network generates an output, the loss function(C) indicates
how well it predicted the output. The network performs back propagation to minimize the loss.
A back propagation method minimizes the loss function by adjusting the weights and biases of
the neural network. In this method, the gradient of the loss function is calculated with respect
to each weight in the network.
• In back propagation, the new weight (wnew) of a node is calculated using the old weight (wold)
and the product of the learning rate (η) and the gradient of the loss function:
wnew = wold - η (∂C/∂wold)
• ReLU: The vanishing gradient problem is caused by the derivative of the activation function
used to create the neural network. The simplest solution to the problem is to replace the
activation function of the network. Instead of sigmoid, use an activation function such as
ReLU.
Rectified Linear Units (ReLU) are activation functions that generate a positive linear output
when they are applied to positive input values. If the input is negative, the function will return
zero.
The derivative of a ReLU function is defined as 1 for inputs that are greater than zero and 0 for
inputs that are negative.
If the ReLU function is used for activation in a neural network in place of the sigmoid function,
the partial derivative of the activation that enters the gradient takes the value 0 or 1, which
prevents the gradient from vanishing.
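A small numerical sketch in Python of why this helps: the backpropagated gradient is roughly a product of per-layer activation derivatives, so with sigmoid (derivative at most 0.25) it shrinks rapidly with depth, while with ReLU (derivative 1 for active units) it does not. The 10-layer depth is illustrative:

    layers = 10
    sigmoid_deriv_max = 0.25      # maximum value of the sigmoid derivative
    relu_deriv_active = 1.0       # ReLU derivative for inputs greater than zero

    # the gradient reaching the early layers scales roughly like these products
    print("sigmoid:", sigmoid_deriv_max ** layers)   # about 1e-6, i.e. it vanishes
    print("relu:   ", relu_deriv_active ** layers)   # stays at 1.0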
• Leaky ReLU: The problem with the use of ReLU is when the gradient has a value of 0. In such
cases, the node is considered as a dead node since the old and new values of the weights remain
the same. This situation can be avoided by the use of a leaky ReLU function which prevents
the gradient from falling to the zero value.
• Weight initialization: Another technique to avoid the vanishing gradient problem is weight
initialization. This is the process of assigning initial values to the weights in the neural network
so that during back propagation, the weights never vanish.
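A short sketch of one common weight-initialization scheme (Xavier/Glorot) in Python (NumPy); the notes do not prescribe a particular scheme, so this is only an illustration:

    import numpy as np

    def xavier_init(n_in, n_out, rng=np.random.default_rng(0)):
        # scale the random weights by the layer's fan-in and fan-out so that
        # activations and gradients keep a reasonable magnitude across layers
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_out, n_in))

    W1 = xavier_init(784, 128)    # e.g. input layer -> hidden layer
    W2 = xavier_init(128, 10)     # hidden layer -> output layer
    print(W1.shape, round(W1.std(), 3))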
Hyperparameters of a neural network are variables that determine the network’s architecture and
behaviour during training. They include the number of layers, the number of nodes in each layer, the
activation functions, learning rate, batch size, regularization parameters, dropout rate, optimizer choice,
and weight initialization methods. Tuning these hyperparameters is crucial for optimizing the neural
network’s performance.
Hyperparameter tuning in neural networks refers to the process of finding the optimal combination of
hyperparameters to maximize the performance and effectiveness of the network. It involves
systematically exploring different values or ranges of hyperparameters, training and evaluating the
network for each configuration, and selecting the set of hyperparameters that yield the best performance
on a validation set.
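A minimal sketch of hyperparameter tuning via grid search in Python; train_and_validate is a hypothetical placeholder standing in for whatever routine trains the network and scores it on the validation set, and the hyperparameter values are illustrative:

    import itertools

    def train_and_validate(lr, batch_size):
        # hypothetical stand-in: in practice this would train the network with these
        # hyperparameters and return its accuracy on a validation set
        return 1.0 / (1.0 + abs(lr - 0.01)) + 1.0 / batch_size

    learning_rates = [0.1, 0.01, 0.001]
    batch_sizes = [16, 32, 64]

    best_score, best_config = None, None
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        score = train_and_validate(lr, bs)
        if best_score is None or score > best_score:
            best_score, best_config = score, (lr, bs)

    print("best hyperparameters (lr, batch size):", best_config)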
One of the major aspects of training a machine learning model is avoiding overfitting. An overfitted
model will have high accuracy on the training data but low accuracy on new data. This happens because
the model is trying too hard to capture the noise in the training dataset. Ridge and Lasso regression are
some of the simple techniques to reduce model complexity and prevent over-fitting which may result
from simple linear regression.
The key difference is below.
o Ridge regression (L2 regularization) puts a constraint on the coefficients (w). The penalty
term (lambda) regularizes the coefficients such that if the coefficients take large values,
the optimization function is penalized. So ridge regression shrinks the coefficients, which
helps to reduce the model complexity and multi-collinearity.
o Lasso regression (L1 regularization) differs only in that instead of taking the square of the
coefficients, their magnitudes (absolute values) are taken into account. This type of
regularization can lead to zero coefficients, i.e. some of the features are completely
neglected for the evaluation of the output. So Lasso regression not only helps in reducing
over-fitting but can also help us in feature selection.
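A small sketch of the two penalty terms in Python (NumPy), added to a plain squared-error loss; lambda and the example values are illustrative:

    import numpy as np

    w = np.array([0.5, -2.0, 0.0, 1.5])       # model coefficients
    residuals = np.array([0.1, -0.3, 0.2])    # (prediction - target) errors
    lam = 0.1                                 # regularization strength (lambda)

    mse = np.mean(residuals ** 2)
    ridge_loss = mse + lam * np.sum(w ** 2)        # L2: penalizes squared coefficients
    lasso_loss = mse + lam * np.sum(np.abs(w))     # L1: penalizes coefficient magnitudes
    print(ridge_loss, lasso_loss)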
• “Dropout” in machine learning refers to the process of randomly ignoring certain nodes in a layer
during training.
• This means the contribution of the dropped neurons is temporarily removed, and they do not have
an impact on the model's performance.
• In a typical illustration of dropout, the neural network on the left represents a normal network where
all units are activated, while on the right the dropped-out units have been removed from the model.
• The values of their weights and biases are not considered during training.
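A minimal sketch of (inverted) dropout applied to one layer's activations in Python (NumPy); the dropout rate and activation values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, rate=0.5, training=True):
        if not training:
            return activations                        # dropout is applied only during training
        mask = rng.random(activations.shape) >= rate  # randomly ignore a fraction of the nodes
        return activations * mask / (1.0 - rate)      # rescale the surviving activations

    h = np.array([0.2, 0.9, 0.4, 0.7, 0.1])           # hidden-layer outputs
    print(dropout(h, rate=0.4))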