DL Unit1 Notes

Deep Learning (DL) is a subset of Machine Learning (ML) that uses neural networks with multiple layers to learn from large datasets, enabling applications like self-driving cars, social media enhancements, and speech recognition. It differs from traditional ML by utilizing deep architectures and automatically learning features from data. Key components of neural networks include inputs, weights, biases, and activation functions, with various types of networks such as perceptrons and multi-layer perceptrons serving different functions.

1. What is Deep Learning? Explain its applications.

(5 Marks)
Deep Learning (DL) is a subset of Machine Learning (ML) that allows us to train a model using a set
of inputs and then predict output.
It is essentially a neural network with three or more layers.
These neural networks attempt to simulate the behavior of the human brain allowing it to “learn” from
large amounts of data.
Deep learning drives many artificial intelligence (AI) applications and services that improve
automation, performing analytical and physical tasks without human intervention. Deep learning
technology lies behind everyday products and services (such as digital assistants, voice-enabled TV
remotes, and credit card fraud detection) as well as emerging technologies (such as self-driving cars).

Applications:

1. Self-driving Cars: Self-driving cars, one of the most fascinating technologies, are designed at a high level using deep neural networks. These cars use learning algorithms to detect objects around the car, estimate the distance between the car and other vehicles, locate the footpath, identify traffic signals, determine the driver's condition, etc. Tesla, for example, is a well-known brand bringing automated, self-driving cars to the market.
2. Social Media: Twitter deploys deep learning algorithms to enhance its product, analysing large amounts of data with deep neural networks to learn user preferences over time. Instagram uses deep learning to counter cyberbullying and erase abusive comments. Facebook uses deep learning to recommend pages, friends, products, etc. Moreover, Facebook uses an ANN-based facial recognition algorithm that makes accurate photo tagging possible.
3. Image classification/machine vision: Facebook suggesting auto-tags for different people in a picture is a good example of machine vision. It uses deep nets, takes pictures from different angles, and labels the faces in the picture. These deep learning models are now so advanced that they can recognize different objects in a picture and even predict the occasion it depicts. For example, a picture taken in a restaurant contains features such as tables, chairs, different food items, knives, forks, glasses, the brand of beer, the mood of the people, etc. By looking at the images a person posts, a system can infer that person's likings and recommend similar things to buy or places to visit.
4. Speech Recognition: Speech is the most common method of communication in human society. Just as a human recognizes speech, understands it, and responds accordingly, deep learning models are enhancing the capabilities of computers so that they can understand and respond to human speech. Everyday examples include Apple's Siri, Amazon's Alexa, and Google Home Mini. Speech involves many factors that need to be considered, such as language, accent, age, gender, and sound quality. The goal is to recognize and respond to an unknown speaker from the input of his or her sound signals.
5. Market Prediction: Deep learning models can predict buy and sell calls for traders. Depending on the dataset the model has been trained on, they are useful both for short-term trading and for long-term investment based on the available features.

2. Differentiate between Machine Learning and Deep Learning (4 Marks)


Sr. No. | Machine Learning | Deep Learning
1 | Machine Learning is a superset of Deep Learning. | Deep Learning is a subset of Machine Learning.
2 | Data representation in Machine Learning is quite different from Deep Learning, as ML typically works on structured data. | Data representation in Deep Learning is quite different, as it uses artificial neural networks (ANNs).
3 | ML algorithms use shallow architectures. Shallow learning (also called traditional machine learning) refers to algorithms that typically involve a single layer of data transformation and learning. | DL algorithms use deep architectures. An architecture with many layers is called deep; an architecture with one hidden layer is called shallow.
4 | It uses hand-crafted features (features manually engineered by the data scientist). | It learns features from the data.
5 | ML algorithms require manual feature engineering. | DL algorithms can learn features automatically.
6 | It mostly relies on simpler (often linear) models. | It uses linear as well as non-linear models.
7 | ML algorithms are typically used for supervised learning, where the model is trained to associate input features with specific output labels drawn from labeled training data. | DL algorithms can also be used for unsupervised learning, where the model is given unlabeled data and discovers patterns and insights without explicit guidance.
8 | It can achieve good performance with less data and can work on a CPU, requiring less computing power than deep learning. | It requires more data to achieve good performance and typically needs a high-performance computer with a GPU.
9 | It takes less time to train the model. | It takes more time to train the model.
Neuron
A neuron takes a group of weighted inputs, applies an activation function, and returns an output.

Inputs to a neuron can either be features from a training set or outputs from a previous layer’s
neurons. Weights are applied to the inputs as they travel along synapses to reach the neuron. The
neuron then applies an activation function to the “sum of weighted inputs” from each incoming
synapse and passes the result on to all the neurons in the next layer.
3. Define the basic terminologies used in Neural Network. (5 Marks)

Neural Network:

Fig. A Neural Network

A neural network is a very powerful machine learning mechanism which basically mimics how a human
brain learns. The brain receives the stimulus from the outside world, does the processing on the input,
and then generates the output.

Terminologies:

1. Inputs: Inputs are the set of values for which we need to predict an output value. They can be viewed as features or attributes in a dataset.

2. Weights: Weights are the real values attached to each input/feature; they convey the importance of that feature in predicting the final output.

3. Bias: Bias is used to shift the activation function to the left or right; it is comparable to the y-intercept in the equation of a line.

4. Summation Function: The work of the summation function is to bind the weights and inputs
together and calculate their sum.

5. Activation Function: It is used to introduce non-linearity in the model.
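
A minimal NumPy sketch (with illustrative values, not taken from the notes) showing how these terms fit together: inputs and weights feed the summation function, the bias shifts the result, and an activation function produces the neuron's output.

```python
import numpy as np

def sigmoid(z):
    # Activation function: introduces non-linearity, squashing the sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.0, -0.3])   # inputs: feature values for one example
w = np.array([0.8, -0.2, 0.4])   # weights: importance of each input feature
b = 0.1                          # bias: shifts the activation function left or right

z = np.dot(w, x) + b             # summation function: weighted sum of inputs plus bias
output = sigmoid(z)              # activation function applied to the sum
print(output)
```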

4. How the neural network is classified in deep learning. (4 Marks)


There are many different types of neural networks, and they help us in a variety of everyday tasks, from recommending movies or music to helping us buy groceries online. They can be classified depending on their structure, data flow, the neurons used and their density, the number of layers and their depth, activation filters, etc. Similar to the way airplanes were inspired by birds, neural networks (NNs) are inspired by biological neural networks. Let's look at the various types of neural networks used in deep learning:

1. Perceptron (Single-layer Perceptron)


2. Multi-layer Perceptron
3. Artificial Neural Network
4. Convolutional Neural Networks
5. Recurrent Neural Networks
6. Long Short-Term Memory Networks
5. With suitable diagram explain the types of perceptrons. (6 Marks)
Or
Draw and explain architecture of single and multi-layer feed forward neural network.
A perceptron is a neural network unit (an artificial neuron) that does certain computations to detect features or business intelligence in the input data. There are two types of perceptrons:

Single-layer perceptrons – also called Single-layer Feed-Forward Neural Networks
This is the simplest feedforward neural network and does not contain any hidden layer. It can only learn linearly separable patterns. Input nodes are fully connected to a node or multiple nodes in the succeeding layer, and nodes in the next layer take a weighted sum of their inputs. A single-layer perceptron is not able to capture the nonlinearity or complexity of the data.

Fig. Single Layer Perceptron


Multilayer perceptron’s – also called as Multilayer Feed Forward Neural Network
MLP contains one or more hidden layers (apart from one input and one output layer). These have the
greatest processing power. They are a class of feedforward neural networks. It is a neural network that
connects multiple layers in a directed graph. This means that the signal path through the nodes only
goes one way.

Fig. Multi- Layer Perceptron

They have hidden layers and use sophisticated training algorithms such as backpropagation. A multilayer perceptron generates a set of outputs from a set of inputs and consists of an input layer, one or more hidden layers, and an output layer. Each hidden layer is made up of numerous perceptrons, known as hidden units. An MLP is characterized by several layers of nodes connected as a directed graph between the input and output layers, and it uses backpropagation for training the network. MLP is a deep learning method: multiple hidden layers are used to capture the nonlinearity of the data. This structure is also called a feed-forward network.
6. How does a perceptron model work?
A perceptron is a neural network unit (an artificial neuron). Perceptron is a single layer neural network.
Perceptron is a linear classifier (binary). Also, it is used in supervised learning. It helps to classify the
given input data. Perceptron is usually used to classify the data into two parts. Therefore, it is also
known as a Linear Binary Classifier.

Fig. Working of Perceptron

The perceptron consists of 4 parts: Input values or One input layer, Weights and Bias, Net sum and
Activation function.

The perceptron works in these simple steps:

a. All the inputs x are multiplied by their weights w.

b. Add all the multiplied values; the result is called the weighted sum.

c. Apply an activation function to that weighted sum.

The activation function is a mathematical function that maps the weighted sum to the required output range.
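
A minimal sketch of these steps for a classic perceptron (a step/threshold activation and illustrative weights implementing a logical AND are assumed here).

```python
import numpy as np

def step(z):
    # Step activation: output 1 if the weighted sum reaches the threshold, else 0
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    z = np.dot(w, x) + b   # (a) multiply inputs by weights, (b) add them up (plus bias)
    return step(z)         # (c) apply the activation function to the weighted sum

w = np.array([1.0, 1.0])   # illustrative weights for a logical AND of two binary inputs
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x), w, b))
```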

7. Explain working of artificial neuron. (4 Marks)


Or
Explain Neural Network Architecture (ANN)

The term "Artificial Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one another,
artificial neural networks also have neurons that are interconnected to one another in various layers of
the networks. These neurons are known as nodes.
A typical neural network contains a large number of artificial neurons which are termed units arranged
in a series of layers. Different kinds of layers available in an artificial neural network:
Fig. Architecture of ANN

• Input layer:
The input layer contains the artificial neurons (termed units) that receive input from the outside world. This is the data on which the network learns or which it has to recognize and process.
• Output layer:
The output layer contains units that produce the network's response to the information fed into the system and reflect what the network has learned for the task.
• Hidden layer:

The hidden layers sit between the input layer and the output layer. They perform all the calculations needed to find hidden features and patterns.

Working of ANN:

Fig. Working of Artificial Neuron

• The Artificial Neural Network receives an input signal from the external world in the form of a pattern or image, represented as a vector.
• Each input is then multiplied by its corresponding weight.
• All the weighted inputs are summed up inside the computing unit, and a bias is added to make the output non-zero.
• The sum of weighted inputs is then passed through the activation function.
• The activation function is, in general, a transfer function used to obtain the desired form of output. There are various types of activation functions, mainly linear and non-linear ones. Some of the most commonly used activation functions are the binary (step), sigmoid, and hyperbolic tangent (tanh) functions.
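
A small vectorized sketch of these steps for one layer of an ANN (NumPy; the sizes and values are illustrative): the input vector is multiplied by the weights, the bias is added, and the sum is passed through an activation function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.7, 0.1])            # input signal as a vector (e.g. a flattened pattern)
W = np.array([[0.5, -0.6,  0.1],          # one row of weights per neuron in the layer
              [0.3,  0.8, -0.5]])
b = np.array([0.1, -0.2])                 # bias keeps the output from being forced to zero

z = W @ x + b                             # sum of weighted inputs plus bias
a = sigmoid(z)                            # activation function gives the layer's output
print(a)
```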

8. Explain the concept of Feed forward neural network and Backpropagation with its applications.
(8 Marks)

Feedforward Artificial Neural Network:


• A feedforward Artificial Neural Network (ANN) is a type of neural network in which the
information flows in only one direction, from the input layer, through one or more hidden layers,
to the output layer, without any feedback loops. It is called "feedforward" because the data only
moves forward through the network and never loops back on itself.
• In a feedforward ANN, the input layer receives input data, and each neuron in the input layer is
connected to the neurons in the next layer, the hidden layer. The hidden layer can contain one or
more layers, and each neuron in the hidden layer is connected to the neurons in the output layer.
The output layer produces the output for a given input.
• During training, the weights of the connections between the neurons are adjusted using
backpropagation to minimize the error between the predicted output and the actual output. Once
the network is trained, it can be used to make predictions on new data.
• Feedforward ANNs are often used for classification or regression problems and have been
successfully applied in a wide range of applications, including speech recognition, image
classification, and natural language processing.

Use Cases of Feedforward ANN

Feedforward Artificial Neural Networks (ANNs) are widely used in machine learning and artificial
intelligence. They can be applied to many problems, including:

1. Pattern recognition: FFNNs are effective for recognizing patterns in data, such as images
or sound. They can be used for image recognition, speech recognition, or even music
genre classification.
2. Prediction: FFNNs can be used to predict future values based on past observations. This
can be applied to financial forecasting, weather prediction, or predicting customer
behaviour.
3. Control: FFNNs can be used for controlling processes, such as optimizing production
processes or controlling traffic signals.
4. Robotics: FFNNs can be used in robotics to control movement and navigation, or to
recognize objects and environments.
5. Natural Language Processing: FFNNs can be used for tasks such as sentiment analysis,
text classification, and language translation.
6. Time Series Analysis: FFNNs can be used to analyse time series data, such as financial
time series or sensor data from Internet of Things (IoT) devices.
7. Anomaly Detection: FFNNs can be used to detect anomalies in data, such as fraud
detection in financial transactions or intrusion detection in computer networks.

Backpropagation ANN

Fig. Backpropagation
• Backpropagation Artificial Neural Network (ANN) is a supervised learning algorithm used to train
a feedforward neural network. It is the most popular and widely used algorithm for training
artificial neural networks.
• Backpropagation is a learning algorithm that adjusts the weights and biases of the neural network
based on the error between the predicted output and the actual output.
• The algorithm works by propagating the error backwards through the network, from the output
layer to the input layer, and adjusting the weights and biases of each neuron along the way.
• The backpropagation algorithm consists of two phases: forward propagation and backward
propagation. During forward propagation, the input is fed into the neural network, and the network
calculates the output. During backward propagation, the error between the predicted output and
the actual output is calculated, and the weights and biases of each neuron are adjusted to reduce
the error.
• The backpropagation algorithm uses the gradient descent optimization method to update the
weights and biases of the network. The gradient descent method calculates the gradient of the error
function with respect to the weights and biases, and then updates the weights and biases in the
direction of the negative gradient, which reduces the error.

Backpropagation ANN can be used for a variety of applications, including image classification, speech
recognition, natural language processing, and time series prediction.
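
A minimal sketch of the two phases for a tiny network with one hidden layer (NumPy, sigmoid activations, squared-error loss; the sizes, data, and learning rate are illustrative, and this is a single update step rather than a full training loop).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2])                       # one training example
y = np.array([1.0])                             # its target output
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden -> output weights and biases
lr = 0.1                                        # learning rate

# Forward propagation: compute the prediction layer by layer
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward propagation: propagate the error back and compute gradients (chain rule)
delta2 = (y_hat - y) * y_hat * (1 - y_hat)      # error signal at the output layer
grad_W2 = np.outer(delta2, h)
delta1 = (W2.T @ delta2) * h * (1 - h)          # error signal at the hidden layer
grad_W1 = np.outer(delta1, x)

# Gradient descent update: move weights and biases against the gradient to reduce the error
W2 -= lr * grad_W2; b2 -= lr * delta2
W1 -= lr * grad_W1; b1 -= lr * delta1
print("loss before update:", loss)
```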

Use Cases of Backpropagation ANN

Backpropagation Artificial Neural Networks (ANNs) have been used in a variety of applications. Here
are some common use cases of backpropagation ANNs:

• Image Recognition: Backpropagation ANNs have been used for image recognition tasks, such as
recognizing handwritten digits, classifying images, and detecting objects in images.
• Speech Recognition: Backpropagation ANNs have been used for speech recognition tasks, such
as transcribing speech to text and identifying spoken words.
• Natural Language Processing (NLP): Backpropagation ANNs have been used for NLP tasks,
such as sentiment analysis, machine translation, and text classification.
• Time Series Analysis: Backpropagation ANNs have been used for time series analysis tasks, such
as stock price prediction, weather forecasting, and predicting trends in financial data.
• Robotics: Backpropagation ANNs have been used in robotics to control movement and navigation,
or to recognize objects and environments.
• Anomaly Detection: Backpropagation ANNs have been used to detect anomalies in data, such as
fraud detection in financial transactions or intrusion detection in computer networks.

9. Explain Different activation functions in deep learning. (Any Two) (6 Marks)


There are various kinds of activation functions, and researchers are still working on finding better functions that can help networks converge faster or use fewer layers.
Fig. Activation Functions

1. Sigmoid Activation Function:

(a.) Range from [0,1].


(b.) Not Zero Centered.
(c.) Have Exponential Operation (It's Computationally Expensive!!!)

The main problem we face is saturated gradients: since the function squashes values into the range 0 to 1, its output barely changes for large positive or negative inputs, so the gradients become very small and there is almost no change during gradient descent.

2. Tanh: Hyperbolic Tangent Activation Function

Hyperbolic tangent has the following properties:

(a.) Ranges between [-1, 1]
(b.) Zero centered

Because it is zero centered, tanh avoids a problem sigmoid has: since sigmoid outputs are always positive, the gradients on a neuron's weights end up all positive or all negative, which makes the weight updates inefficient. However, tanh still faces the problem of saturated gradients.

3. ReLU: Rectified Linear Unit Activation Function

ReLU is one of the most commonly used activation functions because of its simplicity during backpropagation and because it is not computationally expensive. It has the following properties:
(a.) It doesn't saturate (for positive inputs).
(b.) It converges faster than some other activation functions.
But we can face the issue of a dead ReLU. For example, if w > 0 and x < 0, then ReLU(w*x) = 0 always, so the neuron outputs zero and its weights stop being updated.

4. Leaky ReLU:

• Leaky ReLU can be used as an improvement over ReLU Activation function.


• It has all properties of ReLU, plus it will never have a dead ReLU problem.

5. ELU (Exponential Linear Units):

ELU is also a variation of ReLU, with a better value for x<0. It also has the same properties
as ReLU along with:
(a.) No Dead ReLU Situation.
(b.) Closer to Zero mean Outputs than Leaky ReLU.
(c.) More Computation because of Exponential Function.

6. Maxout:

Maxout was introduced in 2013. It is piecewise linear, so it never saturates or dies, but it is expensive as it doubles the number of parameters.
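
Minimal NumPy definitions of the activation functions discussed above (the alpha values are illustrative defaults, not prescribed by the notes).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                         # range (0, 1), not zero centered

def tanh(x):
    return np.tanh(x)                                        # range (-1, 1), zero centered

def relu(x):
    return np.maximum(0.0, x)                                # 0 for negative inputs, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)                     # small negative slope avoids dead ReLU

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))     # smooth, near-zero-mean negative branch

x = np.array([-2.0, -0.5, 0.0, 1.5])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(x))
```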

10. What is optimization and optimizers? List various optimization algorithms. (4 Marks)
Optimization: In deep learning, we have the concept of loss, which tells us how poorly the model is
performing at that current instant. Now we need to use this loss to train our network such that it
performs better. Essentially what we need to do is to take the loss and try to minimize it, because a
lower loss means our model is going to perform better. The process of minimizing (or maximizing) any
mathematical expression is called optimization.

Optimizers: Optimizers are algorithms or methods used to change the attributes of the neural network
such as weights and learning rate to reduce the losses. Optimizers are used to solve optimization
problems by minimizing the function.

Optimization algorithms:
The following are the different types of optimizers used to minimize the loss function.
• Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• SGD with Momentum
• Adagrad (Adaptive Gradient)
• AdaDelta
• RMSprop
• Adam

11. Explain the concept of Gradient based learning. (6 Marks)

Gradient Descent:

• Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable


function. Gradient descent is simply used to find the values of a function's parameters
(coefficients) that minimize a cost function as far as possible.
• "A gradient measures how much the output of a function changes if you change the inputs a little bit."
• Why GD?: A neural network takes several inputs, processes them through multiple neurons in multiple hidden layers, and returns the result using an output layer. This result-estimation process is technically known as "Forward Propagation".
• Next, we compare the result with the actual output. Each neuron contributes some error to the final output.
• We try to adjust the weights of the neurons that contribute more to the error, and this happens while travelling back through the neurons of the network to find where the error lies. This process is known as "Backward Propagation".
• To minimize the error in as few iterations as possible, neural networks use a common algorithm known as "Gradient Descent", which helps optimize the task quickly and efficiently.
• Learning rate: the size of the steps gradient descent takes towards the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.
For gradient descent to reach the local minimum, we must set the learning rate to an appropriate value, which is neither too low nor too high. This is important because if the steps it takes are too big, it may never reach the local minimum and instead bounce back and forth across the convex cost function (see the figure below). If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it may take a long time.

Fig. Learning Rate to reach local minima


So, the learning rate should never be too high or too low for this reason.
• The equation below computes the gradient of the cost function J(θ) w.r.t. the parameters/weights θ for the entire training dataset and updates the weights by stepping against that gradient, with learning rate η:

θ = θ − η · ∇θ J(θ)

Advantages:
• Easy computation.
• Easy to implement.
• Easy to understand.

Disadvantages:
• May trap at local minima.
• Weights are changed after calculating the gradient on the whole dataset.
• So, if the dataset is too large then this may take years to converge to the minima.
• Requires large memory to calculate the gradient on the whole dataset.
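
A minimal sketch of gradient descent applying the update rule θ = θ − η · ∇J(θ) to a simple convex cost (the cost function and learning rate are illustrative).

```python
def cost(theta):
    # Illustrative convex cost: J(theta) = (theta - 3)^2, minimum at theta = 3
    return (theta - 3.0) ** 2

def gradient(theta):
    # dJ/dtheta = 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter value
lr = 0.1      # learning rate: neither too high nor too low
for step in range(50):
    theta = theta - lr * gradient(theta)   # step in the direction of the negative gradient

print(theta, cost(theta))   # theta approaches the minimum at 3.0
```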

12. What is the problem of vanishing gradient? Describe various solutions to this problem.
( 8 Marks)
• The sigmoid function is one of the most popular activation functions used for developing deep neural networks. The use of the sigmoid function restricted the training of deep neural networks because it caused the vanishing gradient problem.
• Sigmoid functions are used frequently in neural networks to activate neurons. The sigmoid is a logistic function with a characteristic S shape, and its output value lies between 0 and 1. It is often used to activate the output layer in binary classification problems. It is calculated as follows:

σ(x) = 1 / (1 + e^(−x))
• On the graph below you can see a comparison between the sigmoid function itself and its
derivative. The range of sigmoid function is 0 to 1 and first derivatives of sigmoid functions
are bell curves with values ranging from 0 to 0.25.

Fig. sigmoid function and sigmoid derivative range

• Understanding how neural networks perform forward and backward propagation is essential to understanding the vanishing gradient problem.
• Forward Propagation: The basic structure of a neural network is an input layer, one or more
hidden layers, and a single output layer. The weights of the network are randomly initialized
during forward propagation. The input features are multiplied by the corresponding weights at
each node of the hidden layer, and a bias is added to the net sum at each node. This value is
then transformed into the output of the node using an activation function. To generate the output
of the neural network, the hidden layer output is multiplied by the weights plus bias values, and
the total is transformed using another activation function. This will be the predicted value of
the neural network for a given input value.
• Backward Propagation: As the network generates an output, the loss function(C) indicates
how well it predicted the output. The network performs back propagation to minimize the loss.
A back propagation method minimizes the loss function by adjusting the weights and biases of
the neural network. In this method, the gradient of the loss function is calculated with respect
to each weight in the network.
• In backpropagation, the new weight (w_new) of a node is calculated using the old weight (w_old) and the product of the learning rate (η) and the gradient of the loss function with respect to that weight:

w_new = w_old − η · (∂C/∂w_old)

where ∂C/∂w_old is the gradient of the loss function.


• With the chain rule of partial derivatives, we can represent gradient of the loss function as a
product of gradients of all the activation functions of the nodes with respect to their weights.
Therefore, the updated weights of nodes in the network depend on the gradients of the activation
functions of each node.
• We know that the partial derivative of the sigmoid function reaches a maximum value of 0.25. When there are more layers in the network, the product of these derivatives keeps decreasing, until at some point the partial derivative of the loss function approaches a value close to zero and the gradient effectively vanishes. We call this the vanishing gradient problem.
• With shallow networks, sigmoid function can be used as the small value of gradient does not
become an issue. When it comes to deep networks, the vanishing gradient could have a
significant impact on performance. The weights of the network remain unchanged as the
derivative vanishes. During back propagation, a neural network learns by updating its weights
and biases to reduce the loss function. In a network with vanishing gradient, the weights cannot
be updated, so the network cannot learn. The performance of the network will decrease as a
result.
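
A small numerical illustration of this effect (illustrative numbers): by the chain rule, the gradient reaching an early layer contains a product of one sigmoid derivative per layer, each at most 0.25, so the product shrinks rapidly as depth grows.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)             # maximum value is 0.25, reached at z = 0

per_layer = sigmoid_derivative(0.0)  # best case: every layer contributes the maximum 0.25
for depth in (2, 5, 10, 20):
    # Product of derivatives along the chain from the output back to an early layer
    print(depth, "layers ->", per_layer ** depth)
# The product approaches zero as depth grows: the gradient "vanishes"
```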

Method to overcome the problem:

• ReLU: The vanishing gradient problem is caused by the derivative of the activation function
used to create the neural network. The simplest solution to the problem is to replace the
activation function of the network. Instead of sigmoid, use an activation function such as
ReLU.

Fig. ReLU Activation Function

Rectified Linear Units (ReLU) are activation functions that generate a positive linear output
when they are applied to positive input values. If the input is negative, the function will return
zero.
The derivative of a ReLU function is defined as 1 for inputs that are greater than zero and 0 for
inputs that are negative.
If the ReLU function is used for activation in a neural network in place of a sigmoid function, the partial derivatives in the product take values of 0 or 1, which prevents the gradient from vanishing.
• Leaky ReLU: The problem with the use of ReLU is when the gradient has a value of 0. In such
cases, the node is considered as a dead node since the old and new values of the weights remain
the same. This situation can be avoided by the use of a leaky ReLU function which prevents
the gradient from falling to the zero value.
• Weight initialization: Another technique to avoid the vanishing gradient problem is weight
initialization. This is the process of assigning initial values to the weights in the neural network
so that during back propagation, the weights never vanish.

13. What are the hyperparameters of a neural network?

Hyperparameters of a neural network are variables that determine the network’s architecture and
behaviour during training. They include the number of layers, the number of nodes in each layer, the
activation functions, learning rate, batch size, regularization parameters, dropout rate, optimizer choice,
and weight initialization methods. Tuning these hyperparameters is crucial for optimizing the neural
network’s performance.

14. What is Hyperparameter tuning in neural network?

Hyperparameter tuning in neural networks refers to the process of finding the optimal combination of
hyperparameters to maximize the performance and effectiveness of the network. It involves
systematically exploring different values or ranges of hyperparameters, training and evaluating the
network for each configuration, and selecting the set of hyperparameters that yield the best performance
on a validation set.

15. What is need of regularization? Explain L1 and L2 regularization.

One of the major aspects of training a machine learning model is avoiding overfitting. An overfitted model achieves high accuracy on the training data but low accuracy on unseen data, because it tries too hard to capture the noise in the training dataset. Ridge and Lasso regression are simple techniques to reduce model complexity and prevent the over-fitting that may result from simple linear regression. The key differences are below.

1. Ridge Regression: L2 Regularization


o Performs L2 regularization,
o In ridge regression, the cost function is altered by adding a penalty equivalent to square
of the magnitude of the coefficients.

o So ridge regression puts constraint on the coefficients (w). The penalty term (lambda)
regularizes the coefficients such that if the coefficients take large values the
optimization function is penalized. So, ridge regression shrinks the coefficients and
it helps to reduce the model complexity and multi-collinearity.

2. Lasso Regression: L1 Regularization


o Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of the
magnitude of coefficients.

o The only difference is instead of taking the square of the coefficients, magnitudes are
taken into account.
o This type of regularization (L1) can lead to zero coefficients i.e. some of the features
are completely neglected for the evaluation of output. So Lasso regression not only
helps in reducing over-fitting but it can help us in feature selection.
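
A minimal sketch of how the two penalties alter a cost function (NumPy; the weights, targets, and lambda value are illustrative).

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def ridge_cost(y_true, y_pred, w, lam):
    # L2 regularization: penalty on the squared magnitude of the coefficients
    return mse(y_true, y_pred) + lam * np.sum(w ** 2)

def lasso_cost(y_true, y_pred, w, lam):
    # L1 regularization: penalty on the absolute magnitude of the coefficients
    return mse(y_true, y_pred) + lam * np.sum(np.abs(w))

w = np.array([2.0, -0.5, 0.0, 1.2])
y_true, y_pred = np.array([1.0, 0.0]), np.array([0.9, 0.2])
print(ridge_cost(y_true, y_pred, w, lam=0.1))
print(lasso_cost(y_true, y_pred, w, lam=0.1))
```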

16. Explain dropout regularization. (5 Marks)

• “Dropout” in machine learning refers to the process of randomly ignoring certain nodes in a layer
during training.
• This means the contribution of the dropped neurons is temporarily removed and they do not have an impact on the model's output during that training pass.
• In the figure below, the neural network on the left represents a typical neural network where all
units are activated. On the right, the red units have been dropped out of the model.
• The values of their weights and biases are not considered during training.

Fig. Standard Neural Net and after applying Dropout

• How will dropout help with overfitting?


Dropout regularization ensures the following:
• Neurons cannot rely on any single input, because it might be dropped out at random. This reduces over-reliance on particular inputs, which is a major cause of overfitting.
• Neurons will not learn redundant details of the inputs. This ensures only important information is stored by the neurons, which enables the neural network to gain useful knowledge that it uses to make predictions.
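
A minimal sketch of (inverted) dropout applied to a layer's activations during training (NumPy; the drop rate is illustrative). At test time no units are dropped.

```python
import numpy as np

def dropout(activations, drop_rate, training=True):
    if not training or drop_rate == 0.0:
        return activations                           # no units are ignored at test time
    keep_prob = 1.0 - drop_rate
    # Randomly ignore (zero out) some units for this training pass
    mask = np.random.rand(*activations.shape) < keep_prob
    # "Inverted" scaling keeps the expected activation the same as without dropout
    return activations * mask / keep_prob

h = np.array([0.2, 0.9, 0.4, 0.7, 0.1])
print(dropout(h, drop_rate=0.5))
```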

17. Explain regularization of neural networks using DropConnect.


• DropConnect is a generalization of Dropout for regularizing large fully-connected layers within
neural networks.
• When training with Dropout, a randomly selected subset of activations are set to zero within each
layer.
• DropConnect instead sets a randomly selected sub-set of weights within the network to zero.
• Each unit thus receives input from a random subset of units in the previous layer.
• A bound on the generalization performance of both Dropout and DropConnect can be derived.
• Evaluated on a range of datasets and compared with Dropout, DropConnect achieves state-of-the-art results on several image recognition benchmarks when multiple DropConnect-trained models are aggregated.
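
A minimal sketch contrasting DropConnect with Dropout (NumPy; shapes, values, and the ReLU activation are illustrative): instead of zeroing activations, a random subset of the weights is set to zero for each training pass.

```python
import numpy as np

def dropconnect_forward(x, W, b, drop_rate=0.5):
    # Randomly set a subset of the *weights* (not the activations) to zero
    mask = np.random.rand(*W.shape) >= drop_rate
    W_masked = W * mask
    # Each unit now receives input from a random subset of units in the previous layer
    return np.maximum(0.0, W_masked @ x + b)

x = np.array([0.3, -0.1, 0.8])
W = np.random.randn(4, 3)
b = np.zeros(4)
print(dropconnect_forward(x, W, b))
```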

18. Different Ways of Hyperparameter Tuning


Hyperparameters are configuration variables that control the learning process of a machine learning
model. They are distinct from model parameters, which are the weights and biases that are learned
from the data. There are several different types of hyperparameters:
Hyperparameters in Neural Networks
Neural networks have several essential hyperparameters that need to be adjusted, including:
• Learning rate: This hyperparameter controls the step size taken by the optimizer during
each iteration of training. Too small a learning rate can result in slow convergence, while
too large a learning rate can lead to instability and divergence.
• Epochs: This hyperparameter represents the number of times the entire training dataset is
passed through the model during training. Increasing the number of epochs can improve
the model’s performance but may lead to overfitting if not done carefully.
• Number of layers: This hyperparameter determines the depth of the model, which can
have a significant impact on its complexity and learning ability.
• Number of nodes per layer: This hyperparameter determines the width of the model,
influencing its capacity to represent complex relationships in the data.
• Architecture: This hyperparameter determines the overall structure of the neural
network, including the number of layers, the number of neurons per layer, and the
connections between layers. The optimal architecture depends on the complexity of the
task and the size of the dataset
• Activation function: This hyperparameter introduces non-linearity into the
model, allowing it to learn complex decision boundaries. Common activation functions
include sigmoid, tanh, and Rectified Linear Unit (ReLU).
Hyperparameters in Support Vector Machine
We take into account some essential hyperparameters for fine-tuning SVMs:
• C: The regularization parameter that controls the trade-off between the margin and the
number of training errors. A larger value of C penalizes training errors more heavily,
resulting in a smaller margin but potentially better generalization performance. A smaller
value of C allows for more training errors but may lead to overfitting.
• Kernel: The kernel function that defines the similarity between data points. Different
kernels can capture different relationships between data points, and the choice of kernel
can significantly impact the performance of the SVM. Common kernels include linear,
polynomial, radial basis function (RBF), and sigmoid.
• Gamma: The parameter that controls the influence of support vectors on the decision
boundary. A larger value of gamma indicates that nearby support vectors have a stronger
influence, while a smaller value indicates that distant support vectors have a weaker
influence. The choice of gamma is particularly important for RBF kernels.
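
A small sketch (assuming scikit-learn and synthetic data, purely for illustration) of tuning the SVM hyperparameters listed above with a grid search over candidate values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic classification data, used only to make the example self-contained
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values for the hyperparameters described above
param_grid = {
    "C": [0.1, 1, 10],               # regularization strength
    "kernel": ["linear", "rbf"],     # similarity function between data points
    "gamma": ["scale", 0.01, 0.1],   # influence of individual support vectors (used by RBF)
}

# Try every combination with cross-validation and keep the best-performing one
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```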
