DL Unit - 1 Notes

Notes By Sarvagya Jain

AI vs ML vs DL:

AI (Artificial Intelligence), ML (Machine Learning), and DL (Deep Learning) are three closely related
but distinct concepts in the field of computer science and data science.
Artificial Intelligence (AI): Developing machines to mimic human intelligence and behavior. It
includes reasoning, problem-solving, understanding natural language, recognizing patterns, planning,
and decision-making.
Machine Learning (ML): Algorithms that learn from structured data to predict outputs and discover
patterns in that data. ML encompasses various techniques, including supervised learning, unsupervised
learning, and reinforcement learning, among others.
Deep Learning (DL): Algorithms based on highly complex neural networks that mimic the way a
human brain works to detect patterns in large unstructured data sets. DL architectures include
convolutional neural networks (CNNs) for image processing and recurrent neural networks (RNNs)
for sequential data.

Introduction to Deep Learning


Deep learning is a subfield and evolution of machine learning and neural networks, which uses multi-layered neural networks, trained on large amounts of data, to understand complex patterns hidden in large data sets.
Deep learning works with artificial neural networks, which are designed to imitate how humans think
and learn.
DL is about understanding how the human brain works in different situations and then trying to
recreate its behavior.
Deep learning is used to complete complex tasks and to train models using unstructured data. For
example, deep learning is commonly used in image classification tasks like facial recognition, defect
detection, and image processing.
Deep learning focuses on the development of artificial neural networks with multiple layers of
interconnected nodes(neurons), often referred to as deep neural networks. These networks are inspired
by the structure and function of the human brain.
Key characteristics and concepts of deep learning include:
​ Deep Neural Networks: Deep learning models are characterized by having multiple layers of
interconnected artificial neurons (nodes). These layers can include an input layer, one or more
hidden layers, and an output layer.
​ Feature Learning: Deep neural networks are capable of automatically learning and extracting
features from raw data. In image recognition, for example, lower layers might learn to
recognize edges and simple shapes, while higher layers combine these features to recognize
more complex patterns, such as objects or faces.
​ Backpropagation: Deep learning models are trained using a process called backpropagation.
During training, the model makes predictions, calculates the error between its predictions and
the actual target values, and then adjusts the network's weights and biases to minimize this
error. This iterative process continues until the model's performance converges to a satisfactory
level.
​ Activation Functions: Activation functions introduce non-linearity into the neural network,
allowing it to model complex relationships within the data. Common activation functions
include sigmoid, tanh, and rectified linear unit (ReLU).
​ Large Datasets: Deep learning models often require extensive amounts of labeled training data
to generalize well and achieve high performance on tasks. The availability of big data has
played a significant role in the success of deep learning.
​ Computational Resources: Training deep neural networks can be computationally intensive,
and many deep learning tasks benefit from specialized hardware, such as graphics processing
units (GPUs) and tensor processing units (TPUs), to speed up the training process.
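The three activation functions named above (sigmoid, tanh, and ReLU) can be sketched in a few lines of Python; this is an illustrative sketch using NumPy, not code from the notes:

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Maps any real value into the open interval (-1, 1)
    return np.tanh(z)

def relu(z):
    # Passes positive values through unchanged, zeroes out negatives
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # values strictly between 0 and 1; sigmoid(0) = 0.5
print(tanh(z))     # values strictly between -1 and 1; tanh(0) = 0
print(relu(z))     # [0. 0. 2.]
```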
Deep learning has been applied to various domains and tasks, including:
● Computer Vision: Object detection, image classification, facial recognition, and image
generation.
● Natural Language Processing (NLP): Machine translation, sentiment analysis, text generation,
and chatbots.
● Speech Recognition: Speech-to-text conversion and voice assistants.
● Recommendation Systems: Personalized content recommendations in e-commerce and
streaming platforms.
● Autonomous Vehicles: Self-driving cars use deep learning for perception and decision-making.

History of Deep Learning


The history of deep learning is a fascinating journey that spans several decades, marked by key
milestones and breakthroughs. Here is an overview of the significant developments in the history of
deep learning:
​ 1940s-1950s: The Origins of Neural Networks
● The concept of artificial neurons and neural networks emerged in the 1940s and 1950s.
Pioneers like Warren McCulloch and Walter Pitts proposed mathematical models of
artificial neurons, which laid the foundation for neural network research.
​ 1960s-1970s: Early Neural Network Research
● The 1960s and 1970s saw the development of perceptrons, an early type of neural network.
However, Minsky and Papert's 1969 analysis exposed the limitations of the single-layer
perceptron architecture (such as its inability to represent XOR), and neural network
research declined during the ensuing "AI winter."
​ 1980s-1990s: Revival of Neural Networks
● In the 1980s, researchers like David Rumelhart, Geoffrey Hinton, and Ronald Williams
reintroduced neural networks with the development of the backpropagation algorithm.
This allowed for efficient training of multi-layer perceptrons and marked the beginning
of the modern era of neural networks.
● The concept of deep learning emerged during this period, but it faced challenges due to
limited computational resources and issues with training deep networks.
​ 2000s: Rise and Fall of Deep Learning
● Deep learning continued to evolve with the introduction of various neural network
architectures, including convolutional neural networks (CNNs) for image processing
and recurrent neural networks (RNNs) for sequential data.
● Despite advancements, deep learning faced challenges such as vanishing gradients and
slow training times. As a result, shallow networks and other machine learning
techniques remained popular.
​ 2010s: Deep Learning Resurgence
● The 2010s witnessed a significant resurgence of deep learning, largely fueled by the
availability of large datasets, powerful GPUs, and distributed computing resources.
● Geoffrey Hinton and his team made breakthroughs in training deep networks,
particularly with the introduction of the Rectified Linear Unit (ReLU) activation
function.
● Deep learning achieved unprecedented success in computer vision, natural language
processing, and speech recognition, including the ImageNet Large Scale Visual
Recognition Challenge.
​ 2012: AlexNet and ImageNet
● The AlexNet architecture, developed by Alex Krizhevsky and his team, won the
ImageNet competition, demonstrating the power of deep convolutional neural networks
for image classification.
​ 2015: DeepMind's AlphaGo
● DeepMind's AlphaGo, a deep reinforcement learning system, defeated European champion
Fan Hui in 2015 and then world champion Lee Sedol in 2016, marking a milestone in AI
research and showcasing the capabilities of deep learning.
​ 2018-2020: Transformers and BERT
● The introduction of the Transformer architecture by Vaswani et al. in 2017 revolutionized
natural language processing, leading to breakthroughs like the Bidirectional Encoder
Representations from Transformers (BERT) model in 2018.
​ Recent Developments: Ongoing Progress
● Deep learning continues to advance with innovations in areas like generative
adversarial networks (GANs), reinforcement learning, and self-supervised learning.
● Research focuses on making deep learning models more efficient, interpretable, and
applicable to various domains.
Biological Neurons: An Overly Simplified Illustration:

Dendrite: Receives signals from other neurons


Soma: Processes the information
Axon: Transmits the output of this neuron
Synapse: Point of connection to other neurons
Basically, a neuron takes an input signal (dendrite), processes it like a CPU (soma), and passes the output
through a cable-like structure to other connected neurons (axon to synapse to the next neuron's dendrite).
Our sense organs interact with the outer world and send visual and sound information to the
neurons. Let's say you are watching Friends. The information your brain receives is taken in by
the "laugh or not" set of neurons that help you decide whether to laugh. Each
neuron gets fired/activated only when its respective criterion (more on this later) is met.

In reality, it is not just a couple of neurons which would do the decision making. There is a massively
parallel interconnected network of 10¹¹ neurons (100 billion) in our brain and their connections are not
as simple.
The sense organs pass the information to the first/lowest layer of neurons to process it. The output
of that processing is passed on to the next layers in a hierarchical manner; some of the neurons will fire
and some won't, and this process goes on until it results in a final response, in this case, laughter.
This massively parallel network also ensures that there is a division of work. Each neuron fires only
when its intended criterion is met, i.e., a neuron may perform a certain role for a certain stimulus.

It is believed that neurons are arranged in a hierarchical fashion and each layer has its own role and
responsibility. To detect a face, the brain could be relying on the entire network and not on a single
layer.

The McCulloch-Pitts neuron


The McCulloch-Pitts neuron, also known as the McCulloch-Pitts model or M-P neuron, is a simplified
mathematical model of a biological neuron. The first computational model of a neuron was proposed
by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943.
It may be divided into 2 parts. The first part, g takes an input (ahem dendrite ahem), performs an
aggregation and based on the aggregated value the second part, f makes a decision.
Let's suppose that I want to predict my own decision, whether to watch a random football game or not
on TV. The inputs are all boolean, i.e., {0, 1}, and my output variable is also boolean {1: Will watch it,
0: Won't watch it}.
So, x_1 could be isPremierLeagueOn (I like Premier League more)
x_2 could be isItAFriendlyGame (I tend to care less about the friendlies)
x_3 could be isNotHome (Can’t watch it when I’m running errands. Can I?)
x_4 could be isManUnitedPlaying (I am a big Man United fan. GGMU!) and so on.

These inputs can either be excitatory or inhibitory. Inhibitory inputs are those that have an overriding
effect on the decision-making irrespective of other inputs, i.e., if x_3 is 1 (not home) then my output
will always be 0, i.e., the neuron will never fire, so x_3 is an inhibitory input. Excitatory inputs will
NOT make the neuron fire on their own, but they might fire it when combined
together. Formally, this is what is going on:

We can see that g(x) is just doing a sum of the inputs, a simple aggregation. And theta here is called
the thresholding parameter. For example, if I always watch the game when the sum turns out to be 2 or
more, then theta is 2. This is called Thresholding Logic.
Key characteristics of the McCulloch-Pitts neuron model include:
​ Binary Activation: In the M-P neuron model, both the inputs and the neuron's output are
binary, taking on values of 0 or 1. This binary nature simplifies the mathematical description.
​ Inputs and Weights: The neuron receives inputs from other neurons or external sources. Each
input is associated with a weight, which represents the importance or strength of that input.
These weights can be either excitatory (positive) or inhibitory (negative).
​ Threshold: The neuron has a threshold value, often denoted as "θ" (theta). The inputs are
linearly combined with their respective weights, and if this sum exceeds the threshold, the
neuron fires (outputs 1); otherwise, it remains inactive (outputs 0).

Mathematically, the McCulloch-Pitts neuron can be expressed as follows:


Output = 1 if Σ(w_i * x_i) ≥ θ
Output = 0 if Σ(w_i * x_i) < θ
Where:
● Output is the binary output of the neuron (0 or 1).
● w_i represents the weight associated with input x_i.
● x_i represents the binary input (0 or 1).
● Σ denotes the summation over all inputs.
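The thresholding rule above translates directly into a few lines of Python; this is a sketch, and the function name is my own:

```python
def mp_neuron(inputs, weights, theta):
    # McCulloch-Pitts neuron: fire (output 1) iff the weighted sum reaches theta
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

# AND gate: weights of 1 for both inputs, threshold of 2
print(mp_neuron([1, 1], [1, 1], theta=2))  # 1
print(mp_neuron([1, 0], [1, 1], theta=2))  # 0
```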
Boolean Functions Using M-P Neuron:
A lot of boolean decision problems can be represented by the M-P neuron.
​ AND Operation:
● To implement the AND operation using McCulloch-Pitts neurons, you need two binary
inputs, x1 and x2, and a threshold (θ) of 2.
● Assign a weight of 1 to both inputs.
● If both inputs are 1, the sum of the weighted inputs exceeds the threshold, and the
neuron outputs 1; otherwise, it outputs 0.
x1 | x2 | Output
----------------
 0 |  0 |   0
 0 |  1 |   0
 1 |  0 |   0
 1 |  1 |   1

​ OR Operation:
● To implement the OR operation using McCulloch-Pitts neurons, you need two binary
inputs, x1 and x2, and a threshold (θ) of 1.
● Assign a weight of 1 to both inputs.
● If at least one input is 1, the sum of the weighted inputs exceeds the threshold, and the
neuron outputs 1; otherwise, it outputs 0.
x1 | x2 | Output
----------------
 0 |  0 |   0
 0 |  1 |   1
 1 |  0 |   1
 1 |  1 |   1

NOT Operation:
● The NOT operation is a unary operation that inverts a single binary input.
● To implement the NOT operation using a McCulloch-Pitts neuron, you need one binary
input, x, and a threshold (θ) of 0.
● Assign a weight of -1 to the input.
● If the input is 0, the weighted sum (0) meets the threshold, and the neuron outputs 1; if
the input is 1, the weighted sum (-1) falls below the threshold, and the neuron outputs 0.
x | Output
----------
0 |   1
1 |   0
These are basic logical operations that can be implemented with McCulloch-Pitts neurons. The model
is limited to binary inputs and binary outputs and is primarily used for demonstrating fundamental
concepts in neural network theory.
Real-world neural networks, including perceptrons and deep neural networks, use more complex
operations and continuous activation functions to perform a wide range of tasks, including pattern
recognition and machine learning.

Artificial Neural Network


An Artificial Neural Network (ANN) is a trainable model built from interconnected layers. ANNs belong to the
field of Artificial Intelligence: a model is trained on training data, and it is expected that the model will behave
accurately on the actual data set.
The concept of the Artificial Neural Network takes its idea from the biological nervous system of the brain.
In the human nervous system, there are neurons to receive and send signals; similarly, in a neural
network, there are nodes and edges.
Artificial Neural Network has three layers:
● Input Layer
● Hidden Layer
● Output Layer
MultiLayer Perceptron Neural Network
A Multilayer Perceptron (MLP) is a type of artificial neural network (ANN) commonly used in
machine learning and deep learning for various tasks, including classification, regression, and function
approximation. It is one of the simplest and most fundamental types of neural networks. An MLP
consists of multiple layers of artificial neurons, organized into an input layer, one or more hidden
layers, and an output layer. Each layer is fully connected to the next layer, meaning that each neuron in
a layer is connected to every neuron in the following layer.
A multilayer perceptron (MLP) neural network belongs to the class of feedforward neural networks.
The word perceptron was first introduced by Frank Rosenblatt in his perceptron program. A perceptron is the
basic unit of an artificial neural network and defines the artificial neuron in the network. It is a
supervised learning algorithm that combines node values, activation functions, inputs, and node
weights to calculate the output.
The Multilayer Perceptron (MLP) neural network works only in the forward direction. All nodes are
fully connected, and each node passes its value only to the nodes in the next layer. The MLP neural
network uses the backpropagation algorithm to improve the accuracy of the trained model.
Structure of MultiLayer Perceptron Neural Network
​ Input Layer: The input layer receives the initial data or features for the neural network. Each
neuron in the input layer represents a feature or attribute of the input data. The number of
neurons in the input layer is equal to the number of input features.
​ Hidden Layers: Between the input and output layers, there can be one or more hidden layers.
The hidden layers play a crucial role in learning complex patterns and representations from the
input data. These layers contain one or more neurons, and the number of neurons in each
hidden layer is a hyperparameter that can be adjusted based on the problem's complexity.
​ Neurons (Perceptrons): Each neuron in the hidden layers and output layer performs a weighted
sum of its inputs, applies an activation function to the sum, and passes the result to the next
layer. The activation function introduces non-linearity into the network, allowing it to learn
complex relationships in the data. Common activation functions include sigmoid, hyperbolic
tangent (tanh), and rectified linear unit (ReLU).
​ Weights and Biases: Each connection between neurons is associated with a weight, which
determines the strength of the connection. Additionally, each neuron has an associated bias
term, which allows it to capture shifts in the data. These weights and biases are learned during
the training process through backpropagation and optimization techniques like gradient
descent.
​ Output Layer: The output layer produces the final predictions or values based on the
information learned in the hidden layers. The number of neurons in the output layer depends on
the nature of the task. For example, in a binary classification problem, there may be a single
output neuron with a sigmoid activation function. In a multi-class classification problem, there
can be multiple output neurons, often with a softmax activation function to produce probability
distributions over classes.
​ Training: MLPs are trained using supervised learning, where the network's weights and biases
are adjusted iteratively to minimize a loss function that measures the difference between the
predicted output and the actual target values. This process is typically done through
backpropagation, where gradients are computed with respect to the loss, and optimization
algorithms like stochastic gradient descent (SGD) are used to update the network's parameters.
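The forward pass described in this structure can be sketched in plain NumPy. This is an illustrative sketch with a made-up 3-4-2 layout and random weights, not a trained model from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny MLP: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)   # hidden -> output weights and biases

def forward(x):
    h = sigmoid(x @ W1 + b1)   # hidden layer: weighted sum plus activation
    y = sigmoid(h @ W2 + b2)   # output layer
    return y

x = np.array([0.5, -1.0, 2.0])
print(forward(x))  # two values, each strictly between 0 and 1
```

In a real network these weights would then be adjusted by backpropagation, as described in the training point above.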

Working of MultiLayer Perceptron Neural Network


● Each input node represents a feature of the dataset.
● Each input node passes its value to the hidden layer.
● In the hidden layer, each edge weight is multiplied by the corresponding input value, and the
products arriving at a hidden node are summed together to generate its output.
● The activation function is applied in the hidden layer to determine which nodes activate.
● The result is passed to the output layer.
● The difference between the predicted and actual output is calculated at the output layer.
● The model then uses backpropagation to update the weights.

Representation Power of Perceptron Networks


The representation power of a network refers to its ability to approximate complex functions or learn
intricate patterns from data. With a sufficient number of hidden layers and neurons, a multilayer
perceptron can approximate any continuous function to arbitrary precision, given enough training data
and appropriate training algorithms.
This property of multilayer perceptrons is known as the universal approximation theorem.
To illustrate the representation power of a Multilayer Perceptron (MLP), let's consider a simple
example involving the XOR (exclusive OR) function. XOR is a classic problem that is not linearly
separable, meaning it cannot be solved using a single straight line or plane. MLPs can effectively
represent and solve this problem due to their ability to model complex, non-linear relationships.
XOR Problem:
The XOR problem involves two binary inputs (0 or 1) and has the following truth table:

Input 1 | Input 2 | Output
--------------------------
   0    |    0    |   0
   0    |    1    |   1
   1    |    0    |   1
   1    |    1    |   0
As you can see, XOR produces an output of 1 when exactly one of the inputs is 1, and it produces an
output of 0 when both inputs are the same (either both 0 or both 1).
MLP Architecture:
Let's build a simple MLP to solve the XOR problem:
● Input layer with two neurons (corresponding to the two binary inputs).
● One hidden layer with two neurons, using a non-linear activation function like the sigmoid or
hyperbolic tangent (tanh).
● Output layer with one neuron and a sigmoid activation function to produce a binary
classification output (0 or 1).

Training:
We train the MLP on the XOR dataset using binary cross-entropy loss and backpropagation. As
training progresses, the MLP adjusts its weights and biases to minimize the loss, effectively learning to
represent the XOR function.
Result:
After training, the MLP will successfully learn to approximate the XOR function, effectively modeling
the non-linear decision boundary.
An example of a network with two layers to implement XOR using perceptrons:

# First layer perceptrons
def AND(x1, x2):
    w1, w2 = 1, 1
    bias = -1.5
    activation = w1 * x1 + w2 * x2 + bias
    return 1 if activation >= 0 else 0

def NAND(x1, x2):
    # Implementing the NAND function using a perceptron
    w1, w2 = -1, -1
    bias = 1.5
    activation = w1 * x1 + w2 * x2 + bias
    return 1 if activation >= 0 else 0

def OR(x1, x2):
    w1, w2 = 1, 1
    bias = -0.5
    activation = w1 * x1 + w2 * x2 + bias
    return 1 if activation >= 0 else 0

# Second layer perceptron
def XOR(x1, x2):
    # XOR(x1, x2) = AND(NAND(x1, x2), OR(x1, x2))
    first_layer_output = NAND(x1, x2)
    second_layer_output = OR(x1, x2)
    return AND(first_layer_output, second_layer_output)

print(XOR(0, 0))  # 0
print(XOR(0, 1))  # 1
print(XOR(1, 0))  # 1
print(XOR(1, 1))  # 0

A sigmoid neuron, also known as a logistic neuron or logistic regression unit, is a type of artificial
neuron used in neural networks and machine learning. It is called "sigmoid" because of its activation
function, the sigmoid function (often denoted σ):

σ(z) = 1 / (1 + e^(-z))

The sigmoid function takes the weighted sum of inputs and maps it to a range between 0 and 1, which
makes it suitable for binary classification problems and as an activation function in neural networks.
When the weighted sum is large and positive, the sigmoid output approaches 1; when the weighted
sum is large and negative, the sigmoid output approaches 0. When the weighted sum is close to zero,
the sigmoid output is approximately 0.5.
Key characteristics of sigmoid neurons include:
​ Non-linearity: The sigmoid function introduces non-linearity into the neural network, allowing
it to model complex relationships in the data. This is crucial for solving tasks that are not
linearly separable.
​ Smoothness: The sigmoid function is smooth and differentiable, which is important for
gradient-based optimization algorithms like backpropagation. The smoothness allows for
efficient training of neural networks.
​ Binary Classification: Sigmoid neurons are often used in the output layer of a neural network
for binary classification tasks, where the goal is to predict one of two classes (e.g., 0 or 1).
Learning Algorithm
In this section, we will discuss an algorithm for learning the parameters w and b of the sigmoid neuron
model by using the gradient descent algorithm.

Minimize the Squared Error Loss


The objective of the learning algorithm is to determine the best possible values for the parameters,
such that the overall loss (squared error loss) of the model is minimized as much as possible. Here
goes the learning algorithm:

Sigmoid Learning Algorithm


We initialize w and b randomly. We then iterate over all the observations in the data, for each
observation find the corresponding predicted outcome using the sigmoid function and compute the
squared error loss. Based on the loss value, we will update the weights such that the overall loss of the
model at the new parameters will be less than the current loss of the model.

Loss Optimization
We will keep doing the update operation until we are satisfied. "Satisfied" could mean any of the
following:
● The overall loss of the model becomes zero.
● The overall loss of the model becomes a very small value closer to zero.
● Iterating for a fixed number of passes based on computational capacity.
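For the squared error loss described above, the gradient descent updates can be written out explicitly. Using σ for the sigmoid and η for the learning rate (η is a symbol not used elsewhere in these notes):

```latex
\mathcal{L} = \tfrac{1}{2}\sum_i \left(\hat{y}_i - y_i\right)^2,
\qquad \hat{y}_i = \sigma(w x_i + b)

\frac{\partial \mathcal{L}}{\partial w}
  = \sum_i (\hat{y}_i - y_i)\,\hat{y}_i\,(1 - \hat{y}_i)\,x_i,
\qquad
\frac{\partial \mathcal{L}}{\partial b}
  = \sum_i (\hat{y}_i - y_i)\,\hat{y}_i\,(1 - \hat{y}_i)

w \leftarrow w - \eta\,\frac{\partial \mathcal{L}}{\partial w},
\qquad
b \leftarrow b - \eta\,\frac{\partial \mathcal{L}}{\partial b}
```

The factor ŷ(1 − ŷ) comes from the derivative of the sigmoid, σ'(z) = σ(z)(1 − σ(z)).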

However, sigmoid neurons have some limitations, such as the vanishing gradient problem, which can
make training deep networks with sigmoid activations challenging. As a result, other activation
functions like the rectified linear unit (ReLU) and its variants have become more popular in modern
deep learning architectures.
Despite their limited use in hidden layers of deep neural networks, sigmoid neurons still play a role in
certain applications and can be useful in specific scenarios, especially when dealing with binary
classification problems or when the output needs to be in the [0, 1] range.
Program
In this program:
​ We define the input data (features) and labels (ground truth) for the XOR problem.
​ We create TensorFlow variables for the weights and bias of the sigmoid neuron.
​ The sigmoid_neuron function defines the forward pass operation of the sigmoid neuron.
​ We define the binary cross-entropy loss function for training.
​ We use stochastic gradient descent (SGD) as the optimizer.
​ The training loop iterates through epochs, computes gradients, and updates weights and bias.
​ Finally, we test the trained sigmoid neuron with the XOR input data.
import tensorflow as tf
import numpy as np

# Define the input data (features)
input_data = np.array([[0.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 0.0],
                       [1.0, 1.0]], dtype=np.float32)

# Define the labels (ground truth)
labels = np.array([[0.0], [1.0], [1.0], [0.0]], dtype=np.float32)

# Define the weights and bias for the sigmoid neuron
weights = tf.Variable(tf.random.normal([2, 1], mean=0.0, stddev=1.0), dtype=tf.float32)
bias = tf.Variable(tf.zeros([1]), dtype=tf.float32)

# Forward pass: raw weighted sum (logits) and sigmoid output
def logits(x):
    return tf.matmul(x, weights) + bias

def sigmoid_neuron(x):
    return tf.sigmoid(logits(x))

# Define the loss function (binary cross-entropy); note that
# sigmoid_cross_entropy_with_logits expects raw logits, not sigmoid outputs
def loss(y_true, y_logits):
    return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_logits))

# Define the optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Training loop
epochs = 10000
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        current_loss = loss(labels, logits(input_data))
    gradients = tape.gradient(current_loss, [weights, bias])
    optimizer.apply_gradients(zip(gradients, [weights, bias]))
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}: Loss = {current_loss.numpy()}")

# Test the trained sigmoid neuron
test_input = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], dtype=np.float32)
predicted_output = sigmoid_neuron(test_input).numpy()
print("Predicted Output:")
print(predicted_output)

Note that XOR is not linearly separable, so a single sigmoid neuron cannot fit it exactly; the predictions will hover near 0.5. The program still demonstrates the complete training loop.

Feed-Forward Neural Network


A feedforward neural network, also known as a multi-layered network of neurons, is called
"feedforward" because information flows in one direction, from the input layer to the output layer,
without looping back. It is composed of three types of layers:
● Input Layer:
The input layer accepts the input data and passes it to the next layer.
● Hidden Layers:
One or more hidden layers that process and transform the input data. Each hidden layer has a
set of neurons connected to the neurons of the previous and next layers. These layers use
activation functions, such as ReLU or sigmoid, to introduce non-linearity into the network,
allowing it to learn and model more complex relationships between the inputs and outputs.
● Output Layer:
The output layer generates the final output. Depending on the type of problem, the number of
neurons in the output layer may vary. For example, in a binary classification problem, it would
only have one neuron. In contrast, a multi-class classification problem would have as many
neurons as the number of classes.
The purpose of a feedforward neural network is to approximate certain functions. The input to the
network is a vector of values, x, which is passed through the network, layer by layer, and transformed
into an output, y. The network's final output predicts the target function for the given input. The
network makes this prediction using a set of parameters, θ (theta), adjusted during training to minimize
the error between the network's predictions and the target function.
The training involves adjusting the θ (theta) values to minimize errors. This is done by presenting the
network with a set of input-output pairs (also called training data) and computing the error between the
network's prediction and the true output for each pair. This error is then used to compute the gradient
of the error with respect to the parameters, which tells us how to adjust the parameters to reduce the error.
This is done using optimization techniques like gradient descent. Once the training process is
completed, the network has "learned" the function and can be used to predict new inputs.
Finally, the network stores this optimal value of θ (theta) in its memory, so it can use it to make
predictions on new inputs.

The neural network compares the outputs of its nodes with the desired values using a rule known
as the delta rule, allowing the network to adjust its weights through training to produce more
accurate output values. This training procedure is a form of gradient descent. The technique
for updating weights in multi-layered perceptrons is virtually the same; however, the process is referred
to as back-propagation. In such circumstances, the output values produced by the final layer are used to
adjust the weights of each hidden layer inside the network.
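The delta rule mentioned above can be sketched for a single output unit. This is an illustrative sketch with made-up names; η (here `lr`) is the learning rate:

```python
def delta_rule_update(weights, x, target, output, lr=0.1):
    # Delta rule: w_i <- w_i + lr * (target - output) * x_i
    return [w + lr * (target - output) * xi for w, xi in zip(weights, x)]

w = [0.2, -0.4]
w = delta_rule_update(w, x=[1.0, 1.0], target=1.0, output=0.3, lr=0.1)
print(w)  # each weight nudged up by 0.1 * (1.0 - 0.3) * x_i = 0.07
```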
Backpropagation
Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for errors
working back from output nodes to input nodes. It is an important mathematical tool for improving the
accuracy of predictions in data mining and machine learning. Essentially, backpropagation is an
algorithm used to calculate derivatives quickly.
There are two leading types of backpropagation networks:
1. Static backpropagation. Static backpropagation is a network developed to map static inputs
for static outputs. Static backpropagation networks can solve static classification problems,
such as optical character recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is used for fixed-point
learning. Recurrent backpropagation activation feeds forward until it reaches a fixed value.
Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the
error with respect to the weight values for the various inputs. By comparing desired outputs to achieved
system outputs, the systems are tuned by adjusting connection weights to narrow the difference between
the two as much as possible.
The main features of backpropagation are that it is an iterative, recursive, and efficient method for
calculating the updated weights, improving the network until it can perform the task for
which it is being trained. Backpropagation requires the derivatives of the activation function to be
known at network design time.
Now, how is the error function used in backpropagation, and how does backpropagation work? Let's start with
an example and work through the mathematics to understand exactly how to update the weights using
backpropagation.

Input values: x1, x2
Initial weights: w1, w2, w3, w4 (input layer → hidden layer) and w5, w6, w7, w8 (hidden layer → output layer)
Bias values: b1 (hidden layer), b2 (output layer)
Target values: T1, T2
Now, we first calculate the values of H1 and H2 by a forward pass.
Forward Pass
To find the value of H1 we first multiply the input value from the weights as
H1=x1×w1+x2×w2+b1
To calculate the final result of H1, we apply the sigmoid function:
H1final = 1 / (1 + e^(-H1))
We will calculate the value of H2 in the same way as H1


H2=x1×w3+x2×w4+b1
To calculate the final result of H2, we apply the sigmoid function:
H2final = 1 / (1 + e^(-H2))
Now, we calculate the values of y1 and y2 in the same way as we calculate the H1 and H2.
To find the value of y1, we first multiply the input value i.e., the outcome of H1 and H2 from the
weights as
y1=H1×w5+H2×w6+b2
To calculate the final result of y1, we apply the sigmoid function:
y1final = 1 / (1 + e^(-y1))
We will calculate the value of y2 in the same way as y1


y2=H1×w7+H2×w8+b2
To calculate the final result of y2, we apply the sigmoid function:
y2final = 1 / (1 + e^(-y2))
Our y1final and y2final values will not match the target values T1 and T2.
Now, we will find the total error, which is simply the sum of the squared differences between the
outputs and the target outputs. The total error is calculated as
Etotal = 1/2 (T1 - y1final)² + 1/2 (T2 - y2final)²
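The forward pass and total error above can be sketched in NumPy as follows (the numeric values are illustrative stand-ins of ours, since the original example's numbers are not reproduced in these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs, weights, biases, and targets (not from the notes)
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99

# Hidden layer
H1 = sigmoid(x1 * w1 + x2 * w2 + b1)
H2 = sigmoid(x1 * w3 + x2 * w4 + b1)

# Output layer
y1 = sigmoid(H1 * w5 + H2 * w6 + b2)
y2 = sigmoid(H1 * w7 + H2 * w8 + b2)

# Total squared error against the targets
E_total = 0.5 * (T1 - y1) ** 2 + 0.5 * (T2 - y2) ** 2
```

With these particular values the outputs land far from the targets, so the total error is substantial and the backward pass has something to correct.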
Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer
To update the weight, we calculate the error corresponding to each weight with the help of a total error.
The error on weight w is calculated by differentiating total error with respect to w.

We perform the backward pass, so we first consider the last weight w5.



From equation (2), it is clear that we cannot partially differentiate it with respect to w5 because there
is no w5 in it. We split equation (1) into multiple terms so that we can easily differentiate it with
respect to w5, using the chain rule:
∂Etotal/∂w5 = ∂Etotal/∂y1final × ∂y1final/∂y1 × ∂y1/∂w5
Now, we calculate each term one by one to differentiate Etotal with respect to w5 as

Putting the value of e^(-y1) in equation (5)

So, we put these values into equation (3) to find the final result.

Then, we put this value into equation (3) to find the final result.

Now, we will calculate the updated weight w5new with the help of the following formula (where η is the learning rate):
w5new = w5 - η × ∂Etotal/∂w5
In the same way, we calculate w6new, w7new, and w8new, and this will give us the following values.
Backward pass at Hidden layer
Now, we will backpropagate to our hidden layer and update the weight w1, w2, w3, and w4 as we
have done with w5, w6, w7, and w8 weights.
We will calculate the error at w1 as

From equation (2), it is clear that we cannot partially differentiate it with respect to w1 because there is
no w1. We split equation (1) into multiple terms so that we can easily differentiate it with respect to
w1 as

Now, we calculate each term one by one to differentiate Etotal with respect to w1 as

We again split this because there is no H1final term in Etotal:



We will again split this because there is no H1 term in E1 and E2. The splitting is done as

We again split both because there is no y1 or y2 term in E1 and E2. We split them as

Now, we find these values by substituting into equations (18) and (19).
From equation (18)

From equation (19)



Putting the value of e^(-y2) in equation (23)

We put this value into equation (15) as

Next, we need to figure out the remaining term as



Putting the value of e^(-H1) in equation (30)

We calculate the partial derivative of the total net input to H1 with respect to w1 the same as we did
for the output neuron:

So, we put these values into equation (13) to find the final result.

Now, we will calculate the updated weight w1new with the help of the following formula:
w1new = w1 - η × ∂Etotal/∂w1
In the same way, we calculate w2new, w3new, and w4new, and this will give us the following values.
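The chain-rule gradient and weight update for w5 worked through above can be sketched in NumPy (values are illustrative stand-ins of ours; the learning rate η is written as lr):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (ours, not from the notes)
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99
lr = 0.5

# Forward pass
H1 = sigmoid(x1 * w1 + x2 * w2 + b1)
H2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(H1 * w5 + H2 * w6 + b2)
y2 = sigmoid(H1 * w7 + H2 * w8 + b2)

# Output-layer gradient via the chain rule:
# dE/dw5 = dE/dy1 * dy1/dnet1 * dnet1/dw5
dE_dy1 = -(T1 - y1)
dy1_dnet1 = y1 * (1 - y1)   # derivative of the sigmoid
dnet1_dw5 = H1
dE_dw5 = dE_dy1 * dy1_dnet1 * dnet1_dw5

# Gradient-descent update of w5
w5_new = w5 - lr * dE_dw5
```

The same three-factor chain applies to w6, w7, and w8; for the hidden-layer weights the error term is accumulated over both outputs before multiplying through, exactly as the derivation above splits it.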
Weight Initialization
The main objective of weight initialization is to prevent layer activation outputs from exploding or
vanishing during forward propagation. If either of these problems occurs, loss gradients will be either
too large or too small, and the network will take more time to converge, if it is able to do so at all.
If we initialize the weights correctly, then our objective, i.e., optimization of the loss function, will be
achieved in the least time; otherwise, converging to a minimum using gradient descent may be
impossible.

Different Weight Initialization Techniques


One of the important things to keep in mind while building your neural network is how to
initialize the weight matrices for the different connections between layers.
Let us look at the following two initialization scenarios, which can cause issues while training the
model:
Zero Initialization (Initialized all weights to 0)
If we initialize all the weights with 0, the derivative with respect to the loss function is the
same for every weight in W[l], so all weights have the same value in subsequent iterations. This
makes the hidden layers symmetric, and this process continues for all n iterations. Thus, initializing the
weights with zero makes your network no better than a linear model. It is important to note that setting
biases to 0 will not create any problems, as non-zero weights take care of breaking the symmetry;
even if a bias is 0, the values in every neuron will still be different.
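A quick NumPy sketch of the symmetry problem: with zero-initialized weights, two hidden neurons compute identical pre-activations for any input, while small random weights break the tie (names and values below are ours, chosen only for illustration):

```python
import numpy as np

x = np.array([0.7, -1.3, 0.2])          # an arbitrary input vector

W_zero = np.zeros((2, 3))               # zero initialization: two hidden neurons
z_zero = W_zero @ x                     # both neurons compute the same value (0)

rng = np.random.default_rng(3)
W_rand = rng.standard_normal((2, 3)) * 0.01  # small random initialization
z_rand = W_rand @ x                     # the two neurons now compute different values
```

Because the zero-initialized neurons receive identical gradients as well, they stay identical forever; the random initialization lets each neuron specialize.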
Random Initialization (Initialized weights randomly)
– This technique tries to address the problems of zero initialization since it prevents neurons from
learning the same features of their inputs since our goal is to make each neuron learn different
functions of its input and this technique gives much better accuracy than zero initialization.
– In general, it is used to break the symmetry. It is better to assign random values except 0 to weights.
– Remember, neural networks are very sensitive and prone to overfitting, as they quickly memorize the
training data.
Now, after reading this technique a new question comes to mind: “What happens if the weights
initialized randomly can be very high or very low?”
(a) Vanishing gradients :
● For any activation function, abs(dW) will get smaller and smaller as we go backward with
every layer during backpropagation, especially in the case of deep neural networks. So, in this
case, the earlier layers’ weights are adjusted slowly.
● Due to this, the weight update is minor which results in slower convergence.
● This makes the optimization of our loss function slow. It might be possible in the worst case,
this may completely stop the neural network from training further.
● More specifically, in the case of the sigmoid and tanh activation functions, if your weights
are very large, then the gradient will be vanishingly small, effectively preventing the weights
from changing their value. This is because abs(dW) will increase only very slightly, or possibly get
smaller and smaller, after every iteration.
● So, here comes the use of the RELU activation function in which vanishing gradients are
generally not a problem as the gradient is 0 for negative (and zero) values of inputs and 1 for
positive values of inputs.
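The vanishing effect can be made concrete with a small numeric sketch: the sigmoid derivative is at most 0.25, so a product of one such factor per layer shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative sigmoid(z) * (1 - sigmoid(z)) peaks at z = 0.
z = 0.0
dmax = sigmoid(z) * (1.0 - sigmoid(z))   # = 0.25, the largest it can ever be

# Even this best case, compounded over 20 layers, is vanishingly small.
grad_through_20_layers = dmax ** 20
```

This is why earlier layers of a deep sigmoid network receive almost no gradient signal, and why ReLU (whose derivative is 1 for positive inputs) largely avoids the problem.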
(b) Exploding gradients :
● This is the exact opposite case of the vanishing gradients, which we discussed above.
● Consider weights that are non-negative and large, with small activations A. When
these weights are multiplied along the different layers, they cause a very large change in
the value of the overall gradient (cost). This means that the changes in W, given by the
equation W = W - ⍺ * dW, will come in huge steps, and the downward movement will overshoot.
Problems occurred due to exploding gradients:
– This problem might result in the oscillation of the optimizer around the minima, or even overshooting
the optimum again and again, so the model will never learn!
– Due to the large values of the gradients, numbers may overflow, resulting in incorrect
computations or the introduction of NaNs (missing values).

Best Practices for Weight Initialization


Use RELU or leaky RELU as the activation function, as both are relatively robust to the
vanishing or exploding gradient problems (especially for networks that are not too deep). Leaky
RELU never has zero gradients, so its neurons never die and training continues.
Use Heuristics for weight initialization: For deep neural networks, we can use any of the following
heuristics to initialize the weights depending on the chosen non-linear activation function.
While these heuristics do not completely solve the exploding or vanishing gradients problems, they
help to reduce it to a great extent. The most common heuristics are as follows:
(a) For the RELU activation function: this heuristic is called He-et-al initialization.
In this heuristic, we multiply the randomly generated values of W by:
sqrt(2 / size of the previous layer)
(b) For the tanh activation function: this heuristic is known as Xavier initialization.
In this heuristic, we multiply the randomly generated values of W by:
sqrt(1 / size of the previous layer)
(c) Another commonly used heuristic multiplies the randomly generated values of W by:
sqrt(2 / (size of the previous layer + size of the current layer))
Benefits of using these heuristics:


Notes By Sarvagya Jain
● All these heuristics serve as good starting points for weight initialization and they reduce the
chances of exploding or vanishing gradients.
● With these heuristics, the gradients neither vanish nor explode too quickly, as the weights are
neither much bigger than 1 nor much smaller than 1.
● They help to avoid slow convergence and ensure that we do not keep oscillating off the
minima.
Gradient Clipping: This is another way of dealing with the exploding gradient problem. In this
technique, we set a threshold value, and if our chosen function of the gradient is larger than this
threshold, we clip the gradient back to the threshold.
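The heuristics and gradient clipping above can be sketched in NumPy (the function names are ours; real frameworks expose their own initializers and clippers):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    """He-et-al initialization: scale by sqrt(2 / size of previous layer). Suited to ReLU."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

def xavier_init(n_in, n_out):
    """Xavier initialization: scale by sqrt(1 / size of previous layer). Suited to tanh."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

def clip_gradient(grad, threshold):
    """Norm-based clipping: rescale the gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

W = he_init(512, 256)                                   # weights scaled by sqrt(2/512)
g = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)  # norm 5 gets rescaled to 1
```

The clipped gradient keeps its direction but its magnitude is capped, which tames the huge update steps described above.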
Batch Normalization
One of the most common problems for data science professionals is avoiding over-fitting. Have you
come across a situation where your model performs very well on the training data but is unable to
predict the test data accurately? The reason is that your model is overfitting, and the solution to such a
problem is regularization.
Regularization techniques help to improve a model and allow it to converge faster. We have
several regularization tools at hand, among them early stopping, dropout, weight initialization
techniques, and batch normalization. Regularization helps prevent over-fitting of the
model, and the learning process becomes more efficient.
Normalization is a data pre-processing tool used to bring the numerical data to a common scale
without distorting its shape.
Generally, when we input the data to a machine or deep learning algorithm we tend to change the
values to a balanced scale. The reason we normalize is partly to ensure that our model can generalize
appropriately.
Batch normalization is a process that makes neural networks faster and more stable by adding
extra layers to a deep neural network. The new layer performs standardizing and normalizing
operations on the input of a layer, coming from a previous layer.
But what is the reason behind the term “Batch” in batch normalization? A typical neural network is
trained using a collected set of input data called batch. Similarly, the normalizing process in batch
normalization takes place in batches, not as a single input.
Let’s understand this through an example, we have a deep neural network as shown in the following
image.

Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the
pre-processing stage. When the input passes through the first layer, it transforms, as a sigmoid function
applied over the dot product of input X and the weight matrix W. Similarly, this transformation will
take place for the second layer and go till the last layer L as shown in the following image.

Although our input X was normalized, with time the output will no longer be on the same scale. As the
data go through multiple layers of the neural network and L activation functions are applied, this leads to
an internal covariate shift in the data.
How does Batch Normalization work?
Since by now we have a clear idea of why we need Batch normalization, let’s understand how it
works. It is a two-step process. First, the input is normalized, and later rescaling and offsetting is
performed.
Normalization of the Input
Normalization is the process of transforming the data to have a mean zero and standard deviation one.
In this step, we have our batch input from layer h; first, we need to calculate the mean of this hidden
activation:
μ = (1/m) Σ hi
Here, m is the number of neurons at layer h.


Once we have the mean, the next step is to calculate the standard deviation of the hidden
activations:
σ = sqrt((1/m) Σ (hi - μ)²)
Further, as we have the mean and the standard deviation ready, we will normalize the hidden
activations using these values. For this, we subtract the mean from each input and divide by the
sum of the standard deviation and the smoothing term (ε):
hnorm = (hi - μ) / (σ + ε)
The smoothing term (ε) assures numerical stability within the operation by preventing division by
zero.
Rescaling and Offsetting
In the final operation, the re-scaling and offsetting of the input take place. Here, two components of the
BN algorithm come into the picture, γ (gamma) and β (beta). These parameters are used for re-scaling
(γ) and shifting (β) the vector containing the values from the previous operations:
hout = γ × hnorm + β
These two are learnable parameters; during training, the neural network ensures that the optimal values of γ
and β are used. That enables the accurate normalization of each batch.
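The two BN steps can be sketched in NumPy, following the formulas as written in these notes (dividing by σ + ε; note that many frameworks instead divide by sqrt(σ² + ε)):

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Batch-normalize hidden activations h of shape (batch, features)."""
    mu = h.mean(axis=0)                  # per-feature mean over the batch
    sigma = h.std(axis=0)                # per-feature standard deviation
    h_norm = (h - mu) / (sigma + eps)    # normalize (these notes add eps to sigma)
    return gamma * h_norm + beta         # learnable rescale (gamma) and shift (beta)

# Two features on wildly different scales
h = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
out = batch_norm(h, gamma=1.0, beta=0.0)
```

After the call, each feature column has roughly mean 0 and standard deviation 1, regardless of its original scale, which is exactly the internal-covariate-shift correction described above.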
Representation Learning
Representation learning is a class of machine learning approaches that allow a system to discover the
representations required for feature detection or classification from raw data. The requirement for
manual feature engineering is reduced by allowing a machine to learn the features and apply them to a
given activity.
In representation learning, data is sent into the machine, and it learns the representation on its own. It
is a way of determining a data representation of the features, the distance function, and the similarity
function that determines how the predictive model will perform. Representation learning works by
reducing high-dimensional data to low-dimensional data, making it easier to discover patterns and
anomalies while also providing a better understanding of the data’s overall behavior.
Basically, Machine learning tasks such as classification frequently demand input that is
mathematically and computationally convenient to process, which motivates representation learning.
Real-world data, such as photos, video, and sensor data, has resisted attempts to define certain
qualities algorithmically. An approach is to examine the data for such traits or representations rather
than depending on explicit techniques.
Methods of Representation Learning
Representation learning can improve the model's performance in different learning frameworks, such as
supervised learning and unsupervised learning.
Supervised Learning
This is referred to as supervised learning: the ML or DL model maps the input X to the output Y.
The computer tries to correct itself by comparing the model output to the ground truth, and the learning
process optimizes the mapping from input to output. This process is repeated until the optimization
function reaches a global minimum.
Using labeled input data, features are learned in supervised feature learning. Supervised neural
networks, multilayer perceptrons, and (supervised) dictionary learning are some examples.
Supervised Methods
● Supervised Dictionary Learning
● Multi-Layer Perceptron
● Neural Networks

Unsupervised Learning
Unsupervised learning is a sort of machine learning in which the labels are ignored in favour of the
observation itself. Unsupervised learning isn’t used for classification or regression; instead, it’s used to
uncover underlying patterns, cluster data, denoise it, detect outliers, and decompose data, among other
things.
Unsupervised feature learning learns features from unlabeled input data; dictionary learning,
independent component analysis, autoencoders, matrix factorization, and
various forms of clustering are among the examples.
Unsupervised Methods
Learning Representation from unlabeled data is referred to as unsupervised feature learning.
Unsupervised Representation learning frequently seeks to uncover low-dimensional features that
encapsulate some structure beneath the high-dimensional input data.
● K-means clustering
● Local Linear Embedding
● Unsupervised Dictionary Mining
Deep Architectures Methods
Deep learning architectures for feature learning are inspired by the hierarchical architecture of the
biological brain system, which stacks numerous layers of learning nodes. The premise of distributed
representation is typically used to construct these architectures: observable data is generated by the
interactions of many diverse components at several levels.
● Restricted Boltzmann Machine (RBMs)
● Autoencoders

HOW DEEP NEURAL NETS LEARN REPRESENTATIONS


In order to explain how deep neural nets learn representations, we must first take a brief detour to
explain how they learn to make predictions, for reasons that will become obvious shortly.
At present, deep neural nets learn to make predictions via an iterative training process, during which
we repeatedly feed sample input data and gradually adjust the behavior of all layers of neurons in the
neural net so they jointly learn to transform input data into good predictions. We, the makers of the
neural net, decide what those predictions should be and also specify precisely how to measure their
goodness to induce learning.
During training, each layer of neurons learns to transform its inputs into a different representation.
This representation is itself an input to subsequent layers, which, in turn, learn to transform it into yet
other representations. Eventually, we reach the final layer, which learns to transform the last
representation into good predictions (however specified). Each layer, in effect, learns to make the next
layer's job a little easier:

In the case of a feedforward deep neural net like the one depicted above, the layers tend to get smaller
(i.e., contain fewer neurons) the deeper we go into the neural net. During training, this neural net is
thus "architecturally biased" to learn to transform input data into progressively smaller representations
that retain the information necessary for making good predictions. The deeper we go into this neural
net, the smaller the representations get, and the more abstract they become. The final layer learns to
transform highly abstract representations of input data into good predictions.
For example, if the initial layer's input contains millions of pixel values representing a digital photo,
the final layer's input, i.e., the last representation, might contain a few thousand values which together
indicate, say, the likely presence or absence of certain kinds of objects in the photo. If the initial layer's
input contains a sequence of bit values representing text, the last representation might contain values
which together indicate, say, the likely presence or absence of certain patterns of thought in the text.
In case you are wondering: yes, this is a form of intelligence -- a rudimentary, human-guided, artificial
form of perceptual intelligence, but intelligence nonetheless. (What else could it be?)
WE DISCARD THE PREDICTIONS AND KEEP THE REPRESENTATIONS
Once a neural net is trained as described above, we can perform surgery on it, so it outputs
representations instead of predictions when presented with new input data. For example, let's say we
have a feedforward neural net that has already learned to make good predictions (however we
specified them during training). We slice off its last layer, as shown below, and voilà! We end up with
a slightly shorter neural net, which, when fed input data, outputs an abstract representation of it.

We can use the representations produced by this shorter neural net for a purpose different from the
prediction objectives we specified to induce learning in the training phase. For example, we could feed
the representations as inputs to a second neural net for a different learning task. The second neural net
would benefit from the representational knowledge learned by the first one -- that is, assuming the
prediction objectives we specified to train the first neural net induced learning of representations
useful for the second one.
If the main or only purpose of training a neural net is to learn useful representations, the discarded
prediction objectives are more accurately described as training objectives.
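The "surgery" described above can be sketched with a toy NumPy feedforward net: running the input through all layers yields a prediction, while stopping one layer short yields the representation (the weights here are random stand-ins for trained ones, and the layer sizes are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)

# A toy feedforward net: 8 -> 5 -> 3 -> 1 (random stand-ins for trained weights)
weights = [rng.standard_normal((5, 8)),
           rng.standard_normal((3, 5)),
           rng.standard_normal((1, 3))]

def forward(x, layers):
    """Pass x through the given list of weight matrices with ReLU activations."""
    for W in layers:
        x = relu(W @ x)
    return x

x = rng.standard_normal(8)
prediction = forward(x, weights)            # full net: a prediction
representation = forward(x, weights[:-1])   # "sliced" net: the last representation
```

Slicing off the final layer is just `weights[:-1]`: the shorter net outputs the 3-value abstract representation, which could then be fed to a second model for a different task.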
Graphics processing unit (GPU) Implementation
A graphics processing unit (GPU) is used to make artificial neural networks faster. It can be used to
implement the matrix multiplications of a neural network to enhance the time performance of, for
example, a text detection system.
There are many types of neural networks; one of the most popular is the multilayer perceptron.
GPU Processing:
GPUs are designed for high-performance rendering, where repeated operations are common. They are
more effective at exploiting parallelism and are more pipelined than a general-purpose CPU; therefore, in
areas where repeated operations are common, a GPU can produce better performance. The mechanism
of general computation on a GPU is as follows:
1. The input is transferred to the GPU as texture or vertex values.
2. The computation is then performed by the vertex shader and pixel shader during a number of
rendering passes.
3. The vertex shader performs a routine for every vertex that involves computing its position, color,
and texture coordinates.
4. The pixel shader is performed by computing the output color of each pixel.
5. Vertex shaders can be used to completely transform the shape of an object; pixel shaders are
used to change the appearance of the pixels.
The parallelism of a GPU is fully utilized by accumulating a large number of input feature vectors and
weight vectors, then converting the many inner-product operations into one matrix operation.
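The batching idea can be sketched in NumPy: many separate inner products collapse into a single matrix multiplication, which is exactly the kind of operation GPU hardware executes efficiently (the sizes below are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(2)
features = rng.standard_normal((100, 64))   # 100 input feature vectors
weights = rng.standard_normal((64, 16))     # weight vectors of 16 neurons

# Many separate inner products, one per (input, neuron) pair...
loop_out = np.array([[f @ weights[:, j] for j in range(16)]
                     for f in features])

# ...become one matrix multiplication over the accumulated batch
matmul_out = features @ weights
```

Both computations produce the same numbers, but the single matrix product exposes all the parallelism at once instead of issuing thousands of tiny operations.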