DL Unit - 1 Notes
AI vs ML vs DL:
AI (Artificial Intelligence), ML (Machine Learning), and DL (Deep Learning) are three closely related
but distinct concepts in the field of computer science and data science.
Artificial Intelligence (AI): Developing machines to mimic human intelligence and behavior. It
includes reasoning, problem-solving, understanding natural language, recognizing patterns, planning,
and decision-making.
Machine Learning (ML): Algorithms that learn from structured data to predict outputs and discover
patterns in that data. ML encompasses various techniques, including supervised learning, unsupervised
learning, and reinforcement learning, among others.
Deep Learning (DL): Algorithms based on highly complex neural networks that mimic the way a
human brain works to detect patterns in large unstructured data sets. DL architectures include
convolutional neural networks (CNNs) for image processing and recurrent neural networks (RNNs)
for sequential data.
In reality, it is not just a couple of neurons that do the decision-making. Our brain contains a massively
parallel, interconnected network of about 10¹¹ (100 billion) neurons, and their connections are not nearly
as simple.
The sense organs pass information to the first/lowest layer of neurons, which processes it. The output
of that processing is passed on to the next layers in a hierarchical manner: some of the neurons fire
and some won't, and this process goes on until it results in a final response, in this case laughter.
This massively parallel network also ensures a division of work. Each neuron fires only when its
intended criterion is met, i.e., a neuron may perform a certain role for a certain stimulus.
These inputs can either be excitatory or inhibitory. Inhibitory inputs are those that have an overriding
effect on the decision, irrespective of the other inputs: if x_3 is 1 (not at home), then the output
will always be 0, i.e., the neuron will never fire, so x_3 is an inhibitory input. Excitatory inputs are
NOT the ones that will make the neuron fire on their own, but they may make it fire when combined
together. Formally, this is what is going on:
g(x) = x1 + x2 + ... + xn
y = f(g(x)) = 1 if g(x) ≥ θ, else 0

We can see that g(x) is just doing a sum of the inputs, a simple aggregation, and θ (theta) is called the
thresholding parameter. For example, if I always watch the game when the sum turns out to be 2 or
more, then θ = 2. This is called Thresholding Logic.
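A minimal Python sketch of this thresholding logic (the function name mp_neuron and the explicit inhibitory flags are illustrative, not from the notes):

def mp_neuron(inputs, inhibitory, theta):
    # If any inhibitory input is on, the neuron never fires.
    if any(x == 1 and inh for x, inh in zip(inputs, inhibitory)):
        return 0
    # Otherwise aggregate the inputs and compare against the threshold theta.
    return 1 if sum(inputs) >= theta else 0

# Watch-the-game example: fire when at least 2 excitatory inputs are on
print(mp_neuron([1, 1, 0], [False, False, True], 2))  # 1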
Key characteristics of the McCulloch-Pitts neuron model include:
Binary Activation: In the M-P neuron model, both the inputs and the neuron's output are
binary, taking on values of 0 or 1. This binary nature simplifies the mathematical description.
Inputs and Weights: The neuron receives inputs from other neurons or external sources. Each
input is associated with a weight, which represents the importance or strength of that input.
These weights can be either excitatory (positive) or inhibitory (negative).
Threshold: The neuron has a threshold value, often denoted as "θ" (theta). The inputs are
linearly combined with their respective weights, and if this sum reaches or exceeds the threshold, the
neuron fires (outputs 1); otherwise, it remains inactive (outputs 0).
OR Operation:
● To implement the OR operation using McCulloch-Pitts neurons, you need two binary
inputs, x1 and x2, and a threshold (θ) of 1.
● Assign a weight of 1 to both inputs.
● If at least one input is 1, the sum of the weighted inputs reaches the threshold (≥ 1), and the
neuron outputs 1; otherwise, it outputs 0.
x1 | x2 | Output
----------------
 0 |  0 |   0
 0 |  1 |   1
 1 |  0 |   1
 1 |  1 |   1
NOT Operation:
● The NOT operation is a unary operation that inverts a single binary input.
● To implement the NOT operation using McCulloch-Pitts neurons, you need one binary
input, x, and a threshold (θ) of 0.
● Assign a weight of -1 to the input.
● If the input is 0, the weighted sum (0) reaches the threshold, and the neuron outputs 1; if
the input is 1, the weighted sum (-1) falls below the threshold, and the neuron outputs
0.
x | Output
-----------
0 | 1
1 | 0
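Both gates can be checked against the mp_neuron sketch above (here the NOT gate's -1 weight is modeled with the inhibitory flag, an equivalent formulation):

# OR: two excitatory inputs, threshold theta = 1
for a in (0, 1):
    for b in (0, 1):
        assert mp_neuron([a, b], [False, False], 1) == (1 if (a or b) else 0)

# NOT: one inhibitory input, threshold theta = 0
assert mp_neuron([0], [True], 0) == 1
assert mp_neuron([1], [True], 0) == 0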
These are basic logical operations that can be implemented with McCulloch-Pitts neurons. The model
is limited to binary inputs and binary outputs and is primarily used for demonstrating fundamental
concepts in neural network theory.
Real-world neural networks, including perceptrons and deep neural networks, use more complex
operations and continuous activation functions to perform a wide range of tasks, including pattern
recognition and machine learning.
XOR Operation:
x1 | x2 | Output
----------------
 0 |  0 |   0
 0 |  1 |   1
 1 |  0 |   1
 1 |  1 |   0
As you can see, XOR produces an output of 1 when exactly one of the inputs is 1, and it produces an
output of 0 when both inputs are the same (either both 0 or both 1).
MLP Architecture:
Let's build a simple MLP to solve the XOR problem:
● Input layer with two neurons (corresponding to the two binary inputs).
● One hidden layer with two neurons, using a non-linear activation function like the sigmoid or
hyperbolic tangent (tanh).
● Output layer with one neuron and a sigmoid activation function to produce a binary
classification output (0 or 1).
Training:
We train the MLP on the XOR dataset using binary cross-entropy loss and backpropagation. As
training progresses, the MLP adjusts its weights and biases to minimize the loss, effectively learning to
represent the XOR function.
Result:
After training, the MLP successfully learns to approximate the XOR function, modeling the non-linear
decision boundary that a single-layer perceptron cannot.
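A minimal Keras sketch of this MLP (the layer sizes and activations follow the description above; the optimizer, learning rate, and epoch count are illustrative assumptions, and convergence can depend on the random initialization):

import numpy as np
import tensorflow as tf

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation="tanh", input_shape=(2,)),  # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),                 # output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="binary_crossentropy")
model.fit(X, y, epochs=500, verbose=0)
print(model.predict(X).round())  # expected: [[0], [1], [1], [0]]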
An example of a network with two layers to implement XOR using perceptrons:
# Perceptron building blocks
def AND(x1, x2):
    w1, w2, bias = 1, 1, -1.5
    activation = w1*x1 + w2*x2 + bias
    return 1 if activation >= 0 else 0

def OR(x1, x2):
    w1, w2, bias = 1, 1, -0.5
    activation = w1*x1 + w2*x2 + bias
    return 1 if activation >= 0 else 0

def NAND(x1, x2):
    w1, w2, bias = -1, -1, 1.5
    activation = w1*x1 + w2*x2 + bias
    return 1 if activation >= 0 else 0

# Two-layer network: the first layer computes OR and NAND,
# the second layer ANDs them together to give XOR
def XOR(x1, x2):
    return AND(OR(x1, x2), NAND(x1, x2))
A sigmoid neuron, also known as a logistic neuron or logistic regression unit, is a type of artificial
neuron used in neural networks and machine learning. It is called "sigmoid" due to its activation
function, which is the sigmoid function (denoted σ):

σ(z) = 1 / (1 + e^(−z)), where z = w·x + b is the weighted sum of the inputs.

The sigmoid function takes the weighted sum of inputs and maps it to a range between 0 and 1, which
makes it suitable for binary classification problems and as an activation function in neural networks.
When the weighted sum is large and positive, the sigmoid output approaches 1, and when the weighted
sum is large and negative, the sigmoid output approaches 0. When the weighted sum is close to zero,
the sigmoid output is approximately 0.5.
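A quick NumPy check of these three regimes (the input values are chosen only for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10.0))   # ≈ 1.0: large positive weighted sum
print(sigmoid(-10.0))  # ≈ 0.0: large negative weighted sum
print(sigmoid(0.0))    # 0.5: weighted sum of zero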
Key characteristics of sigmoid neurons include:
Non-linearity: The sigmoid function introduces non-linearity into the neural network, allowing
it to model complex relationships in the data. This is crucial for solving tasks that are not
linearly separable.
Smoothness: The sigmoid function is smooth and differentiable, which is important for
gradient-based optimization algorithms like backpropagation. The smoothness allows for
efficient training of neural networks.
Binary Classification: Sigmoid neurons are often used in the output layer of a neural network
for binary classification tasks, where the goal is to predict one of two classes (e.g., 0 or 1).
Learning Algorithm
In this section, we will discuss an algorithm for learning the parameters w and b of the sigmoid neuron
model by using the gradient descent algorithm.
Loss Optimization
We will keep doing the update operation until we are satisfied, where "till satisfied" could mean any of
the following:
● The overall loss of the model becomes zero.
● The overall loss of the model becomes a very small value close to zero.
● We have iterated for a fixed number of passes, based on computational capacity.
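A minimal NumPy sketch of this learning loop for a single sigmoid neuron with a squared-error loss (the learning rate eta, the epoch count, and the function names are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_sigmoid_neuron(X, y, eta=0.5, epochs=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)                   # forward pass
        # dLoss/dz for the squared-error loss, via the sigmoid derivative
        delta = (y_hat - y) * y_hat * (1.0 - y_hat)
        w -= eta * (X.T @ delta) / len(y)            # gradient descent step on w
        b -= eta * delta.mean()                      # gradient descent step on b
    return w, b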
However, sigmoid neurons have some limitations, such as the vanishing gradient problem, which can
make training deep networks with sigmoid activations challenging. As a result, other activation
functions like the rectified linear unit (ReLU) and its variants have become more popular in modern
deep learning architectures.
Despite their limited use in hidden layers of deep neural networks, sigmoid neurons still play a role in
certain applications and can be useful in specific scenarios, especially when dealing with binary
classification problems or when the output needs to be in the [0, 1] range.
Program
In this program:
● We define the input data (features) and labels (ground truth) for the XOR problem.
● We create TensorFlow variables for the weights and bias of the sigmoid neuron.
● The sigmoid_neuron function defines the forward pass operation of the sigmoid neuron.
● We define the binary cross-entropy loss function for training (computed from the pre-activation logits for numerical stability).
● We use stochastic gradient descent (SGD) as the optimizer.
● The training loop iterates through epochs, computes gradients, and updates the weights and bias.
● Finally, we test the trained sigmoid neuron with the XOR input data.
Note that a single sigmoid neuron has a linear decision boundary, so it cannot represent XOR exactly;
expect the loss to plateau and the predictions to hover near 0.5. This is precisely the limitation that the
hidden layer in the earlier MLP overcomes.
import tensorflow as tf
import numpy as np

# Define the input data (features)
input_data = np.array([[0.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 0.0],
                       [1.0, 1.0]], dtype=np.float32)

# Define the labels (ground truth)
labels = np.array([[0.0], [1.0], [1.0], [0.0]], dtype=np.float32)

# Define the weights and bias for the sigmoid neuron
weights = tf.Variable(tf.random.normal([2, 1], mean=0.0, stddev=1.0), dtype=tf.float32)
bias = tf.Variable(tf.zeros([1]), dtype=tf.float32)

# Forward pass: pre-activation (logits) and sigmoid output
def logits(x):
    return tf.matmul(x, weights) + bias

def sigmoid_neuron(x):
    return tf.sigmoid(logits(x))

# Define the loss function (binary cross-entropy).
# Note: sigmoid_cross_entropy_with_logits expects the pre-activation logits,
# not the sigmoid outputs, so we pass logits to avoid applying sigmoid twice.
def loss(y_true, y_logits):
    return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_logits))

# Define the optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Training loop
epochs = 10000
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        current_loss = loss(labels, logits(input_data))
    gradients = tape.gradient(current_loss, [weights, bias])
    optimizer.apply_gradients(zip(gradients, [weights, bias]))
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}: Loss = {current_loss.numpy()}")

# Test the trained sigmoid neuron
test_input = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], dtype=np.float32)
predicted_output = sigmoid_neuron(test_input).numpy()
print("Predicted Output:")
print(predicted_output)
The neural network can compare the outputs of its nodes with the desired values using a rule
known as the delta rule, allowing the network to alter its weights through training to create more
accurate output values. This training and learning procedure is carried out via gradient descent. The
technique for updating weights in multi-layered perceptrons is virtually the same; however, the process
is referred to as back-propagation. In such networks, the output values produced by the final layer are
used to adjust the weights of each hidden layer inside the network.
Backpropagation
Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for errors
working back from output nodes to input nodes. It is an important mathematical tool for improving the
accuracy of predictions in data mining and machine learning. Essentially, backpropagation is an
algorithm used to calculate derivatives quickly.
There are two leading types of backpropagation networks:
1. Static backpropagation. Static backpropagation is a network developed to map static inputs
to static outputs. Static backpropagation networks can solve static classification problems,
such as optical character recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is used for fixed-point
learning. Recurrent backpropagation activation feeds forward until it reaches a fixed value.
Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the
error with respect to the weight values for the various inputs. By comparing desired outputs with the
outputs the system actually achieves, the system is tuned by adjusting its connection weights to narrow
the difference between the two as much as possible.
The main features of backpropagation are the iterative, recursive, and efficient method by which it
calculates the updated weights, improving the network until it is able to perform the task for which it
is being trained. Backpropagation requires that the derivatives of the activation functions be known at
network design time.
Now, how is the error function used in backpropagation, and how does backpropagation work? Let's
start with an example and do it mathematically to understand exactly how the weights are updated
using backpropagation.

Consider a 2-2-2 network: inputs x1 and x2 feed the hidden neurons H1 and H2 through the weights
w1, w2, w3, w4 and the bias b1; the hidden outputs feed the output neurons y1 and y2 through the
weights w5, w6, w7, w8 and the bias b2; the target values are T1 and T2.

Input values: x1, x2
Initial weights: w1, w2, w3, w4 (input to hidden) and w5, w6, w7, w8 (hidden to output)
Bias values: b1 (hidden layer), b2 (output layer)
Target values: T1, T2

Now, we first calculate the values of H1 and H2 by a forward pass.
Forward Pass
To find the net input to H1, we first multiply the input values by the weights and add the bias:

H1 = x1×w1 + x2×w2 + b1

To calculate the final output of H1, we apply the sigmoid function:

out_H1 = 1 / (1 + e^(−H1))

H2 and out_H2 are computed the same way with w3 and w4. Now, we calculate the values of y1 and
y2 in the same way, using the outcomes of H1 and H2 as the inputs:

y1 = out_H1×w5 + out_H2×w6 + b2

To calculate the final result of y1, we again apply the sigmoid function:

out_y1 = 1 / (1 + e^(−y1))

Our out_y1 and out_y2 values will generally not match the target values T1 and T2, so we compute
the total error, which is simply the sum of squared differences between the outputs and the targets:

E_total = E1 + E2 = ½(T1 − out_y1)² + ½(T2 − out_y2)²

Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer
To update a weight, we calculate the error corresponding to that weight by differentiating the total
error with respect to it. We cannot differentiate E_total directly with respect to w5, because w5 does
not appear in it explicitly, so we split the derivative into multiple terms using the chain rule:

∂E_total/∂w5 = (∂E_total/∂out_y1) × (∂out_y1/∂y1) × (∂y1/∂w5)

Now, we calculate each term one by one:

∂E_total/∂out_y1 = −(T1 − out_y1)
∂out_y1/∂y1 = out_y1 × (1 − out_y1)      (the derivative of the sigmoid)
∂y1/∂w5 = out_H1

We put these values back into the chain-rule product to find the final result. Now, we calculate the
updated weight w5new with the help of the following formula, where η is the learning rate:

w5new = w5 − η × ∂E_total/∂w5

In the same way, we calculate w6new, w7new, and w8new.
Backward pass at the hidden layer
Now, we backpropagate to our hidden layer and update the weights w1, w2, w3, and w4, just as we
did with the weights w5, w6, w7, and w8.

We will calculate the error at w1. Again, we cannot differentiate E_total directly with respect to w1,
so we split the derivative into multiple terms using the chain rule:

∂E_total/∂w1 = (∂E_total/∂out_H1) × (∂out_H1/∂H1) × (∂H1/∂w1)

The first term must itself be split, because out_H1 affects the total error through both outputs:

∂E_total/∂out_H1 = ∂E1/∂out_H1 + ∂E2/∂out_H1

We split each of these once more, because there is no out_H1 term directly in E1 and E2; they depend
on it through y1 and y2:

∂E1/∂out_H1 = (∂E1/∂out_y1) × (∂out_y1/∂y1) × (∂y1/∂out_H1), with ∂y1/∂out_H1 = w5
∂E2/∂out_H1 = (∂E2/∂out_y2) × (∂out_y2/∂y2) × (∂y2/∂out_H1), with ∂y2/∂out_H1 = w7

We calculate the partial derivative of the total net input to H1 with respect to w1 the same way as we
did for the output neuron:

∂out_H1/∂H1 = out_H1 × (1 − out_H1)
∂H1/∂w1 = x1

Putting these values together gives ∂E_total/∂w1. Now, we calculate the updated weight w1new with
the help of the following formula:

w1new = w1 − η × ∂E_total/∂w1

In the same way, we calculate w2new, w3new, and w4new, which completes one round of weight
updates.
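A NumPy sketch of one complete forward and backward pass for this 2-2-2 network (the numeric values of the inputs, weights, biases, targets, and learning rate are illustrative assumptions, not from the notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumptions)
x = np.array([0.05, 0.10])                    # x1, x2
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])   # rows: H1 (w1, w2), H2 (w3, w4)
W2 = np.array([[0.40, 0.45], [0.50, 0.55]])   # rows: y1 (w5, w6), y2 (w7, w8)
b1, b2 = 0.35, 0.60
T = np.array([0.01, 0.99])                    # T1, T2
eta = 0.5

# Forward pass
out_H = sigmoid(W1 @ x + b1)
out_y = sigmoid(W2 @ out_H + b2)
E_total = 0.5 * np.sum((T - out_y) ** 2)

# Backward pass (the chain-rule factors derived above)
delta_y = (out_y - T) * out_y * (1 - out_y)        # dE/dy_net
dW2 = np.outer(delta_y, out_H)                     # dE/dw5..w8
delta_H = (W2.T @ delta_y) * out_H * (1 - out_H)   # dE/dH_net
dW1 = np.outer(delta_H, x)                         # dE/dw1..w4

# Weight updates
W2 -= eta * dW2
W1 -= eta * dW1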
Weight Initialization
Its main objective is to prevent layer activations from exploding or vanishing during forward
propagation. If either problem occurs, the loss gradients will be either too large or too small, and the
network will take longer to converge, if it is able to converge at all.
If we initialize the weights correctly, then our objective, i.e., optimization of the loss function, will be
achieved in the least time; otherwise, converging to a minimum using gradient descent becomes much
harder or practically impossible.
For the tanh activation function, a common heuristic is Xavier initialization. In this heuristic, we
multiply the randomly generated values of W by √(1/n), where n is the number of neurons in the
previous layer.
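A minimal NumPy sketch of this heuristic (the layer sizes are illustrative):

import numpy as np

n_prev, n_curr = 256, 128   # neurons in the previous / current layer
W = np.random.randn(n_prev, n_curr) * np.sqrt(1.0 / n_prev)  # Xavier scaling
b = np.zeros(n_curr)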
Initially, our inputs X1, X2, X3, X4 are in normalized form, as they come from the pre-processing
stage. When the input passes through the first layer, it is transformed, as a sigmoid function is applied
over the dot product of the input X and the weight matrix W. The same transformation takes place at
the second layer, and so on up to the last layer L. Although our input X was normalized, over time the
outputs will no longer be on the same scale: as the data goes through multiple layers of the neural
network, with L activation functions applied, this leads to an internal covariate shift in the data.
How does Batch Normalization work?
Since by now we have a clear idea of why we need Batch normalization, let’s understand how it
works. It is a two-step process. First, the input is normalized, and later rescaling and offsetting is
performed.
Normalization of the Input
Normalization is the process of transforming the data to have mean zero and standard deviation one.
In this step we take our batch input from layer h and first calculate the mean of the hidden activations
over the mini-batch:

μ = (1/m) Σᵢ hᵢ

and, similarly, the variance σ² of the activations. With the mean and the standard deviation ready, we
normalize the hidden activations: we subtract the mean from each input and divide by the square root
of the variance plus a smoothing term ε:

ĥᵢ = (hᵢ − μ) / √(σ² + ε)

The smoothing term ε ensures numerical stability within the operation by preventing a division by
zero.
Rescaling and Offsetting
In the final operation, the re-scaling and offsetting of the normalized input take place. Here two
components of the BN algorithm come into the picture, γ (gamma) and β (beta). These parameters are
used for re-scaling (γ) and shifting (β) of the vector containing the values from the previous operation:

yᵢ = γ × ĥᵢ + β

These two are learnable parameters; during training, the neural network ensures that the optimal values
of γ and β are used, which enables the accurate normalization of each batch.
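A minimal NumPy sketch of this two-step forward pass (the function and variable names are illustrative; in a real framework, γ and β would be updated by the optimizer during training):

import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=0)                     # per-feature mean over the batch
    var = h.var(axis=0)                     # per-feature variance over the batch
    h_hat = (h - mu) / np.sqrt(var + eps)   # step 1: normalize
    return gamma * h_hat + beta             # step 2: rescale (gamma) and shift (beta)

# Example: a batch of 4 hidden activations with 3 features each
h = np.random.randn(4, 3) * 5.0 + 10.0
out = batch_norm_forward(h, gamma=np.ones(3), beta=np.zeros(3))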
Representation Learning
Representation learning is a class of machine learning approaches that allow a system to discover the
representations required for feature detection or classification from raw data. The requirement for
manual feature engineering is reduced by allowing a machine to learn the features and apply them to a
given task.
In representation learning, data is sent into the machine, and it learns the representation on its own. It
is a way of determining a data representation of the features, the distance function, and the similarity
function that determines how the predictive model will perform. Representation learning works by
reducing high-dimensional data to low-dimensional data, making it easier to discover patterns and
anomalies while also providing a better understanding of the data’s overall behavior.
Machine learning tasks such as classification frequently demand input that is mathematically and
computationally convenient to process, which is what motivates representation learning. Real-world
data, such as photos, video, and sensor data, has resisted attempts to define its relevant qualities
algorithmically. An alternative approach is to let the system discover such traits or representations by
examining the data, rather than depending on explicit hand-crafted techniques.
Methods of Representation Learning
Representation learning can improve the model's performance in two different learning frameworks:
supervised learning and unsupervised learning.
Supervised Learning
This is referred to as supervised learning: the ML or DL model maps the input X to the output Y.
The computer tries to correct itself by comparing the model output with the ground truth, and the
learning process optimizes the mapping from input to output. This process is repeated until the
optimization converges (ideally to the global minimum).
Using labeled input data, features are learned in supervised feature learning. Supervised neural
networks, multilayer perceptrons, and (supervised) dictionary learning are some examples.
Supervised Methods
● Supervised Dictionary Learning
● Multi-Layer Perceptron
● Neural Networks
Unsupervised Learning
Unsupervised learning is a sort of machine learning in which the labels are ignored in favour of the
observation itself. Unsupervised learning isn’t used for classification or regression; instead, it’s used to
uncover underlying patterns, cluster data, denoise it, detect outliers, and decompose data, among other
things.
Unsupervised feature learning learns features from unlabeled input data; dictionary learning,
independent component analysis, autoencoders, matrix factorization, and various forms of clustering
are among the examples.
Unsupervised Methods
Learning Representation from unlabeled data is referred to as unsupervised feature learning.
Unsupervised Representation learning frequently seeks to uncover low-dimensional features that
encapsulate some structure beneath the high-dimensional input data.
● K-means clustering
● Locally Linear Embedding
● Unsupervised Dictionary Learning
Deep Architectures Methods
Deep learning architectures for feature learning are inspired by the hierarchical architecture of the
biological brain system, which stacks numerous layers of learning nodes. The premise of distributed
representation is typically used to construct these architectures: observable data is generated by the
interactions of many diverse components at several levels.
● Restricted Boltzmann Machines (RBMs)
● Autoencoders
In the case of a feedforward deep neural net like the one described here, the layers tend to get smaller
(i.e., contain fewer neurons) the deeper we go into the network. During training, this neural net is thus
"architecturally biased" to learn to transform input data into progressively smaller representations
that retain the information necessary for making good predictions. The deeper we go into this neural
net, the smaller the representations get, and the more abstract they become. The final layer learns to
transform highly abstract representations of the input data into good predictions.
For example, if the initial layer's input contains millions of pixel values representing a digital photo,
the final layer's input, i.e., the last representation, might contain a few thousand values which together
indicate, say, the likely presence or absence of certain kinds of objects in the photo. If the initial layer's
input contains a sequence of bit values representing text, the last representation might contain values
which together indicate, say, the likely presence or absence of certain patterns of thought in the text.
In case you are wondering: yes, this is a form of intelligence -- a rudimentary, human-guided, artificial
form of perceptual intelligence, but intelligence nonetheless. (What else could it be?)
WE DISCARD THE PREDICTIONS AND KEEP THE REPRESENTATIONS
Once a neural net is trained as described above, we can perform surgery on it so that it outputs
representations instead of predictions when presented with new input data. For example, let's say we
have a feedforward neural net that has already learned to make good predictions (however we
specified them during training). We slice off its last layer, and voilà! We end up with a slightly shorter
neural net which, when fed input data, outputs an abstract representation of it.
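In tf.keras, this "surgery" amounts to building a new model that reuses the trained layers but stops one layer short. A hedged sketch (the stand-in model and data here are illustrative; in practice, model would be your already-trained network):

import numpy as np
import tensorflow as tf

# Stand-in for an already-trained network (illustrative shapes)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4, activation="relu"),    # representation layer
    tf.keras.layers.Dense(1, activation="sigmoid"), # prediction layer to slice off
])

# "Surgery": same input, but the output is taken from the second-to-last layer
feature_extractor = tf.keras.Model(inputs=model.input,
                                   outputs=model.layers[-2].output)

new_input_data = np.random.randn(5, 8).astype(np.float32)
representations = feature_extractor(new_input_data)  # shape (5, 4): representations, not predictions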
We can use the representations produced by this shorter neural net for a purpose different from the
prediction objectives we specified to induce learning in the training phase. For example, we could feed
the representations as inputs to a second neural net for a different learning task. The second neural net
would benefit from the representational knowledge learned by the first one -- that is, assuming the
prediction objectives we specified to train the first neural net induced learning of representations
useful for the second one.
If the main or only purpose of training a neural net is to learn useful representations, the discarded
prediction objectives are more accurately described as training objectives.
Graphics processing unit (GPU) Implementation
A graphics processing unit (GPU) can be used to make an artificial neural network faster: it
implements the matrix multiplications of the neural network in parallel, enhancing, for example, the
time performance of a text detection system.
There are many types of neural networks; one of the most popular is the multilayer perceptron.
GPU Processing:
GPUs are designed for high-performance rendering, where repeated operations are common. They are
more effective at exploiting parallelism and are more deeply pipelined than general-purpose CPUs;
therefore, in areas where repeated operations are common, a GPU can deliver better performance. The
mechanism of general-purpose computation on a GPU is as follows:
1. The input is transferred to the GPU as texture or vertex values.
2. The computation is then performed by the vertex shader and pixel shader over a number of
rendering passes.
3. The vertex shader runs a routine for every vertex that computes its position, color,
and texture coordinates.
4. The pixel shader runs per pixel and outputs the color of that pixel.
5. Vertex shaders can be used to completely transform the shape of an object; pixel shaders are
used to change the appearance of the pixels.
The parallelism of a GPU is fully utilized by accumulating a lot of input feature vectors and
weight vectors, then converting the many inner-product operations into one matrix operation.
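A NumPy illustration of this accumulation trick (the shapes are illustrative): many per-vector inner products collapse into a single matrix multiplication, which is exactly the kind of operation a GPU parallelizes well.

import numpy as np

X = np.random.randn(1024, 256)  # 1024 input feature vectors, 256 features each
W = np.random.randn(256, 128)   # weight vectors of 128 neurons, one per column

# Instead of 1024 × 128 separate inner products, one matrix product:
Z = X @ W                       # shape (1024, 128)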