UNIT-3
Classification in Machine Learning
Classification may be defined as the process of predicting the class or category of an observation from its given data points (features). The categorized output takes a discrete form, such as "Black" or "White", or "spam" or "not spam".
Classification in machine learning is a supervised learning technique where an algorithm is
trained with labelled data to predict the category of new data.
Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It belongs to supervised machine learning, in which target labels are provided along with the input data set.
Types of Classification
There are different types of classification problems depending on how many categories (or classes) we are working with and how they are organized. The main classification types in machine learning are:
1. Binary Classification
This is the simplest kind of classification. In binary classification, the goal is to sort the data
into two distinct categories. Think of it like a simple choice between two options. Imagine a
system that sorts emails into either spam or not spam. It works by looking at different
features of the email like certain keywords or sender details, and decides whether it’s spam or
not. It only chooses between these two options.
2. Multiclass Classification
Here, instead of just two categories, the data needs to be sorted into more than two categories.
The model picks the one that best matches the input. Think of an image recognition system that
sorts pictures of animals into categories like cat, dog, and bird.
Basically, the machine looks at the features in the image (like shape, color, or texture) and chooses which animal the picture is most likely to be, based on the training it received.
(Figure: Binary classification vs. multiclass classification)
3. Multi-Label Classification
In multi-label classification, a single piece of data can belong to multiple categories at once. Unlike multiclass classification, where each data point belongs to only one class, multi-label classification allows data points to belong to multiple classes. A movie recommendation
system could tag a movie as both action and comedy. The system checks various features (like
movie plot, actors, or genre tags) and assigns multiple labels to a single piece of data, rather
than just one.
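As an illustration, the sketch below shows one common way to set up multi-label classification with scikit-learn, fitting one binary classifier per label column; the movie-style feature values and the label matrix are invented for this example.

```python
# Hypothetical multi-label sketch: one binary classifier per label column.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Invented numeric movie features; label columns = [action, comedy],
# and a row may have more than one 1 (both labels at once).
X = np.array([[8, 1], [7, 2], [1, 9], [2, 8], [6, 6], [5, 7]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]])

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict([[7, 6]]))  # may predict both labels at once, e.g. [[1 1]]
```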
Working of Classification
Classification involves training a model using a labelled dataset, where each input is paired with its correct output label. The model learns patterns and relationships in the data so that it can later predict the category or class of new, unseen inputs. Here's how it works:
1. Data Collection: You start with a dataset where each item is labelled with the correct
class (for example, "cat" or "dog").
2. Feature Extraction: The system identifies features (like color, shape, or texture) that
help distinguish one class from another. These features are what the model uses to make
predictions.
3. Model Training: The classification algorithm uses the labelled data to learn how to map the features to the correct class. It looks for patterns and relationships in the data.
4. Model Evaluation: Once the model is trained, it's tested on new, unseen data to check
how accurately it can classify the items.
5. Prediction: After being trained and evaluated, the model can be used to predict the
class of new data based on the features it has learned.
6. Metric Selection: Evaluating a classification model properly is a key step in machine learning. Depending on the problem and needs, we can use different metrics (such as accuracy, precision, recall or F1-score) to measure how well the model performs and how good it is at handling new, unseen data.
If the quality metric is not satisfactory, the ML algorithm or hyperparameters can be adjusted,
and the model is retrained. This iterative process continues until a satisfactory performance is
achieved. In short, classification in machine learning is all about using existing labelled data to
teach the model how to predict the class of new, unlabelled data based on the patterns it has
learned.
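The sketch below walks through this workflow with scikit-learn, assuming the library is available; the built-in Iris dataset and the logistic-regression classifier are only stand-ins for "some labelled data" and "some classification algorithm".

```python
# A sketch of the steps above: load labelled data, train, evaluate, predict.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # 1-2: labelled data and features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200)               # 3: model training
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                         # 5: prediction on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))     # 4/6: evaluation metric
```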
Classification Algorithms
There are various types of classification algorithms. Some of them are:
K-Nearest Neighbours (KNN) Algorithm
K-nearest neighbours (KNN) algorithm is a type of supervised ML algorithm which can be
used for both classification as well as regression predictive problems. However, it is mainly
used for classification predictive problems in industry. The main idea behind KNN is to find
the k-nearest data points to a given test data point and use these nearest neighbours to make a
prediction. The value of k is a hyperparameter that needs to be tuned, and it represents the
number of neighbours to consider.
For classification problems, the KNN algorithm assigns the test data point to the class that
appears most frequently among the k-nearest neighbours. In other words, the class with the
highest number of neighbours is the predicted class.
The following two properties define KNN well −
• Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; instead, it uses all of the training data at classification time.
• Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn't assume anything about the underlying data.
Working of K-Nearest Neighbours Algorithm
The K-nearest neighbours (KNN) algorithm uses 'feature similarity' to predict the values of new data points, which means that a new data point is assigned a value based on how closely it matches the points in the training set. We can understand its working with the help of the following steps −
• Step 1 − For implementing any algorithm, we need a dataset. So, during the first step of KNN, we must load the training as well as the test data.
• Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.
• Step 3 − For each point in the test data, do the following −
3.1 − Calculate the distance between the test point and each row of the training data using any of the distance measures, namely Euclidean, Manhattan or Hamming distance. The most commonly used distance measure is Euclidean.
3.2 − Now, based on the distance values, sort them in ascending order.
3.3 − Next, choose the top K rows from the sorted array.
3.4 − Now, assign a class to the test point based on the most frequent class of these rows.
• Step 4 − End
Example
The following is an example to understand the concept of K and working of KNN algorithm −
Suppose we have a dataset which can be plotted as follows −
Now, we need to classify a new data point, shown as a black dot (at point (60, 60)), into the blue or the red class. We are assuming K = 3, i.e. the algorithm would find the three nearest data points, as shown in the next diagram.
We can see in the above diagram the three nearest neighbours of the black-dot data point. Among those three, two of them lie in the red class, hence the black dot will also be assigned to the red class.
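A minimal sketch of this example is shown below; the blue and red training points are made up, since the original figure is not reproduced here, but the new point (60, 60) and K = 3 follow the text.

```python
# KNN sketch for the example above; the blue/red training points are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[20, 35], [30, 40], [45, 55], [55, 65], [70, 70], [75, 80]])
y_train = np.array(["blue", "blue", "blue", "red", "red", "red"])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3 nearest neighbours
knn.fit(X_train, y_train)

# The new black-dot point (60, 60): two of its three neighbours are red here,
# so the majority vote assigns it to the red class.
print(knn.predict([[60, 60]]))
```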
Pros of KNN
• It is a very simple algorithm to understand and interpret.
• It is very useful for nonlinear data because there is no assumption about data in this
algorithm.
• It is a versatile algorithm as we can use it for classification as well as regression.
• It has relatively high accuracy but there are much better supervised learning models
than KNN.
Cons of KNN
• It is a computationally expensive algorithm because it stores all of the training data.
• It requires high memory storage compared to other supervised learning algorithms.
• Prediction is slow when the training set (N) is large.
• It is very sensitive to the scale of the data as well as to irrelevant features.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully
Banking System
KNN can be used in the banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of previous defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual's credit rating by comparing the individual with persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'" or "Will Vote for Party 'BJP'".
Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. It tries to find the best boundary known as hyperplane that
separates different classes in the data. It is useful when you want to do binary classification
like spam vs. not spam or cat vs. dog.
The main goal of SVM is to maximize the margin between the two classes. The larger the
margin the better the model performs on new and unseen data.
Key Concepts of Support Vector Machine
• Hyperplane: A decision boundary separating different classes in feature space, represented by the equation wx + b = 0 in linear classification.
• Support Vectors: The closest data points to the hyperplane, crucial for determining the
hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
• Kernel: A function that maps data to a higher-dimensional space enabling SVM to
handle non-linearly separable data.
• Hard Margin: A maximum-margin hyperplane that perfectly separates the data without
misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack variables, balancing
margin maximization and misclassification penalties when data is not perfectly
separable.
• C: A regularization term balancing margin maximization and misclassification
penalties. A higher C value forces stricter penalty for misclassifications.
• Hinge Loss: A loss function penalizing misclassified points or margin violations and is
combined with regularization in SVM.
• Dual Problem: Involves solving for Lagrange multipliers associated with support
vectors, facilitating the kernel trick and efficient computation.
Working of Support Vector Machine Algorithm
The key idea behind the SVM algorithm is to find the hyperplane that best separates two classes
by maximizing the margin between them. This margin is the distance from the hyperplane to
the nearest data points (support vectors) on each side.
(Figure: Multiple possible hyperplanes separating the data of the two classes)
The best hyperplane, also known as the "hard margin" hyperplane, is the one that maximizes the distance between the hyperplane and the nearest data points from both classes. This ensures a clear separation between the classes. So, from the above figure, we choose L2 as the hard-margin hyperplane. Let's consider a scenario like the one shown below:
(Figure: Selecting a hyperplane for data with an outlier)
Here, we have one blue ball inside the region of the red balls.
How does SVM classify the data?
The blue ball inside the region of the red ones is an outlier of the blue class. The SVM algorithm has the ability to ignore such an outlier and still find the hyperplane that maximizes the margin, so SVM is robust to outliers.
(Figure: The most optimized hyperplane)
A soft margin allows for some misclassifications or violations of the margin to improve generalization. The SVM optimizes the following objective to balance margin maximization and penalty minimization:
minimize (1/2)‖w‖² + C Σᵢ ζᵢ, subject to yᵢ(w·xᵢ + b) ≥ 1 − ζᵢ and ζᵢ ≥ 0
where ζᵢ are the slack variables that measure each violation and C is the regularization parameter.
The penalty used for violations is often the hinge loss, max(0, 1 − y·(w·x + b)), which has the following behaviour:
• If a data point is correctly classified and lies outside the margin, there is no penalty (loss = 0).
• If a point is misclassified or lies inside the margin, the hinge loss increases proportionally to the distance of the violation.
Till now, we were talking about linearly separable data, where the group of blue balls and the group of red balls can be separated by a straight line (a linear boundary).
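The following is a minimal soft-margin SVM sketch using scikit-learn's SVC; the toy 2-D points are invented, and C = 1.0 is just an illustrative choice of the regularization parameter.

```python
# Soft-margin linear SVM sketch on invented 2-D points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# kernel="linear" learns a hyperplane w.x + b = 0; C trades off margin width
# against misclassification penalties (higher C = stricter penalty).
svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

print("Support vectors:\n", svm.support_vectors_)
print("Prediction for [5, 5]:", svm.predict([[5, 5]]))
```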
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:
• Linear SVM: Linear SVMs use a linear decision boundary to separate the data points
of different classes. When the data can be precisely linearly separated, linear SVMs are
very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A
hyperplane that maximizes the margin between the classes is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature space
where the data points can be linearly separated. A linear SVM is used to locate a
nonlinear decision boundary in this modified space.
Advantages of Support Vector Machine (SVM)
1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making it
suitable for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers, enhancing
robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary and multiclass classification, making it suitable for applications such as text classification.
5. Memory Efficiency: It focuses on support vectors making it memory efficient
compared to other algorithms.
Disadvantages of Support Vector Machine (SVM)
1. Slow Training: SVM can be slow to train on large datasets, affecting performance in data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters like
C requires careful tuning, impacting SVM algorithms.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes, limiting
effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions
makes SVM less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential, otherwise SVM models
may perform poorly.
Decision Tree
A decision tree is a graphical representation of different options for solving a problem and shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called the root node, which further branches out into different possible outcomes, where:
• Root Node: The starting point that represents the entire dataset.
• Branches: The lines that connect nodes, showing the flow from one decision to another.
• Internal Nodes: Points where decisions are made based on the input features.
• Leaf Nodes: The terminal nodes at the end of branches that represent final outcomes or predictions.
(Figure: Decision tree structure)
They also support decision-making by visualizing outcomes. You can quickly evaluate and
compare the "branches" to determine which course of action is best for you.
Now, let's take an example to understand the decision tree. Imagine you want to decide whether to drink coffee based on the time of day and how tired you feel. First the tree checks the time of day: if it's morning, it asks whether you are tired. If you're tired, the tree suggests drinking coffee; if not, it says there's no need. Similarly, in the afternoon the tree again asks if you are tired. If you are, it recommends drinking coffee; if not, it concludes no coffee is needed.
Classification of Decision Tree
We have mainly two types of decision tree based on the nature of the target
variable: classification trees and regression trees.
• Classification trees: They are designed to predict categorical outcomes, meaning they classify data into different classes. They can determine whether an email is "spam" or
"not spam" based on various features of the email.
• Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features.
Working of Decision Tree
The working of a decision tree starts with a main question known as the root node. This question is derived from the features of the dataset and serves as the starting point for decision-making.
From the root node, the tree asks a series of yes/no questions. Each question is designed to split
the data into subsets based on specific attributes. For example, if the first question is "Is it
raining?", the answer will determine which branch of the tree to follow. Depending on the
response to each question, you follow different branches. If your answer is "Yes," you might proceed down one path; if "No," you will take another path.
This branching continues through a sequence of decisions. As you follow each branch, you get
more questions that break the data into smaller groups. This step-by-step process continues
until you have no more helpful questions.
You reach the end of a branch, where you find the final outcome or decision. It could be a classification (like "spam" or "not spam") or a prediction (such as an estimated price).
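A small sketch of a classification tree on the coffee example above is given below, using scikit-learn; the binary encoding of "morning" and "tired" is an assumption made for illustration.

```python
# Decision-tree sketch for the coffee example; 1 = yes, 0 = no (an assumed encoding).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [is_morning, is_tired] -> label: drink coffee (1) or not (0)
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["is_morning", "is_tired"]))
print(tree.predict([[1, 1]]))  # morning and tired -> expected to suggest coffee (1)
```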
Advantages of Decision Trees
• Simplicity and Interpretability: Decision trees are straightforward and easy to
understand. You can visualize them like a flowchart which makes it simple to see how
decisions are made.
• Versatility: They can be used for different types of tasks and work well for both classification and regression.
• No Need for Feature Scaling: They don’t require you to normalize or scale your data.
• Handles Non-linear Relationships: It is capable of capturing non-linear relationships
between features and target variables.
Disadvantages of Decision Trees
• Overfitting: Overfitting occurs when a decision tree captures noise and details in the training data, so it performs poorly on new data.
• Instability: The model can be unreliable; slight variations in the input data can lead to significant differences in the tree and its predictions.
• Bias towards Features with More Levels: Decision trees can become biased towards features with many categories, focusing too much on them during decision-making. This can cause the model to miss other important features, leading to less accurate predictions.
Naive Bayes Algorithm
The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem. The
algorithm assumes that the features are independent of each other, which is why it is called
"naive." It calculates the probability of a sample belonging to a particular class based on the
probabilities of its features. For example, a phone may be considered smart if it has a touch screen, internet facility, a good camera, etc. Even though these features may depend on each other in reality, the algorithm treats each of them as contributing independently to the probability that the phone is a smartphone.
In Bayesian classification, the main interest is to find the posterior probabilities, i.e. the probability of a label given some observed features, P(L | features). With the help of Bayes' theorem, we can express this in quantitative form as follows:
P(L | features) = P(features | L) · P(L) / P(features)
In the Naive Bayes algorithm, we use Bayes' theorem to calculate the probability of a sample
belonging to a particular class. We calculate the probability of each feature of the sample given
the class and multiply them to get the likelihood of the sample belonging to the class. We then
multiply the likelihood with the prior probability of the class to get the posterior probability of
the sample belonging to the class. We repeat this process for each class and choose the class
with the highest probability as the class of the sample.
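The short sketch below mirrors this procedure by hand; the priors and the per-feature conditional probabilities are made-up numbers chosen only to illustrate the calculation.

```python
# Hand-rolled Naive Bayes step for one sample; all probabilities are invented.
priors = {"spam": 0.4, "not_spam": 0.6}
# P(feature = 1 | class), treated as independent given the class (the "naive" assumption)
likelihoods = {
    "spam":     {"has_link": 0.7, "has_offer_word": 0.8},
    "not_spam": {"has_link": 0.2, "has_offer_word": 0.1},
}

sample = {"has_link": 1, "has_offer_word": 1}

scores = {}
for cls in priors:
    p = priors[cls]                       # start from the prior P(class)
    for feat, value in sample.items():
        p_feat = likelihoods[cls][feat]
        p *= p_feat if value == 1 else (1 - p_feat)
    scores[cls] = p                       # proportional to the posterior (denominator omitted)

print(max(scores, key=scores.get), scores)  # the class with the highest score wins
```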
Types of Naive Bayes Algorithm
There are many types of Naive Bayes Algorithm. Here we discuss the following three types −
Gaussian Naive Bayes
Gaussian Naive Bayes is the simplest Naive Bayes classifier, with the assumption that the data from each label is drawn from a simple Gaussian distribution. It is used when the features are continuous variables that follow a normal distribution.
Multinomial Naive Bayes
Multinomial Naive Bayes is used when features represent the frequency of terms (such as word
counts) in a document. It is commonly applied in text classification, where term frequencies
are important.
Bernoulli Naive Bayes
Bernoulli Naive Bayes deals with binary features, where each feature indicates whether a word
appears or not in a document. It is suited for scenarios where the presence or absence of terms
is more relevant than their frequency. Both the Multinomial and Bernoulli models are widely used in document classification tasks.
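As a hedged illustration of this text-classification use case, the sketch below fits a Multinomial Naive Bayes model on a few invented messages, using scikit-learn's CountVectorizer to build the word-count features.

```python
# Multinomial Naive Bayes on word counts; the messages and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "free offer click now",
         "meeting at noon", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)              # term-frequency features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["free prize offer"])))  # likely classified as spam
```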
Pros
The following are some pros of using Naive Bayes classifiers −
• Naive Bayes classification is easy to implement and fast.
• It will converge faster than discriminative models like logistic regression.
• It requires less training data.
• It is highly scalable in nature; it scales linearly with the number of predictors and data points.
• It can make probabilistic predictions and can handle continuous as well as discrete data.
• The Naive Bayes classification algorithm can be used for both binary and multi-class classification problems.
Cons
The following are some cons of using Naive Bayes classifiers −
• One of the most important cons of Naive Bayes classification is its strong assumption of feature independence, because in real life it is almost impossible to have a set of features that are completely independent of each other.
• Another issue with Naive Bayes classification is the 'zero frequency' problem: if a categorical variable has a category that was not observed in the training data set, the Naive Bayes model will assign a zero probability to it and will be unable to make a prediction.
Applications of Naive Bayes classification
The following are some common applications of Naive Bayes classification −
Real-time prediction − Due to its ease of implementation and fast computation, it can be used to make predictions in real time.
Multi-class prediction − The Naive Bayes classification algorithm can be used to predict the posterior probability of multiple classes of the target variable.
Text classification − Due to its multi-class prediction capability, Naive Bayes classification algorithms are well suited for text classification. That is why they are also used to solve problems like spam filtering and sentiment analysis.
Recommendation system − Along with algorithms like collaborative filtering, Naive Bayes can be used to build a recommendation system that filters unseen information and predicts whether a user would like a given resource or not.
Kernel Function
A kernel function is a method used to take data as input and transform it into the required form for processing. Different algorithms use different types of kernel functions; for example, Linear, Polynomial, Gaussian, etc. As a simple illustration, we can define a kernel function as:
K(x) = 1 if ‖x‖ ≤ 1, and K(x) = 0 otherwise
This function is 1 inside a closed ball of radius 1 centered at the origin and 0 outside. It works like a switch: on (1) inside the ball and off (0) outside, just like shown in the figure.
Types of Kernels used in SVM
Here are some common types of kernels used by SVM. Let's understand them one by one:
1. Linear Kernel
• A linear kernel is the simplest form of kernel used in SVM. It is suitable when the data
is linearly separable meaning that a straight line (or hyperplane in higher dimensions)
can effectively separate the classes.
• It is represented as: K(x, y) = x · y
• It is used for text classification problems such as spam detection.
2. Polynomial Kernel
• The polynomial kernel allows SVM to model more complex relationships by introducing polynomial terms. It is useful when the data is not linearly separable but still follows a pattern. The formula of the polynomial kernel is: K(x, y) = (x · y + c)^d
• where c is a constant and d is the polynomial degree.
• It is used in complex problems like image recognition, where relationships between features can be non-linear (see the sketch after this list).
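The sketch below simply writes out the linear and polynomial kernel formulas with NumPy; the input vectors and the constants c and d are arbitrary illustrative values.

```python
# Linear and polynomial kernels written out with NumPy; x, y, c and d are arbitrary.
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                   # K(x, y) = x . y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d        # K(x, y) = (x . y + c)^d

x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(linear_kernel(x, y), polynomial_kernel(x, y))  # 11.0 and (11 + 1)^3 = 1728.0
```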
Choosing the Right Kernel for SVM
Picking the right kernel for an SVM (Support Vector Machine) model is very important because
it affects how well the model works. Here’s a simple guide to help you choose the right kernel:
1. What the Data Looks Like:
• If the data can be separated by a straight line we use a linear kernel.
• If the data is messy and needs a more complex boundary use a non-linear
kernel like RBF (Radial Basis Function) or polynomial kernels.
2. How Fast You Need the Model:
• Linear kernels are faster and use less computer power.
• Non-linear kernels like RBF take more time and resources.
3. How Easy It Is to Understand the Model:
• Linear kernels are easier to understand because the boundary is simple.
• Non-linear kernels create complex boundaries and make the model harder to
understand.
4. Tuning the Model:
• Each kernel has special settings called hyperparameters that you can adjust to
get the best performance.
• You will need to try different combinations of these settings using cross-
validation to find the best one.
Real World Applications of SVM Kernels
• Linear kernels are commonly used in credit scoring and fraud detection models
because they are fast, easy to implement and produce interpretable results.
• Polynomial kernels are frequently applied in image classification tasks to identify
objects or patterns in images. They help capture the complex relationships between
pixel features, making them suitable for tasks like facial recognition or object detection.
• In text analysis, such as sentiment analysis (classifying text as positive, negative, or neutral), SVMs with various kernels can handle different types of text data; non-linear kernels, especially RBF, help capture the more subtle patterns in such data.
• In healthcare, SVM kernels are used to diagnose diseases, predict patient outcomes and identify patterns in medical data.
Perceptron
A perceptron is a type of neural network that performs binary classification: it maps input features to an output decision, usually classifying data into one of two categories, such as 0 or 1.
Perceptron consists of a single layer of input nodes that are fully connected to a layer of output
nodes. It is particularly good at learning linearly separable patterns. It utilizes a variation of
artificial neurons called Threshold Logic Units (TLU), which were first introduced by
Warren McCulloch and Walter Pitts in the 1940s. This foundational model has played a crucial role in
the development of more advanced neural networks and machine learning algorithms.
Types of Perceptron
Single-Layer Perceptron: It is one of the oldest and first-introduced neural networks. It was proposed by Frank Rosenblatt in 1958. The perceptron is the simplest form of an artificial neural network. It is mainly used to compute logical gates like AND, OR and NOR, which have binary inputs and binary outputs.
The main functionality of the perceptron is to:
• Take inputs from the input layer.
• Weight the inputs and sum them up.
• Pass the sum through a nonlinear (activation) function to produce the output.
Here the activation function can be anything like sigmoid, tanh or ReLU; based on the requirement, we choose the most appropriate nonlinear activation function to produce a better result. Now let us implement a single-layer perceptron.
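A minimal sketch of such an implementation is given below, assuming NumPy is available; it trains on the AND gate using the perceptron learning rule with a step (Heaviside) activation, and the learning rate and epoch count are arbitrary choices.

```python
# Single-layer perceptron trained on the AND gate (learning rate and epochs are arbitrary).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # binary inputs
y = np.array([0, 0, 0, 1])                        # AND outputs

w, b, lr = np.zeros(2), 0.0, 0.1

def step(z):                                      # Heaviside step activation
    return 1 if z >= 0 else 0

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = step(np.dot(w, xi) + b)            # weighted sum + step function
        error = target - pred
        w += lr * error * xi                      # perceptron learning rule
        b += lr * error

print([step(np.dot(w, xi) + b) for xi in X])      # should reproduce the AND outputs [0, 0, 0, 1]
```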
Multi-Layer Perceptron: Multi-Layer Perceptron (MLP) consists of fully
connected dense layers that transform input data from one dimension to another. It is
called multi-layer because it contains an input layer, one or more hidden layers and an
output layer. The purpose of an MLP is to model complex relationships between inputs
and outputs.
Components of Multi-Layer Perceptron (MLP)
Input Layer: Each neuron or node in this layer corresponds to an input feature. For
instance, if you have three input features the input layer will have three neurons.
Hidden Layers: MLP can have any number of hidden layers with each layer containing
any number of nodes. These layers process the information received from the input
layer.
Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.
Every connection in an MLP diagram represents the fully connected nature of the network. This means that every node in one layer connects to every node in the next layer. As the data moves through the network, each layer transforms it until the final output is generated in the output layer.
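As a brief illustration, the sketch below fits scikit-learn's MLPClassifier with one small hidden layer on the XOR pattern, which a single-layer perceptron cannot learn; the layer size, activation, solver and random seed are illustrative assumptions.

```python
# MLP with one hidden layer learning XOR, which a single perceptron cannot represent.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                                  # XOR targets

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=1000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))   # ideally [0 1 1 0]; such a tiny net may need another seed or more units
```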
Basic Components of Perceptron
A Perceptron is composed of key components that work together to process information and
make predictions.
• Input Features: The perceptron takes multiple input features, each representing a
characteristic of the input data.
• Weights: Each input feature is assigned a weight that determines its influence on the
output. These weights are adjusted during training to find the optimal values.
• Summation Function: The perceptron calculates the weighted sum of its inputs,
combining them with their respective weights.
• Activation Function: The weighted sum is passed through the Heaviside step
function, comparing it to a threshold to produce a binary output (0 or 1).
• Output: The final output is determined by the activation function, often used
for binary classification tasks.
• Bias: The bias term helps the perceptron make adjustments independent of the input,
improving its flexibility in learning.
• Learning Algorithm: The perceptron adjusts its weights and bias using a learning
algorithm, such as the Perceptron Learning Rule, to minimize prediction errors.
These components enable the perceptron to learn from data and make predictions. While a single perceptron can handle simple binary classification, complex tasks require multiple perceptrons organized into layers, forming a neural network.
Working of Perceptron
A weight is assigned to each input node of a perceptron, indicating the importance of that input
in determining the output. The Perceptron’s output is calculated as a weighted sum of the
inputs, which is then passed through an activation function to decide whether the Perceptron
will fire. The weighted sum is computed as:
z = w1·x1 + w2·x2 + ... + wn·xn + b
The step function compares this weighted sum to a threshold. If the sum is larger than the threshold value, the output is 1; otherwise, it is 0. The most common activation function used in the perceptron is the Heaviside step function:
h(z) = 1 if z ≥ 0, and h(z) = 0 otherwise
A perceptron consists of a single layer of Threshold Logic Units (TLU), with each TLU fully
connected to all input nodes.
In a fully connected layer, also known as a dense layer, all neurons in one layer are connected
to every neuron in the previous layer.
The output of the fully connected layer is computed as:
f(X) = h(X·W + b)
where X is the input, W is the weight matrix of the input neurons, b is the bias, and h is the step function.
During training, the Perceptron's weights are adjusted to minimize the difference between the
predicted output and the actual output. This is achieved using supervised learning algorithms
like the delta rule or the Perceptron learning rule.
The weight update formula (the Perceptron learning rule) is:
wi = wi + η (y − ŷ) xi
where η is the learning rate, y is the actual label and ŷ is the predicted output.
Neural networks
Neural networks are machine learning models that mimic the complex functions of the human
brain. These models consist of interconnected nodes or neurons that process data, learn
patterns, and enable tasks such as pattern recognition and decision-making.
In this section, we will explore the fundamentals of neural networks, their architecture, how they work, and their applications in various fields. Understanding neural networks is essential for anyone interested in the advancements of artificial intelligence.
Understanding Neural Networks in Deep Learning
Neural networks are capable of learning and identifying patterns directly from data without
pre-defined rules. These networks are built from several key components:
1. Neurons: The basic units that receive inputs, each neuron is governed by a threshold
and an activation function.
2. Connections: Links between neurons that carry information, regulated by weights and
biases.
3. Weights and Biases: These parameters determine the strength and influence of
connections.
4. Propagation Functions: Mechanisms that help process and transfer data across layers
of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve
accuracy.
Learning in neural networks follows a structured, three-stage process:
1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network generates an output.
3. Iterative Refinement: The network refines its output by adjusting weights and biases,
gradually improving its performance on diverse tasks.
In an adaptive learning environment:
• The neural network is exposed to a simulated scenario or dataset.
• Parameters such as weights and biases are updated in response to new data or
conditions.
• With each adjustment, the network’s response evolves, allowing it to adapt effectively
to different tasks or environments.
Importance of Neural Networks
Neural networks are pivotal in identifying complex patterns, solving intricate challenges, and
adapting to dynamic environments. Their ability to learn from vast amounts of data is
transformative, impacting technologies like natural language processing, self-driving
vehicles, and automated decision-making.
Neural networks streamline processes, increase efficiency, and support decision-making across
various industries. As a backbone of artificial intelligence, they continue to drive innovation,
shaping the future of technology.
Layers in Neural Network Architecture
1. Input Layer: This is where the network receives its input data. Each input neuron in
the layer corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy lifting. A
neural network can have one or multiple hidden layers. Each layer consists of units
(neurons) that transform the inputs into something that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The format of these
outputs varies depending on the specific task (e.g., classification, regression).
Working of Neural Networks
Forward Propagation
When data is input into the network, it passes through the network in the forward direction,
from the input layer through the hidden layers to the output layer.
This process is known as forward propagation. Here’s what happens during this phase:
1. Linear Transformation: Each neuron in a layer receives inputs, which are multiplied
by the weights associated with the connections. These products are summed together,
and a bias is added to the sum. This can be represented mathematically as:
z = w · x + b
where w represents the weights, x represents the inputs, and b is the bias.
2. Activation: The result of the linear transformation (denoted as z) is then passed through
an activation function. The activation function is crucial because it introduces non-
linearity into the system, enabling the network to learn more complex patterns. Popular
activation functions include ReLU, sigmoid, and tanh.
Backpropagation
After forward propagation, the network evaluates its performance using a loss function, which
measures the difference between the actual output and the predicted output. The goal of training
is to minimize this loss. This is where backpropagation comes into play:
1. Loss Calculation: The network calculates the loss, which provides a measure of error
in the predictions. The loss function could vary; common choices are mean squared
error for regression tasks or cross-entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss function with
respect to each weight and bias in the network. This involves applying the chain rule of
calculus to find out how much each part of the output error can be attributed to each
weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases are updated
using an optimization algorithm like stochastic gradient descent (SGD). The weights
are adjusted in the opposite direction of the gradient to minimize the loss. The size of
the step taken in each update is determined by the learning rate.
Iteration
This process of forward propagation, loss calculation, backpropagation, and weight update is repeated for many iterations over the dataset. Over time, this iterative process reduces the loss, and the network's predictions become more accurate.
Through these steps, neural networks can adapt their parameters to better approximate the
relationships in the data, thereby improving their performance on tasks such as classification,
regression, or any other predictive modelling.
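The compact NumPy sketch below ties these steps together for a single sigmoid neuron: a forward pass, a mean-squared-error loss, gradients via the chain rule, and a gradient-descent update. The toy data, learning rate and epoch count are invented for illustration.

```python
# Forward pass, MSE loss, chain-rule gradients and gradient-descent updates
# for a single sigmoid neuron on an invented OR-like dataset.
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

rng = np.random.default_rng(0)
w, b, lr = rng.normal(size=2), 0.0, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):
    z = X @ w + b                             # forward: linear transformation
    a = sigmoid(z)                            # forward: activation
    loss = np.mean((a - y) ** 2)              # loss: mean squared error

    dz = 2 * (a - y) * a * (1 - a) / len(y)   # backward: chain rule dL/dz
    dw = X.T @ dz                             # gradients w.r.t. the weights
    db = dz.sum()                             # gradient w.r.t. the bias

    w -= lr * dw                              # weight update (gradient descent)
    b -= lr * db

print(np.round(a, 2))                         # predictions drift toward the targets [0, 1, 1, 1]
```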
Back Propagation
Back Propagation, also known as "Backward Propagation of Errors", is a method used to train neural networks. Its goal is to reduce the difference between the model's predicted output and the actual output by adjusting the weights and biases in the network.
It works iteratively to adjust weights and bias to minimize the cost function. In each epoch the
model adapts these parameters by reducing loss by following the error gradient. It often uses
optimization algorithms like gradient descent or stochastic gradient descent. The algorithm
computes the gradient using the chain rule from calculus allowing it to effectively navigate
complex layers in the neural network to minimize the cost function.
(Fig. (a): A simple illustration of how backpropagation works by adjusting the weights)
Back Propagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to
each weight using the chain rule making it possible to update weights efficiently.
2. Scalability: The Back Propagation algorithm scales well to networks with multiple
layers and complex architectures making deep learning feasible.
3. Automated Learning: With Back Propagation the learning process becomes
automated and the model can adjust itself to optimize its performance.
Working of Back Propagation Algorithm
The Back Propagation algorithm involves two main steps: the Forward Pass and
the Backward Pass.
1. Forward Pass
In forward pass the input data is fed into the input layer. These inputs combined with their
respective weights are passed to hidden layers. For example, in a network with two hidden
layers (h1 and h2) the output from h1 serves as the input to h2. Before applying an activation
function, a bias is added to the weighted inputs.
Each hidden layer computes the weighted sum (a) of the inputs and then applies an activation function like ReLU (Rectified Linear Unit) to obtain the output (o). The output is passed to
the next layer where an activation function such as softmax converts the weighted outputs into
probabilities for classification.
(Figure: The forward pass using weights and biases)
2. Backward Pass
In the backward pass the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method
for error calculation is the Mean Squared Error (MSE), given by:
MSE = (1/n) Σ (Predicted Output − Actual Output)²
Once the error is calculated the network adjusts weights using gradients which are computed
with the chain rule. These gradients indicate how much each weight and bias should be adjusted
to minimize the error in the next iteration. The backward pass continues layer by layer ensuring
that the network learns and improves its performance. The activation function through its
derivative plays a crucial role in computing these gradients during Back Propagation.
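A tiny numeric illustration of this backward pass for a single weight is sketched below; the input, weight, bias, target and learning rate are made-up values, and a sigmoid activation with a squared error is assumed.

```python
# Gradient of a squared error through one sigmoid neuron, by the chain rule.
import math

x, w, b, target, lr = 2.0, 0.5, 0.1, 1.0, 0.1     # invented values

z = w * x + b                        # forward: weighted sum
a = 1 / (1 + math.exp(-z))           # forward: sigmoid activation
error = (a - target) ** 2            # squared error for this sample

# chain rule: dE/dw = dE/da * da/dz * dz/dw
dE_da = 2 * (a - target)
da_dz = a * (1 - a)
dz_dw = x
grad_w = dE_da * da_dz * dz_dw

w_new = w - lr * grad_w              # one gradient-descent step on this weight
print(round(grad_w, 4), round(w_new, 4))
```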
Advantages:
• Simplicity and Intuitive Implementation:
Backpropagation is relatively straightforward to understand and implement, making it
accessible for many machine learning tasks.
• Multi-layer Network Training:
It excels at training networks with multiple layers, enabling the modelling of complex
relationships in data.
• No Hyperparameter Tuning (in some cases):
It can be used without needing to tune many parameters beyond the input data, simplifying the
initial setup.
• Well-suited for large datasets:
It's often used in situations that require a lot of training data and can effectively learn from large
datasets.
• Flexibility:
Backpropagation doesn't require prior knowledge about the network architecture.
Disadvantages:
• Sensitivity to Data Noise:
The algorithm can be easily affected by noisy or irregular input data, leading to suboptimal
performance.
• Slow Training:
Training can be computationally expensive and time-consuming, especially for large networks
and datasets.
• Local Minima Problem:
Backpropagation can get stuck in local minima, which are suboptimal solutions within the error
landscape, hindering the search for the global minimum.
• Vanishing/Exploding Gradients:
In deep networks, gradients can either become extremely small (vanishing) or extremely large
(exploding), affecting the learning process.
• Overfitting:
Backpropagation can sometimes overfit the training data, leading to poor generalization to
unseen data.
• Dependency on Input Data:
The performance of backpropagation is heavily influenced by the quality and characteristics of
the input data.