CLASSIFICATION
Instructor
Dr. Prashant Srivastava
INTRODUCTION
Classification is a type of supervised learning
where a target feature, which is of categorical
type, is predicted for the test data on the basis
of information imparted by the training data.
The target categorical feature is known as the
class.
CLASSIFICATION
LEARNING STEPS
COMMON CLASSIFICATION
ALGORITHMS
k-nearest neighbour
Logistic regression
Decision tree
Random forest
Support Vector Machine
Naïve Bayes classifier
k-NEAREST NEIGHBOUR
A simple but extremely powerful classification
algorithm.
k-NEAREST NEIGHBOUR
The unknown, unlabelled data element that comes
for a prediction problem is judged on the basis
of the training dataset elements that are
similar to it.
The class label of the unknown element is
assigned on the basis of the class labels of these
similar training dataset elements.
k-NEAREST NEIGHBOUR
Consider a Student dataset.
k-NEAREST NEIGHBOUR
1. Leader- students having good communication
skills as well as a good level of aptitude.
2. Speaker- students having good communication
skills but not so good a level of aptitude.
3. Intel- students having not so good
communication skills but a good level of
aptitude.
k-NEAREST NEIGHBOUR
Performance of the classification model is
measured by the number of correct
classifications made by the model when applied
to an unknown dataset.
If the class value predicted for most of the test
data elements matches the actual class value
they have, then we say that the classification
model has good accuracy.
k-NEAREST NEIGHBOUR
Two challenges-
1. What is the basis of similarity?
2. How many similar elements should be
considered for deciding the class label of each
test data element?
k-NEAREST NEIGHBOUR
Measures of similarity.
The most common approach adopted by kNN to
measure similarity between two data elements
is Euclidean distance.
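As a small illustration, the Euclidean distance between two data elements can be computed as in the Python sketch below (a minimal example; the two hypothetical students and their feature values are made up purely for illustration):

import math

def euclidean_distance(a, b):
    # Square root of the sum of squared differences across all features
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical students described by (communication score, aptitude score)
student_1 = (7.0, 8.5)
student_2 = (6.0, 4.0)
print(euclidean_distance(student_1, student_2))   # about 4.61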
k-NEAREST NEIGHBOUR
To find the closest or nearest neighbours of a
test data point, the Euclidean distance of the
different training data points from the test data
point needs to be calculated.
Then, the class values of the closest
neighbours help in assigning the class
value of the test data element.
k-NEAREST NEIGHBOUR
How many similar elements should be
considered?
The answer lies in the value of ‘k’ which is a
user-defined parameter given as input to the
algorithm.
In the kNN algorithm, the value of ‘k’ indicates the
number of neighbours that need to be
considered.
k-NEAREST NEIGHBOUR
For example, if the value of k is 3, only the three
nearest neighbours, i.e. the three training data
elements closest to the test data element, are
considered.
Out of these three data elements, the class which
is predominant is assigned as the class label
of the test data.
k-NEAREST NEIGHBOUR
Choosing value of ‘k’
If the value of k is very large, the class label
of the majority class of the training data set will
be assigned to the test data regardless of the
class labels of the neighbours nearest to the
test data.
If the value of k is very small, the class
value of a noisy data point or outlier in the training
dataset which happens to be the nearest neighbour to
the test data will be assigned to the test data.
k-NEAREST NEIGHBOUR
Then, what should be ‘k’?
The best value of ‘k’ is somewhere between
these two extremes.
k-NEAREST NEIGHBOUR
1. One common practice- set k equal to the
square root of the number of training records.
2. Test several ‘k’ values on a variety of test
datasets and choose the one that delivers the
best performance.
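A minimal sketch of the square-root rule of thumb (assuming a training set of n records; preferring an odd k is a common extra heuristic to avoid ties in a two-class problem):

import math

n_training_records = 150
k = int(round(math.sqrt(n_training_records)))   # about 12 for 150 records
if k % 2 == 0:
    k += 1                                      # prefer an odd k to avoid ties
print(k)                                        # 13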
k-NEAREST NEIGHBOUR
Algorithm-
1. Predict a class value for new data: calculate
distance(X, Xi) for i = 1, 2, 3, …, n,
where X = new data point, Xi = training data, and
distance is computed as per your chosen distance metric.
2. Sort these distances in increasing order along with
the corresponding training data.
3. From this sorted list, select the top ‘K’ rows.
4. Find the most frequent class among these chosen
‘K’ rows. This will be your predicted class.
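The four steps above can be put together in a short Python sketch (a minimal illustration, assuming numeric features; the function name knn_predict and the tiny training set of student scores are made up for this example):

import math
from collections import Counter

def knn_predict(X_new, training_data, k):
    # Step 1: distance of X_new from every training element Xi
    distances = [(math.dist(X_new, Xi), label) for Xi, label in training_data]
    # Step 2: sort distances in increasing order (labels travel with them)
    distances.sort(key=lambda pair: pair[0])
    # Step 3: keep the top k rows
    top_k = distances[:k]
    # Step 4: most frequent class among the k neighbours is the prediction
    return Counter(label for _, label in top_k).most_common(1)[0][0]

# Hypothetical (communication, aptitude) scores with class labels
training_data = [((8, 9), "Leader"), ((7, 3), "Speaker"),
                 ((2, 8), "Intel"),  ((9, 8), "Leader"), ((6, 2), "Speaker")]
print(knn_predict((7, 8), training_data, k=3))   # expected: "Leader"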
k-NEAREST NEIGHBOUR
Strengths-
1. Simple and easy to understand.
2. Effective in certain situations.
3. Almost no time required for training phase.
k-NEAREST NEIGHBOUR
Weaknesses-
1. Does not learn anything in real sense.
2. Classification process is very slow.
3. Large amount of computational space is
required.
k-NEAREST NEIGHBOUR
k-nearest neighbour is a ‘lazy learner’.
What is lazy learning?
k-NEAREST NEIGHBOUR
Lazy learning is a type of machine learning that
doesn't process training data until it needs to
make a prediction.
Instead of building models during training, lazy
learning algorithms wait until they encounter a
new query.
k-NEAREST NEIGHBOUR
This method stores and compares training
examples when making predictions.
It's also called instance-based or memory-based
learning.
k-NEAREST NEIGHBOUR
Lazy learning algorithms work by memorizing
the training data rather than constructing a
general model.
When a new query is received, lazy learning
retrieves similar instances from the training set
and uses them to generate a prediction.
The similarity between instances is usually
calculated using distance metrics.
LOGISTIC REGRESSION
A type of regression analysis used for predicting
the outcome of a categorical variable.
In logistic regression, the dependent variable (Y) is
binary (0, 1) and the independent variables (X) are
continuous in nature.
LOGISTIC REGRESSION
The goal of logistic regression is to predict the
likelihood that Y = 1 given a certain value of X.
If X and Y have a strong positive linear
relationship, the probability that a person will
have a score of Y = 1 will increase as the values
of X increase.
So, we are predicting probabilities rather than
the scores of the dependent variable.
LOGISTIC REGRESSION
It’s essential to emphasize that logistic
regression is not just a classification algorithm;
it’s a method for estimating probabilities.
Logistic regression is a powerful classification
technique which estimates the likelihood of an
input belonging to a particular class.
This estimation is inherently a probability
prediction, which must be converted into binary
values (0 or 1) to make class predictions.
HOW LOGISTIC
REGRESSION WORKS?
The Logistic Regression algorithm works by
implementing a linear equation with
independent or explanatory variables to predict
a response value.
For example, consider the number of hours
studied and the probability of passing the exam.
Here, the number of hours studied is the
explanatory variable and it is denoted by x1.
The probability of passing the exam is the response
or target variable and it is denoted by z.
HOW LOGISTIC
REGRESSION WORKS?
If we have one explanatory variable (x) and one
response variable (z), then the linear equation
would be given mathematically with the
following equation-
z = a + bx
SIGMOID FUNCTION
This predicted response value, denoted by z, is
then converted into a probability value that lies
between 0 and 1.
We use the sigmoid function in order to map
predicted values to probability values.
The sigmoid function maps any real value
into a probability value between 0 and 1.
SIGMOID FUNCTION
The sigmoid function is used to map predictions to
probabilities. The sigmoid function has an S-shaped
curve.
It is also called the sigmoid curve.
The sigmoid function is a special case of the
logistic function.
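Putting the two pieces together, a minimal Python sketch of mapping the linear response z = a + bx to a probability with the sigmoid function sigmoid(z) = 1 / (1 + e^(-z)); the coefficient values a and b below are made up purely for illustration:

import math

def sigmoid(z):
    # Maps any real value z into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

a, b = -4.0, 1.5          # hypothetical intercept and slope
hours_studied = 4.0       # explanatory variable x1
z = a + b * hours_studied # linear response value
p = sigmoid(z)            # probability of passing the exam
print(round(p, 3))        # 0.881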
DECISION BOUNDARY
The sigmoid function returns a probability value
between 0 and 1.
This probability value is then mapped to a
discrete class which is either “0” or “1”.
DECISION BOUNDARY
In order to map this probability value to a
discrete class (pass/fail, yes/no, true/false), we
select a threshold value.
This threshold value is called the Decision
boundary.
Above this threshold value, we will map the
probability values into class 1 and below which
we will map values into class 0.
DECISION BOUNDARY
Mathematically, it can be expressed as follows:-
p ≥ 0.5 => class = 1
p < 0.5 => class = 0
DECISION BOUNDARY
Generally, the decision boundary is set to 0.5.
So, if the probability value is 0.8 (> 0.5), we will
map this observation to class 1.
Similarly, if the probability value is 0.2 (< 0.5),
we will map this observation to class 0.
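A minimal sketch of applying the 0.5 decision boundary to predicted probabilities (the probability values are made up for illustration):

def to_class(p, threshold=0.5):
    # Map a probability to class 1 if it is at or above the threshold, else class 0
    return 1 if p >= threshold else 0

print(to_class(0.8))   # 1 -> class 1 (e.g. "pass")
print(to_class(0.2))   # 0 -> class 0 (e.g. "fail")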
SUPPORT VECTOR
MACHINE
Support Vector Machine (SVM) is a model which
can perform linear classification as well as
regression.
SVM is based on the concept of a surface,
called a hyperplane, which draws a boundary
between data instances plotted in the
multidimensional feature space.
SUPPORT VECTOR
MACHINE
The output prediction of an SVM is one of the
two conceivable classes which are already
defined in the training data.
SVM algorithm builds an N-dimensional
hyperplane model that assigns future instances
into one of the two possible output classes.
SUPPORT VECTOR
MACHINE
The goal of the SVM algorithm is to create the
best line or decision boundary that can
segregate n-dimensional space into classes so
that we can easily put the new data point in the
correct category in the future.
This best decision boundary is called a
hyperplane.
SUPPORT VECTOR
MACHINE
There may be many possible hyperplanes.
One of the challenges with the SVM model is to
find the optimal hyperplane.
SUPPORT VECTOR
MACHINE
There can be multiple hyperplanes to segregate
the classes in n-dimensional space, but we
need to find out the best hyperplane that helps
to classify the data points.
We always create a hyperplane that has the
maximum margin, which means the maximum
distance between the hyperplane and the nearest
data points of the two classes.
SUPPORT VECTOR
MACHINE
The data points or vectors that are closest
to the hyperplane and which affect the position
of the hyperplane are termed Support
Vectors.
Since these vectors support the hyperplane,
they are called support vectors.
SUPPORT VECTOR
MACHINE
Hyperplane and Margin-
Mathematically, in a two-dimensional space, a
hyperplane can be defined by the equation
c0 + c1x1 + c2x2 = 0
SUPPORT VECTOR
MACHINE
Extending this concept to an N-dimensional
space, the hyperplane can be defined by the
equation
c0 + c1x1 + c2x2 + … + cNxN = 0
In short, it can be represented as follows-
c . X + c0 = 0, where c = (c1, c2, …, cN) is the
weight vector and X = (x1, x2, …, xN) is the
feature vector.
SUPPORT VECTOR
MACHINE
The farther the data points lie from the
hyperplane, the more confident we are about
correct categorization.
The distance between the hyperplane and the
closest data points is known as the margin.
SUPPORT VECTOR
MACHINE
Maximum Margin Hyperplane-
Refers to identifying the hyperplane which has
the largest separation from the data instances
of the two classes.
Though more than one hyperplane can do the
correct classification, why do we need to search
for the hyperplane causing the largest
separation?
SUPPORT VECTOR
MACHINE
The answer is that doing so helps us achieve
better generalization and hence fewer errors
in the classification of unknown data.
SUPPORT VECTOR
MACHINE
Identifying MMH for linearly separable data
An outer boundary needs to be drawn around the
data instances belonging to each class.
These outer boundaries are known as convex
hulls.
The MMH can be drawn as the perpendicular
bisector of the shortest line connecting the two
convex hulls.
SUPPORT VECTOR
MACHINE
Find a set of values for the vector c such that the
two hyperplanes, represented by the equations
below, can be specified-
c . X + c0 = +1
c . X + c0 = -1
SUPPORT VECTOR
MACHINE
All the instances that belong to one class fall
above one hyperplane and all the data
instances belonging to the other class fall below
the other hyperplane.
According to vector geometry, the distance
between these planes should be
2 / ||c||
SUPPORT VECTOR
MACHINE
In order to maximize the distance between the
hyperplanes, the norm of the vector c, i.e. ||c||,
should be minimised.
The task of SVM is to solve the optimization
problem-
minimise (1/2) ||c||^2 subject to
yi (c . Xi + c0) ≥ 1 for every training instance (Xi, yi)
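As an illustration of this formulation in practice, the sketch below fits a linear SVM with scikit-learn (assuming scikit-learn is available; the tiny two-class dataset is made up, and SVC with a linear kernel solves a soft-margin version of the margin-maximisation problem described above):

from sklearn.svm import SVC

# Hypothetical two-dimensional, linearly separable training data
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

model = SVC(kernel="linear", C=1.0)   # linear kernel -> maximum margin hyperplane
model.fit(X, y)

print(model.support_vectors_)          # the support vectors found
print(model.predict([[3, 2], [7, 6]])) # expected: [0 1]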
SUPPORT VECTOR
MACHINE
Strengths-
1. Can be used for both classification and
regression
2. Not much impacted by data with noise or
outlier
3. Prediction using this model is promising
SUPPORT VECTOR
MACHINE
Weaknesses-
1. Applicable only for binary classification
2. Slow for large datasets
3. Quite memory-intensive
DECISION TREE
One of the most widely adopted algorithms for
classification.
It builds the model in the form of a tree structure.
It is exceptionally productive in practice.
DECISION TREE
Used for multidimensional analysis with
multiple classes.
The goal of decision tree learning is to create a
model, based on the past data (called the past
vector), that predicts the value of the output variable
on the basis of the input variables in the feature vector.
DECISION TREE
Each node of a decision tree corresponds to one
of the features in the feature vector.
For every node there are edges to children,
wherein there is an edge for each of the
possible values (or range of values) of the
feature associated with the node.
DECISION TREE
The tree terminates at different leaf nodes
where each leaf node represents a possible
value for the output variable.
The output variable is determined by following
a path that starts at the root and is guided by
the values of the input variables.
BUILDING A DECISION
TREE
It starts from the root node, which is nothing
but the entire dataset.
It selects the feature which predicts the target
class in the strongest way.
BUILDING A DECISION
TREE
The decision tree splits the dataset into multiple
partitions, with data in each partition having a
distinct value for the feature based on which
the partitioning has happened.
This is the first set of branches.
BUILDING A DECISION
TREE
Likewise, the algorithm continues splitting the
nodes on the basis of the feature which gives
the best partition.
This continues till a stopping criterion is reached.
BUILDING A DECISION
TREE
The usual stopping criteria are-
1. All or most of the examples at a particular
node have the same class
2. All features have been used up in the
partitioning
3. The tree has grown to a pre-defined
threshold limit
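As an illustration, the sketch below builds a small decision tree with scikit-learn; the pre-defined depth limit plays the role of stopping criterion 3 (the tiny student dataset and the parameter values are made up purely for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical (communication, aptitude) scores and class labels
X = [[8, 9], [9, 8], [7, 3], [6, 2], [2, 8], [3, 9]]
y = ["Leader", "Leader", "Speaker", "Speaker", "Intel", "Intel"]

tree = DecisionTreeClassifier(criterion="entropy",  # split on information gain
                              max_depth=3)          # pre-defined threshold limit
tree.fit(X, y)

print(export_text(tree, feature_names=["communication", "aptitude"]))
print(tree.predict([[8, 8]]))   # expected: ['Leader']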
BUILDING A DECISION
TREE
There are many implementations of decision
tree.
The biggest challenge of a decision tree
algorithm is to find out which feature to split
upon.
BUILDING A DECISION
TREE
Data should be split in such a way that the
partitions created by the split contain
examples belonging to a single class.
If that happens, the partitions are considered
pure.
BUILDING A DECISION
TREE
Entropy is a measure of the impurity of a partition
of the dataset with respect to the class.
The information gain is calculated on the
basis of the decrease in entropy after the dataset (S)
is split according to a particular attribute (A).
BUILDING A DECISION
TREE
Constructing a tree is all about finding an
attribute that returns the highest
information gain.
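A minimal sketch of the two quantities (assuming the usual definitions: entropy of a set of class labels, H(S) = -sum of pi * log2(pi), and information gain as the parent entropy minus the weighted entropy of the partitions produced by splitting on an attribute; the labels below are made up for illustration):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    # Gain(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the partitions S_v
    total = len(parent_labels)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted

# Hypothetical split of 10 examples into two partitions by some attribute A
parent = ["yes"] * 5 + ["no"] * 5
partitions = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
print(round(information_gain(parent, partitions), 3))   # 0.278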
REFERENCES
1. Ethem Alpaydin. Machine learning. MIT press,
2021.
2. S Dutt, S Chandramouli, A K Das. Machine
Learning. Pearson, 2022.