Classification and Clustering Algorithm Notes

The document provides an overview of classification algorithms, a supervised learning technique used to categorize new observations based on training data. It details various types of classification algorithms, including Logistic Regression, Naive Bayes, K-Nearest Neighbors, Decision Trees, Random Forests, and Support Vector Machines, along with their evaluation methods. Additionally, it briefly mentions K-Means clustering as a separate technique for grouping data points into clusters.

Classification is a supervised learning technique used to identify the
category of new observations on the basis of training data. In
classification, a program learns from a given dataset of observations and
then assigns each new observation to one of a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog. Classes are also
called targets, labels, or categories. Unlike regression, the output
variable of classification is a category, not a numeric value, for example
"Green or Blue" or "fruit or animal". Because classification is a
supervised learning technique, it takes labelled input data, meaning each
input comes with its corresponding output.

Classification algorithms can be broadly divided into two categories:

o Linear Models
    o Logistic Regression
    o Support Vector Machines
o Non-linear Models
    o K-Nearest Neighbours
    o Kernel SVM
    o Naïve Bayes
    o Decision Tree Classification
    o Random Forest Classification

Learners in Classification Problems

There are two types of learners.

• Lazy Learners

A lazy learner simply stores the training dataset and waits until a test
observation arrives. Classification is then carried out using the most
closely related data in the stored training set. Less time is spent on
training, but more time is spent on prediction. Examples are case-based
reasoning and the KNN algorithm.

• Eager Learners

Eager learners build a classification model from the training dataset
before any test data arrives. They spend more time on training and less
time on prediction. Examples are ANN, Naive Bayes, and decision trees.

Types of Classification Algorithms

1. Logistic Regression

Logistic regression is a supervised learning classification technique that
estimates the probability that an observation belongs to a class. In its
basic (binary) form there are only two possible classes: data is coded as
1 or yes, representing success, or as 0 or no, representing failure. It is
a natural choice when the outcome to be predicted is categorical, such as
true or false, yes or no, or 0 or 1. For example, logistic regression can
be used to determine whether or not an email is spam.
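
As a rough illustration of the idea, the sketch below turns a weighted sum of features into a probability with the logistic (sigmoid) function and then applies a 0.5 threshold. The feature choice, the weights and the bias are hypothetical values picked for the example; in practice they would be learned from training data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Estimated probability of the positive class (here: spam)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Return 1 (spam) if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical model: features are (number of links, contains the word "free").
WEIGHTS = (0.8, 1.5)
BIAS = -2.0

print(predict_proba((3, 1), WEIGHTS, BIAS))  # high probability of spam
print(predict((0, 0), WEIGHTS, BIAS))        # classified as not spam
```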

2. Naive Bayes

Naive Bayes determines whether a data point falls into a particular category.
It can be used to classify phrases or words in text analysis as either falling
within a predetermined classification or not. It assumes that predictors in a
dataset are independent. This means that it assumes the features are
unrelated to each other. For example, if given a banana, the classifier will
see that the fruit is of yellow color, oblong-shaped and long and tapered. All
of these features will contribute independently to the probability of it being a
banana and are not dependent on each other. Naive Bayes is based on
Bayes’ theorem, which is given as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Figure 3: Bayes’ Theorem

Where:

P(A | B) = how often A happens given that B happens

P(A) = how likely A is to happen

P(B) = how likely B is to happen

P(B | A) = how often B happens given that A happens


Example training data for a text classifier:

Text                                          Tag

“A great game”                                Sports
“The election is over”                        Not Sports
“What a great score”                          Sports
“A clean and unforgettable game”              Sports
“The spelling bee winner was a surprise”      Not Sports
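
The tagged sentences above can be used to sketch a small Naive Bayes text classifier. The version below is one possible implementation, not the only one: it assumes a word-count (multinomial) model, lowercase whitespace tokenization, and add-one (Laplace) smoothing so that unseen words do not produce a zero probability. The query sentence is made up for the example.

```python
import math
from collections import Counter

TRAINING = [
    ("A great game", "Sports"),
    ("The election is over", "Not Sports"),
    ("What a great score", "Sports"),
    ("A clean and unforgettable game", "Sports"),
    ("The spelling bee winner was a surprise", "Not Sports"),
]

def tokenize(text):
    return text.lower().split()

def train(examples):
    """Count sentences per tag, words per tag, and collect the vocabulary."""
    word_counts = {}        # tag -> Counter of word frequencies
    doc_counts = Counter()  # tag -> number of sentences
    vocab = set()
    for text, tag in examples:
        words = tokenize(text)
        doc_counts[tag] += 1
        word_counts.setdefault(tag, Counter()).update(words)
        vocab.update(words)
    return word_counts, doc_counts, vocab

def predict(text, model):
    """Return the tag with the highest log P(tag) + sum of log P(word | tag)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for tag in doc_counts:
        total_words = sum(word_counts[tag].values())
        score = math.log(doc_counts[tag] / total_docs)  # prior P(tag)
        for word in tokenize(text):
            # Add-one smoothing: each word is treated as independent evidence.
            score += math.log((word_counts[tag][word] + 1)
                              / (total_words + len(vocab)))
        scores[tag] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
print(predict("a very close game", model))
```

Working in log space avoids multiplying many small probabilities together, which could underflow on longer texts.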

3. K-Nearest Neighbors

k-NN estimates which group a data point belongs to based on which groups
the data points closest to it are part of. When using k-NN for
classification, a point is classified according to the classes of its
nearest neighbors.

The parameter k in k-NN is the number of labeled points (neighbors)
considered for classification. Our task is to calculate the distance from
the unknown point to the labeled points and identify which categories are
closest to it.

Given a point whose class we do not know, we can try to understand which
points in our feature space are closest to it. These points are the k-nearest
neighbors. Since similar things occupy similar places in feature space, it’s
very likely that the point belongs to the same class as its neighbors. Based on
that, it’s possible to classify a new point as belonging to one class or another.
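
A minimal sketch of this procedure, assuming Euclidean distance and a simple majority vote among the k nearest labeled points. The sample points and labels are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a tuple of numbers."""
    # Sort the labeled points by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # The most common label among those neighbors is the prediction.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

points = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(points, (1.2, 1.5), k=3))
print(knn_predict(points, (6.4, 6.4), k=3))
```

Note that all the work happens at prediction time, which is why k-NN is described as a lazy learner.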

Some common methods for selecting k are described below.

1. Square root method

A common rule of thumb sets k to the square root of the total number of
samples in the training dataset. Use an error plot or accuracy plot to
refine this starting value. KNN performs well on problems with several
classes, but on data with many outliers it can fail, and you’ll need to use
other methods.

2. Cross-validation method (Elbow method)

Begin with k=1, then perform cross-validation (5 to 10 folds are common
practice because they balance computational effort against statistical
validity), and evaluate the accuracy. Repeat for increasing values of k
until the results are consistent. As k goes up, the error usually
decreases, then stabilizes, and then grows again. The optimal k lies at the
beginning of the stable zone.
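
The loop over candidate values of k might look like the sketch below. To keep it short it uses leave-one-out cross-validation (each point is held out once) rather than the 5 to 10 folds mentioned above, and it assumes a tiny made-up one-dimensional dataset; `loo_error` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(data, k):
    """Fraction of points misclassified when each is held out in turn."""
    mistakes = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # training set without the held-out point
        if knn_predict(rest, point, k) != label:
            mistakes += 1
    return mistakes / len(data)

data = [
    ((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"),
    ((7.0,), "b"), ((8.0,), "b"), ((9.0,), "b"),
]

errors = {k: loo_error(data, k) for k in (1, 3, 5)}
print(errors)
```

On this toy data the error stays at zero for small k and jumps once k is large enough that neighbors from the other class outvote the correct one, which is the "grows again" part of the elbow.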

K-distance is the distance between data points and a given query point. To
calculate it, we have to pick a distance metric.
Some of the most popular metrics are explained below.

Euclidean distance

The Euclidean distance between two points is the length of the straight line
segment connecting them. This most common distance metric is applied to
real-valued vectors.

Manhattan distance

The Manhattan distance between two points is the sum of the absolute
differences between their corresponding coordinates. It measures the
distance travelled when moving only along axis-aligned segments, like a
taxi driving along city blocks, so it’s also known as the taxicab distance.

Minkowski distance

Minkowski distance generalizes the Euclidean and Manhattan distances. It
adds a parameter called “order” that allows different distance measures to
be calculated: order 1 gives the Manhattan distance and order 2 gives the
Euclidean distance. Minkowski distance is a distance between two points in
a normed vector space.

Hamming distance

Hamming distance is used to compare two binary vectors (also called data
strings or bitstrings) of equal length: it is the number of positions at
which they differ. To calculate it, the data first has to be encoded in
binary form.
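
The four metrics can be written in a few lines each. This is a plain-Python sketch in which points are tuples of numbers and bitstrings are ordinary strings; the function names are chosen for the example.

```python
def minkowski(a, b, order):
    """Minkowski distance of the given order between two points."""
    return sum(abs(x - y) ** order for x, y in zip(a, b)) ** (1 / order)

def manhattan(a, b):
    """Order 1: sum of absolute coordinate differences."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Order 2: straight-line distance."""
    return minkowski(a, b, 2)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(hamming("10110", "11100"))
```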

4. Decision Tree

A decision tree is a supervised learning method. Although it can solve
both regression and classification problems, it is most commonly used for
classification. Similar to a flow chart, it divides the data points into
groups of similar items one step at a time, starting with the "tree trunk"
and moving through the "branches" and "leaves" until the members of each
group are closely related to one another.
A decision tree builds classification or regression models in the form of
a tree structure. It breaks down a dataset into smaller and smaller
subsets while at the same time an associated decision tree is
incrementally developed. The final result is a tree with decision nodes
and leaf nodes. A decision node (e.g., Outlook) has two or more branches
(e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a
classification or decision. The topmost decision node in a tree, which
corresponds to the best predictor, is called the root node. Decision trees
can handle both categorical and numerical data. Unlike Naive Bayes, a
decision tree can include all predictors while allowing for dependence
between them.

Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain
instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If
the sample is completely homogeneous the entropy is zero, and if the sample is equally divided between two classes it has
an entropy of one.

To build a decision tree, we need to calculate two types of entropy using frequency tables as follows:
a) Entropy using the frequency table of one attribute:

$$E(S) = \sum_{i=1}^{n} -p_i \log_2 p_i$$

where p_i is the proportion of samples in S that belong to class i.

b) Entropy using the frequency table of two attributes:

$$E(T, X) = \sum_{v \in X} P(v)\, E(T_v)$$

where T is the target, X is the attribute, P(v) is the proportion of samples with X = v, and E(T_v) is the entropy of the
target within that subset.

Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a
decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous
branches).

Step 1: Calculate the entropy of the target.

Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated. Then it is added
proportionally, to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split.
The result is the information gain, or decrease in entropy:

$$Gain(T, X) = E(T) - E(T, X)$$

Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches and
repeat the same process on every branch.

Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.

Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
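
Steps 1 and 2 can be checked numerically with a short sketch. The counts below (how many Yes/No outcomes fall under each Outlook value) are illustrative assumptions, not data taken from these notes; `make_rows` is a hypothetical helper for building the toy dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)  # added proportionally
    return before - after

def make_rows(outlook, yes, no):
    """Build `yes` + `no` rows for one Outlook value (toy data)."""
    return ([{"Outlook": outlook, "Play": "Yes"}] * yes
            + [{"Outlook": outlook, "Play": "No"}] * no)

rows = (make_rows("Sunny", 3, 2)
        + make_rows("Overcast", 4, 0)
        + make_rows("Rainy", 2, 3))

print(round(entropy([row["Play"] for row in rows]), 3))
print(round(information_gain(rows, "Outlook", "Play"), 3))
```

The Overcast branch contains only one class, so its entropy is 0 and it would become a leaf node (Step 4a); the other two branches would be split further (Step 4b).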

5. Random Forest Algorithm

The random forest algorithm is an extension of the decision tree
algorithm: you first build a number of decision trees from the training
data, then pass each new data point through every tree and combine the
trees’ outputs into a single prediction. For classification the forest
takes a majority vote; for regression it averages the trees’ outputs.
Combining many trees reduces a single decision tree’s tendency to overfit,
that is, to force data points unnecessarily into a category.

The following steps explain how the random forest algorithm works:

Step 1: Select random samples from the given data or training set.

Step 2: Construct a decision tree for every sample.

Step 3: Each decision tree makes a prediction, and the predictions are put
to a vote.

Step 4: Finally, select the most voted prediction result as the final
prediction result.

This combination of multiple models is called an ensemble. Ensembles use
two main methods:

1. Bagging: Creating different training subsets by sampling the training
data with replacement is called bagging. The final output is based
on majority voting.

2. Boosting: Combining weak learners into a strong learner by building
models sequentially, each one trying to correct the errors of the
previous ones so that the final model has higher accuracy, is called
boosting. Examples: AdaBoost, XGBoost.
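
Bagging with majority voting can be sketched as follows. This is a simplified illustration under several assumptions: the data is one-dimensional, and each "tree" is a one-split decision stump (`fit_stump` is a hypothetical helper). A real random forest grows deeper trees and also considers a random subset of features at each split, which this sketch leaves out.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split tree on (x, label) pairs and return a predict function."""
    xs = sorted({x for x, _ in sample})
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    best = None  # (errors, threshold, left_label, right_label)
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [label for x, label in sample if x <= threshold]
        right = [label for x, label in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the sample had a single distinct x value
        return lambda x: majority
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def bagged_predict(data, x, n_trees=25, seed=0):
    """Train n_trees stumps on bootstrap samples and take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        # Bootstrap: draw len(data) items with replacement.
        sample = [rng.choice(data) for _ in data]
        votes.append(fit_stump(sample)(x))
    return Counter(votes).most_common(1)[0][0]

data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
print(bagged_predict(data, 0), bagged_predict(data, 10))
```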

6. Support Vector Machine


Support Vector Machine is a popular supervised machine learning technique
for classification and regression problems. In its basic form it separates
the data into two classes, for example positive and negative.

The goal of the SVM algorithm is to find the best line or decision boundary
that separates n-dimensional space into classes, so that new data points
can easily be put in the correct category in the future. This best
decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the
algorithm is termed Support Vector Machine. In the typical two-class
picture, the two categories are separated by a decision boundary or
hyperplane.
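
Once a hyperplane has been found, classifying a new point only requires checking which side of it the point lies on. The sketch below shows that step for a linear boundary w·x + b = 0. It does not train an SVM: the weight vector `W` and offset `B` are hypothetical values standing in for the result of training.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; zero means the point lies on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hypothetical trained boundary: x1 + x2 - 5 = 0.
W = (1.0, 1.0)
B = -5.0

print(classify(W, B, (4.0, 4.0)))
print(classify(W, B, (1.0, 1.0)))
print(distance_to_hyperplane(W, B, (3.0, 4.0)))
```

The training points with the smallest distance to the boundary are the support vectors; SVM training chooses the hyperplane that makes this smallest distance (the margin) as large as possible.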

Evaluating a Classification model:

Once our model is completed, it is necessary to evaluate its performance,
whether it is a classification or a regression model. For evaluating a
classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output
is a probability value between 0 and 1.
o For a good binary classification model, the value of log loss should be
near 0.
o The value of log loss increases as the predicted value deviates from the
actual value.
o A lower log loss indicates more accurate predicted probabilities.
o For binary classification, cross-entropy can be calculated as:

$$-\left(y \log(p) + (1 - y) \log(1 - p)\right)$$

Where y = actual output (0 or 1), p = predicted probability.
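
The formula can be averaged over a set of predictions as in the sketch below. The clipping constant `eps` is an assumption added here so that a predicted probability of exactly 0 or 1 does not lead to log(0).

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly between 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident = log_loss([1, 0], [0.9, 0.1])  # predictions close to the labels
hesitant = log_loss([1, 0], [0.6, 0.4])   # predictions further away
print(confident, hesitant)
```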

2. Confusion Matrix:

o The confusion matrix provides a matrix/table as output that describes
the performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, showing the total number
of correct predictions and incorrect predictions. The matrix looks like
the table below:

                        Actual Positive      Actual Negative

Predicted Positive      True Positive        False Positive

Predicted Negative      False Negative       True Negative
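
Counting the four cells from lists of actual and predicted labels is straightforward. The sketch below assumes binary labels with 1 as the positive class; the label lists are invented for the example.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm, accuracy)
```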

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic and AUC stands for
Area Under the Curve.
o It is a graph that shows the performance of the classification model at
different thresholds.
o It is used to visualize the performance of a classification model; for
a multi-class model, a curve is typically drawn for each class against
the rest.
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis
and FPR (False Positive Rate) on the X-axis.
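
A sketch of how the curve and its area can be computed from predicted scores: each distinct score is used as a threshold, the (FPR, TPR) point is recorded, and the area is estimated with the trapezoid rule. It assumes binary labels (1 = positive) and that both classes are present; the labels and scores are made up.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

An AUC of 1.0 means every positive is scored above every negative, while 0.5 corresponds to random ranking.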

K-Means Clustering Algorithm

The K-means clustering algorithm computes centroids and repeats
until the optimal centroids are found. The number of clusters is
assumed to be known in advance. It is also known as the flat
clustering algorithm. The number of clusters to be found in the data
is denoted by the letter ‘K’ in K-means.

In this method, data points are assigned to clusters in such a way
that the sum of the squared distances between the data points and
their cluster centroid is as small as possible. Less variation within
a cluster means the data points in that cluster are more similar to
each other.

The following steps explain how the K-means clustering technique
works:
• Step 1: First, provide the number of clusters, K, that the
algorithm should generate.
• Step 2: Next, choose K data points at random and assign each
to its own cluster; these serve as the initial cluster centers.
• Step 3: The cluster centroids will now be computed.
• Step 4: Iterate the steps below until the centroids are stable,
that is, until the assignment of data points to clusters no
longer changes.
• 4.1 First, calculate the squared distances between the data
points and the centroids.
• 4.2 Then allocate each data point to the cluster whose centroid
is closest to it.
• 4.3 Finally, recompute the centroids of the clusters by averaging
all of each cluster’s data points.
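
The steps above can be sketched in plain Python as follows. This is a minimal version under a few assumptions: points are tuples of numbers, the initial centroids are K randomly chosen data points, and an iteration cap (`max_iter`) is added as a safeguard. The sample points are invented.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 2: K random data points
    assignment = None
    for _ in range(max_iter):
        # Steps 4.1 and 4.2: assign each point to the closest centroid
        # (smallest squared distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:  # assignments stopped changing
            break
        assignment = new_assignment
        # Step 4.3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

Which cluster gets index 0 depends on the random initial choice, so only the grouping of the points is meaningful, not the cluster numbers themselves.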

Applications of K-Means clustering

• To get relevant insights from the data we’re dealing with.
• Distinct models will be created for different subgroups in a
cluster-then-predict approach.
• Market segmentation
• Document clustering
• Image segmentation
• Image compression
• Customer segmentation
• Analysing the trend on dynamic data
