
GATE in Data Science and AI Study Materials

Machine Learning
By Piyush Wairale

Instructions:

• Kindly go through the lectures/videos on our website www.piyushwairale.com

• Read this study material carefully and make your own handwritten short notes. (Short notes must not be more than 5-6 pages.)

• Attempt the questions available on the portal.

• Revise this material at least 5 times, and once you have prepared your short notes, revise your short notes twice a week.

• If you are not able to understand any topic or require a more detailed explanation, please mention it in our discussion forum on the website.

• Let me know if there are any typos or mistakes in the study materials. Mail me at piyushwairale100@gmail.com

1 Intro to Unsupervised Learning

• Unsupervised learning in artificial intelligence is a type of machine learning that learns from data without human supervision.

• Unlike supervised learning, unsupervised machine learning models are given unlabeled data and allowed to discover patterns and insights without any explicit guidance or instruction.

• Unsupervised learning is a machine learning paradigm where training examples lack labels, and clustering prototypes are typically initialized randomly. The primary goal is to optimize these cluster prototypes based on similarities among the training examples.

• Unsupervised learning deals with unlabeled data and aims to group similar data items into clusters. It differs from supervised learning, where labeled data is used for classification or regression tasks. Unsupervised learning has applications in text clustering and other domains and can be adapted for supervised learning when necessary.

• Unsupervised learning is the opposite of supervised learning. In supervised learning, training examples are labeled with output values, and the algorithms aim to minimize errors or misclassifications. In unsupervised learning, the focus is on maximizing similarities between cluster prototypes and data items.

• Unsupervised learning doesn't refer to a specific algorithm but rather a general framework. The process involves deciding on the number of clusters, initializing cluster prototypes, and iteratively assigning data items to clusters based on similarity. These clusters are then updated until convergence is achieved.

• Unsupervised learning algorithms have various real-world applications. For instance, they can be used in text clustering, as seen in the application of algorithms like AHC and Kohonen Networks to manage and cluster textual data. They can also be used for tasks like detecting redundancy in national research projects.

• While unsupervised learning focuses on clustering, it is possible to modify unsupervised learning algorithms to function in supervised learning scenarios.

1.1 Working

As the name suggests, unsupervised learning uses self-learning algorithms: they learn without any labels or prior training. Instead, the model is given raw, unlabeled data and has to infer its own rules and structure the information based on similarities, differences, and patterns, without explicit instructions on how to work with each piece of data.


Unsupervised learning algorithms are better suited for more complex processing tasks, such as organizing large datasets into clusters. They are useful for identifying previously undetected patterns in data and can help identify features useful for categorizing data.

Imagine that you have a large dataset about weather. An unsupervised learning algorithm will go through the data and identify patterns in the data points. For instance, it might group data by temperature or similar weather patterns.
While the algorithm itself does not understand these patterns based on any previous information you provided, you can then go through the data groupings and attempt to classify them based on your understanding of the dataset. For instance, you might recognize that the different temperature groups represent all four seasons or that the weather patterns are separated into different types of weather, such as rain, sleet, or snow.

Clustering is one of the most popular unsupervised machine learning approaches. There are several types of unsupervised learning algorithms that are used for clustering, which include exclusive, overlapping, hierarchical, and probabilistic.

• Exclusive clustering: Data is grouped in a way where a single data point can only exist in one cluster. This is also referred to as "hard" clustering. A common example of exclusive clustering is the K-means clustering algorithm, which partitions data points into a user-defined number K of clusters.

• Overlapping clustering: Data is grouped in a way where a single data point can exist in two or more clusters with different degrees of membership. This is also referred to as "soft" clustering.

• Hierarchical clustering: Data is divided into distinct clusters based on similarities, which are then repeatedly merged and organized based on their hierarchical relationships. There are two main types of hierarchical clustering: agglomerative and divisive clustering. This method is also referred to as HAC (hierarchical cluster analysis).

• Probabilistic clustering: Data is grouped into clusters based on the probability of each data point belonging to each cluster. This approach differs from the other methods, which group data points based on their similarities to others in a cluster.

1.2 Unsupervised machine learning methods

In general, there are three types of unsupervised learning tasks: clustering, association rules, and dimensionality reduction.

1. Clustering
Clustering is a technique for exploring raw, unlabeled data and breaking it down into groups (or clusters) based on similarities or differences. It is used in a variety of applications, including customer segmentation, fraud detection, and image analysis. Clustering algorithms split data into natural groups by finding similar structures or patterns in uncategorized data.

2. Association
Association rule mining is a rule-based approach to reveal interesting relationships between data points in large datasets. Unsupervised learning algorithms search for frequent if-then associations (also called rules) to discover correlations and co-occurrences within the data and the different connections between data objects.
It is most commonly used to analyze retail baskets or transactional datasets to represent how often certain items are purchased together.


These algorithms uncover customer purchasing patterns and previously hidden relationships between products that help inform recommendation engines or other cross-selling opportunities. You might be most familiar with these rules from the "Frequently bought together" and "People who bought this item also bought" sections on your favorite online retail shop.
Association rules are also often used to organize medical datasets for clinical diagnoses. Using unsupervised machine learning and association rules can help doctors identify the probability of a specific diagnosis by comparing relationships between symptoms from past patient cases.
Typically, Apriori algorithms are the most widely used for association rule learning to identify related collections of items or sets of items. However, other types are used, such as the Eclat and FP-growth algorithms. (A small worked example of support and confidence appears at the end of this section.)

3. Dimensionality reduction
Dimensionality reduction is an unsupervised learning technique that reduces the number of features, or dimensions, in a dataset. More data is generally better for machine learning, but it can also make it more challenging to visualize the data.
Dimensionality reduction extracts important features from the dataset, reducing the number of irrelevant or random features present. This method uses principal component analysis (PCA) and singular value decomposition (SVD) algorithms to reduce the number of data inputs without compromising the integrity of the properties in the original data.

1.3 Supervised learning vs. unsupervised learning

The main difference between supervised learning and unsupervised learning is the type of input data that you use. Unlike unsupervised machine learning algorithms, supervised learning relies on labeled training data to determine whether pattern recognition within a dataset is accurate.
The goals of supervised learning models are also predetermined, meaning that the type of output of a model is already known before the algorithms are applied. In other words, the input is mapped to the output based on the training data.

The main differences between the two are as follows:

• Training data: Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
• Feedback: A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
• Output: A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
• Input: In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
• Goal: The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in the unknown dataset.
• Supervision: Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
• Problem types: Supervised learning can be categorized into classification and regression problems; unsupervised learning can be classified into clustering and association problems.
• Use cases: Supervised learning is used where we know the inputs as well as the corresponding outputs; unsupervised learning is used where we have only input data and no corresponding output data.
• Accuracy: A supervised learning model generally produces a more accurate result; an unsupervised learning model may give a less accurate result in comparison.
• Relation to AI: Supervised learning is considered less close to "true" artificial intelligence, since we first train the model for each kind of data before it can predict the correct output; unsupervised learning is considered closer, as it learns the way a child learns daily routine things from experience.
• Algorithms: Supervised learning includes algorithms such as linear regression, logistic regression, support vector machines, multi-class classification, decision trees, Bayesian logic, etc.; unsupervised learning includes algorithms such as clustering, KNN, and the Apriori algorithm.
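To make the association-rule idea from Section 1.2 concrete, here is a minimal Python sketch of computing the support of an itemset and the confidence of a rule on a small, made-up basket dataset (the baskets and item names are purely illustrative; a full Apriori implementation would additionally prune infrequent itemsets level by level):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data (each basket is a set of purchased items).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
n = len(baskets)

# Count how often each individual item and each item pair occurs.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(pair for b in baskets for pair in combinations(sorted(b), 2))

# Support of {bread, milk}: fraction of baskets containing both items.
support = pair_counts[("bread", "milk")] / n

# Confidence of the rule bread -> milk: P(milk | bread).
confidence = pair_counts[("bread", "milk")] / item_counts["bread"]

print(f"support({{bread, milk}}) = {support:.2f}")      # 3/5 = 0.60
print(f"confidence(bread -> milk) = {confidence:.2f}")  # 3/4 = 0.75
```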


1.4 Types of Clustering

Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.

The various types of clustering are:
• Hierarchical clustering
• Partitioning clustering

Hierarchical clustering is further subdivided into:
• Agglomerative clustering
• Divisive clustering

Partitioning clustering is further subdivided into:
• K-Means clustering
• Fuzzy C-Means clustering

There are many different fields where cluster analysis is used effectively, such as:

• Text data mining: this includes tasks such as text categorization, text clustering, document summarization, concept extraction, sentiment analysis, and entity relation modelling.

• Customer segmentation: creating clusters of customers on the basis of parameters such as demographics, financial conditions, buying habits, etc., which can be used by retailers and advertisers to promote their products in the correct segment.

• Anomaly checking: checking for anomalous behaviours such as fraudulent bank transactions, unauthorized computer intrusions, suspicious movements on a radar scanner, etc.

• Data mining: simplifying the data mining task by grouping a large number of features from an extremely large data set to make the analysis manageable.

2 K-Means Clustering

K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects such that all the objects in one cluster have similar traits; a group of n objects is broken down into k clusters based on their similarities.

• The K-Means algorithm is a popular clustering algorithm used in unsupervised machine learning to partition a dataset into K distinct, non-overlapping clusters.

• It aims to find cluster centers (centroids) and assign data points to the nearest centroid based on their similarity.

• K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

• Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known, or labelled, outcomes.

• A cluster refers to a collection of data points aggregated together because of certain similarities.

• K-means is a very simple to implement clustering algorithm that works by initially selecting k centroids, where k is an input to the algorithm and can be defined as the number of clusters required. The centroids serve as the center of each new cluster.

• We first assign each data point of the given data to the nearest centroid. Next, we calculate the mean of each cluster, and the means then serve as the new centroids. This step is repeated until the positions of the centroids do not change anymore.

• The goal of k-means is to minimize the sum of the squared distances between each data point and its centroid.

• The algorithm aims to minimize the sum of squared distances between data points and their respective cluster centroids:

J = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2

where:
K is the number of clusters,
n_i is the number of data points in cluster i,
x_{ij} is the j-th data point in cluster i,
c_i is the centroid of cluster i.
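As a small worked example with made-up one-dimensional data: suppose K = 2, cluster 1 contains the points 1 and 2 (centroid c_1 = 1.5), and cluster 2 contains the points 8, 9 and 10 (centroid c_2 = 9). Then

J = (1 - 1.5)^2 + (2 - 1.5)^2 + (8 - 9)^2 + (9 - 9)^2 + (10 - 9)^2 = 0.25 + 0.25 + 1 + 0 + 1 = 2.5,

which is the smallest value of J achievable for any split of these five points into two clusters.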


• The choice of the number of clusters (K) is a critical decision and often requires domain knowledge or experimentation. Various methods, such as the elbow method or silhouette score, can help in determining an optimal K value.

• K-Means is computationally efficient and works well for large datasets, but it may not perform well on data with irregularly shaped or non-convex clusters.

• The algorithm may converge to a local minimum, and it's not guaranteed to find the global optimum.

Here's an overview of the K-Means algorithm (a short NumPy sketch appears at the end of this section):

• Initialization: Choose the number of clusters (K) you want to create. Initialize K cluster centroids randomly. These centroids can be selected from the data points or with random values.

• Assignment Step: For each data point in the dataset, calculate the distance between the data point and all K centroids. Assign the data point to the cluster associated with the nearest centroid. This step groups data points into clusters based on similarity.

• Update Step: Recalculate the centroids for each cluster by taking the mean of all data points assigned to that cluster. The new centroids represent the center of their respective clusters.

• Convergence Check: Check if the algorithm has converged. Convergence occurs when the centroids no longer change significantly between iterations. If convergence is not reached, return to the Assignment and Update steps.

• Termination: Once convergence is achieved or a predetermined number of iterations is completed, the algorithm terminates. The final cluster assignments and centroids are obtained.

• Results: The result of the K-Means algorithm is K clusters, each with its centroid. Data points are assigned to the nearest cluster, and you can use these clusters for various purposes, such as data analysis, segmentation, or pattern recognition.

Key Points and Considerations:

• K-Means is sensitive to the initial placement of centroids. Different initializations can lead to different results. Therefore, it's common to run the algorithm multiple times with different initializations and select the best result.

Strength and Weakness of K-means (the summary table from this page is not reproduced here)
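The steps above translate almost directly into code. Below is a minimal NumPy sketch of the standard K-Means loop (the random data and the helper name kmeans are illustrative, not part of the notes); library implementations such as scikit-learn's KMeans add smarter initialization (k-means++) and multiple restarts:

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence check: stop when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)            # close to (0, 0) and (5, 5)
print(np.bincount(labels))  # roughly 50 points per cluster
```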


2.1 K-medoids algorithm

• K-medoids is an unsupervised method with unlabelled data to be clustered. It is an improved version of the K-Means algorithm, mainly designed to deal with K-Means' sensitivity to outlier data.

• Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.

• K-medoids is a clustering method like k-means, but unlike k-means, rather than minimizing the sum of squared distances, k-medoids minimizes a sum of pairwise dissimilarities.

• We find this useful since k-medoids aims to form clusters where the objects within each cluster are more similar to each other and dissimilar to objects in the other clusters. Instead of centroids, this approach makes use of medoids.

• Medoids are points in the dataset whose sum of distances to other points in the cluster is minimal.

There are three types of algorithms for K-Medoids clustering:

1. PAM (Partitioning Around Medoids)
2. CLARA (Clustering Large Applications)
3. CLARANS (Randomized Clustering Large Applications)

PAM is the most powerful of the three algorithms but has the disadvantage of time complexity. The K-Medoids examples that follow use PAM; CLARA and CLARANS are covered in later parts.

3 Hierarchical clustering

Till now, we have discussed the various methods for partitioning the data into different clusters. But there are situations when the data needs to be partitioned into groups at different levels, such as in a hierarchy. The hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
For example, in a machine learning problem of organizing the employees of a university into different departments, first the employees are grouped under the different departments, and then within each department the employees can be grouped according to their roles, such as professors, assistant professors, supervisors, lab assistants, etc. This creates a hierarchical structure of the employee data and eases visualization and analysis. Similarly, there may be a data set which has an underlying hierarchical structure that we want to discover, and we can use the hierarchical clustering methods to achieve that.

• Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters (or groups) in a given dataset.

• Hierarchical clustering produces clusters in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level.

• At the lowest level, each cluster contains a single observation. At the highest level there is only one cluster containing all of the data.

• The decision regarding whether two clusters are to be merged or not is taken based on the measure of dissimilarity between the clusters. The distance between two clusters is usually taken as the measure of dissimilarity between the clusters.

• Hierarchical clustering is more interpretable than other clustering techniques because it provides a full hierarchy of clusters.

• The choice of linkage method and distance metric can significantly impact the results and the structure of the dendrogram.

• Dendrograms are useful for visualizing the hierarchy and helping to decide how many clusters are appropriate for a particular application.

• Hierarchical clustering can be computationally intensive, especially for large datasets, and may not be suitable for such cases.


3.1 Hierarchical Clustering Methods

There are two main hierarchical clustering methods: agglomerative clustering and divisive clustering.
Agglomerative clustering is a bottom-up technique which starts with individual objects as clusters and then iteratively merges them to form larger clusters. On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
In both these cases, it is important to select the split and merger points carefully, because the subsequent splits or mergers will use the result of the previous ones and there is no option to perform any object swapping between the clusters or rectify the decisions made in previous steps, which may result in poor clustering quality at the end.

A dendrogram is a commonly used tree structure representation of the step-by-step creation of a hierarchical clustering. It shows how the clusters are merged iteratively (in the case of agglomerative clustering) or split iteratively (in the case of divisive clustering) to arrive at the optimal clustering solution.

3.1.1 Dendrogram

• Hierarchical clustering can be represented by a rooted binary tree. The nodes of the tree represent groups or clusters. The root node represents the entire data set. The terminal nodes each represent one of the individual observations (singleton clusters). Each nonterminal node has two daughter nodes.

• The distance between merged clusters is monotone increasing with the level of the merger. The height of each node above the level of the terminal nodes in the tree is proportional to the value of the distance between its two daughters (see Figure 13.9).

• A dendrogram is a tree diagram used to illustrate the arrangement of the clusters produced by hierarchical clustering.

• The dendrogram may be drawn with the root node at the top and the branches growing vertically downwards (see Figure 13.8(a)).

• It may also be drawn with the root node at the left and the branches growing horizontally rightwards (see Figure 13.8(b)).

• In some contexts, the opposite directions may also be more appropriate.

• Dendrograms are commonly used in computational biology to illustrate the clustering of genes or samples.

Example: Figure 13.7 is a dendrogram of the dataset {a, b, c, d, e}. Note that the root node represents the entire dataset and the terminal nodes represent the individual observations. However, dendrograms are usually presented in a simplified format in which only the terminal nodes (that is, the nodes representing the singleton clusters) are explicitly displayed. Figure 13.8 shows the simplified format of the dendrogram in Figure 13.7. Figure 13.9 shows the distances of the clusters at the various levels. Note that the clusters are at 4 levels. The distance between the clusters {a} and {b} is 15, between {c} and {d} is 7.5, between {c, d} and {e} is 15, and between {a, b} and {c, d, e} is 25.

Figure 13.9: A dendrogram of the dataset {a, b, c, d, e} showing the distances (heights) of the clusters at different levels (figure not reproduced; the sketch below encodes the same merges).
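The merge heights quoted in this example can be encoded directly as a SciPy linkage matrix, which makes it easy to reproduce and cut such a dendrogram (the array below simply transcribes the distances given above):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, dendrogram

# Observations a..e get indices 0..4; each later merge creates a new cluster id.
# Rows: [cluster_1, cluster_2, merge_height, size_of_new_cluster]
Z = np.array([
    [2, 3,  7.5, 2],   # {c} + {d}          -> cluster 5 at height 7.5
    [0, 1, 15.0, 2],   # {a} + {b}          -> cluster 6 at height 15
    [4, 5, 15.0, 3],   # {e} + {c, d}       -> cluster 7 at height 15
    [6, 7, 25.0, 5],   # {a, b} + {c, d, e} -> root at height 25
])

# Cutting the tree at height 20 yields the two clusters {a, b} and {c, d, e}.
print(fcluster(Z, t=20, criterion="distance"))   # e.g. [1 1 2 2 2]

# dendrogram(Z, labels=list("abcde")) draws the tree (requires matplotlib).
```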


3.1.2 Agglomerative Hierarchical Clustering (Bottom-Up)

The agglomerative hierarchical clustering method uses the bottom-up strategy. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.

For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the agglomerative method as shown in Figure 13.10. Each nonterminal node has two daughter nodes. The daughters represent the two groups that were merged to form the parent.

If there are N observations in the dataset, there will be N − 1 levels in the hierarchy. The pair chosen for merging consists of the two groups with the smallest "intergroup dissimilarity".

Here's a step-by-step explanation of the agglomerative hierarchical clustering algorithm (a short library-based sketch follows the steps):

• Step 1: Initialization. Start with each data point as a separate cluster. If you have N data points, you initially have N clusters.

• Step 2: Merge Clusters. Calculate the pairwise distances (similarity or dissimilarity) between all clusters. Common distance metrics include Euclidean distance, Manhattan distance, or other similarity measures. Merge the two closest clusters based on the distance metric. There are various linkage methods to define cluster distance:
  – Single Linkage: Merge clusters based on the minimum distance between any pair of data points from the two clusters.
  – Complete Linkage: Merge clusters based on the maximum distance between any pair of data points from the two clusters.
  – Average Linkage: Merge clusters based on the average distance between data points in the two clusters.
  – Ward's Method: Minimize the increase in variance when merging clusters.
  Repeat this merging process iteratively until all data points are in a single cluster or until you reach the desired number of clusters.

• Step 3: Dendrogram. During the merging process, create a dendrogram, which is a tree-like structure that represents the hierarchy of clusters. The dendrogram provides a visual representation of how clusters merge and shows the relationships between clusters at different levels of granularity.

• Step 4: Cutting the Dendrogram. To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain. The cut produces the final clusters at the chosen level of granularity.

• Step 5: Results. The resulting clusters are obtained based on the cut level. Each cluster contains a set of data points that are similar to each other according to the chosen linkage method.
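A minimal sketch of these steps using SciPy (the two-blob data is illustrative; linkage builds the merge hierarchy and fcluster performs the dendrogram cut):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (20, 2)), rng.normal(3, 0.4, (20, 2))])

# Steps 1-2: build the full merge hierarchy.
# method can be "single", "complete", "average", or "ward".
Z = linkage(X, method="average", metric="euclidean")

# Step 3: Z is the dendrogram in matrix form; each row is one merge
# (cluster_1, cluster_2, merge_distance, new_cluster_size).

# Steps 4-5: cut the dendrogram to obtain a flat clustering with 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])   # roughly [20, 20]
```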


Figure 13.10: Hierarchical clustering using agglomerative method (figure not reproduced)

3.1.3 Divisive Hierarchical Clustering (Top-Down)

• Top-down hierarchical clustering, also known as divisive hierarchical clustering, is a clustering algorithm that starts with all data points in a single cluster and recursively divides clusters into smaller sub-clusters until each data point forms its own individual cluster or a specified number of clusters is reached.

• The divisive hierarchical clustering method uses a top-down strategy. The starting point is the largest cluster with all the objects in it, which is then split recursively to form smaller and smaller clusters, thus forming the hierarchy. The iterations end when the objects in the final clusters are sufficiently homogeneous to each other, when the final clusters contain only one object, or when the user-defined clustering condition is achieved.

• The divisive method starts at the top and at each level recursively splits one of the existing clusters at that level into two new clusters.

• If there are N observations in the dataset, the divisive method will also produce N − 1 levels in the hierarchy. The split is chosen to produce two new groups with the largest "between-group dissimilarity".

Here's a step-by-step explanation of the divisive hierarchical clustering algorithm (a small sketch of the splitting step follows):

• Step 1: Initialization
  Start with all data points as members of a single cluster. If you have N data points, you initially have one cluster with N members.

• Step 2: Split Clusters
  – Calculate the within-cluster variance or a similar measure for the current cluster. This measure represents the compactness of the cluster.
  – Divide the cluster into two sub-clusters in a way that minimizes the within-cluster variance. There are various methods to achieve this, such as running k-means or hierarchical clustering on the sub-cluster.
  – Repeat the splitting process recursively for each sub-cluster until you reach the desired number of clusters.

• Step 3: Dendrogram (Optional)
  While divisive hierarchical clustering often doesn't produce a dendrogram like agglomerative clustering, you can still record the hierarchy of cluster splits for analysis if needed.

• Step 4: Results
  The resulting clusters are obtained based on the recursive splits. Each cluster contains a set of data points that are similar to each other according to the splitting criteria.
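One common way to realize the splitting step is bisecting 2-means: repeatedly pick the largest (or least compact) cluster and split it in two with k-means. A minimal sketch, assuming scikit-learn is available for the 2-way split (the helper name bisecting_clustering and the toy data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clustering(X, n_clusters):
    """Divisive (top-down) clustering by repeated 2-means splits."""
    labels = np.zeros(len(X), dtype=int)   # Step 1: everything in one cluster
    while labels.max() + 1 < n_clusters:
        # Pick the cluster with the largest within-cluster sum of squares.
        sse = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in range(labels.max() + 1)]
        target = int(np.argmax(sse))
        # Step 2: split the chosen cluster into two with 2-means.
        idx = np.where(labels == target)[0]
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        labels[idx[sub == 1]] = labels.max() + 1   # one half keeps the old id
    return labels

# Toy usage: three separated blobs split into 3 clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (30, 2)) for m in (0, 4, 8)])
print(np.bincount(bisecting_clustering(X, 3)))   # roughly [30, 30, 30]
```

Recent versions of scikit-learn also provide a BisectingKMeans estimator built around the same idea.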


For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the divisive method as shown in Figure 13.11. Each nonterminal node has two daughter nodes. The two daughters represent the two groups resulting from the split of the parent.

Figure 13.11: Hierarchical clustering using divisive method (figure not reproduced)

In both these cases, it is important to select the split and merger points carefully, because the subsequent splits or mergers will use the result of the previous ones and there is no option to perform any object swapping between the clusters or rectify the decisions made in previous steps, which may result in poor clustering quality at the end.

3.2 Measures of dissimilarity

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, the dissimilarity between two groups of observations is measured by using an appropriate measure of distance between the groups of observations. The distance between two groups of observations is defined in terms of the distance between two observations. There are several ways in which the distance between two observations can be defined, and also several ways in which the distance between two groups of observations can be defined.

3.2.1 Measures of distance between data points

One of the core measures of proximity between clusters is the distance between them. There are four standard methods to measure the distance between clusters; their standard definitions are collected at the end of this subsection. Let Ci and Cj be two clusters with ni and nj points respectively, let pi and pj denote points in clusters Ci and Cj respectively, and denote the mean of cluster Ci by mi.

Distance measure in algorithmic methods

Often the distance measure is used to decide when to terminate the clustering algorithm. For example, in agglomerative clustering, the merging iterations may be stopped once the minimum distance between two neighboring clusters becomes less than a user-defined threshold.
So, when an algorithm uses the minimum distance Dmin to measure the distance between clusters, it is referred to as a nearest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on Dmin, it is called a single linkage algorithm.
On the other hand, when an algorithm uses the maximum distance Dmax to measure the distance between clusters, it is referred to as a furthest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on Dmax, it is called a complete linkage algorithm.
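For reference, the four standard measures mentioned above are usually defined as follows (restated here from the standard textbook definitions, since the original table is not reproduced), with pi in Ci, pj in Cj, and cluster means mi, mj:

D_{\min}(C_i, C_j) = \min_{p_i \in C_i,\; p_j \in C_j} \lVert p_i - p_j \rVert

D_{\max}(C_i, C_j) = \max_{p_i \in C_i,\; p_j \in C_j} \lVert p_i - p_j \rVert

D_{\mathrm{mean}}(C_i, C_j) = \lVert m_i - m_j \rVert

D_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p_i \in C_i} \sum_{p_j \in C_j} \lVert p_i - p_j \rVert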


As the minimum and maximum measures provide two extreme options for measuring the distance between clusters, they are prone to outliers and noisy data. Instead, using the mean and average distance helps avoid such problems and provides more consistent results.

3.3 Single linkage

Single linkage, also known as single-link clustering, is a hierarchical agglomerative clustering method used in unsupervised machine learning. It is one of the linkage methods used in agglomerative hierarchical clustering. Single linkage defines the distance between two clusters as the minimum distance between any two data points, one from each cluster.
How single linkage works:

1. Initialization: Start with each data point as an individual cluster. If you have N data points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance between two clusters is defined as the minimum distance between any two data points, one from each cluster.
3. Merge Clusters: Merge the two clusters with the shortest distance, as defined by single linkage. This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single cluster, or you reach a specified number of clusters.
5. Dendrogram: During the process, create a dendrogram, which is a tree-like structure that represents the hierarchy of clusters. It records the sequence of cluster mergers.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain.

Single linkage has some characteristics:

• It is sensitive to outliers and noise because a single close pair of points from different clusters can cause a merger.
• It tends to create elongated clusters, as it connects clusters based on single, nearest neighbors.
• It is fast and can handle large datasets, making it computationally efficient.
• Single linkage is just one of several linkage methods used in hierarchical clustering, each with its own strengths and weaknesses. The choice of linkage method depends on the nature of the data and the desired clustering outcome.

3.4 Multiple Linkage

Multiple linkage, also known as complete linkage or maximum linkage, is a hierarchical clustering method used in agglomerative hierarchical clustering. It defines the distance between two clusters as the maximum distance between any two data points, one from each cluster. In contrast to single linkage, which uses the minimum distance, complete linkage aims to minimize the maximum distance between any two data points within the merged clusters.
How multiple (complete) linkage works:

1. Initialization: Start with each data point as an individual cluster. If you have N data points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance between two clusters is defined as the maximum distance between any two data points, one from each cluster.
3. Merge Clusters: Merge the two clusters whose complete-linkage distance (that is, the maximum pairwise distance) is smallest. This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single cluster or you reach a specified number of clusters.
5. Dendrogram: Create a dendrogram to represent the hierarchy of cluster mergers. The dendrogram records the sequence of cluster merges.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain.

Multiple linkage has some characteristics:

• It tends to produce compact, spherical clusters since it minimizes the maximum distance within clusters.
• It is less sensitive to outliers than single linkage because it focuses on the maximum distance.
• It can handle elongated or irregularly shaped clusters effectively.

The choice of linkage method (single, complete, average, etc.) depends on the nature of the data and the desired clustering outcome. Different linkage methods can produce different cluster structures based on how distance is defined between clusters (a small numeric comparison follows).
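As a tiny numeric illustration with made-up one-dimensional points, the linkage definitions can give quite different distances for the same pair of clusters:

```python
import numpy as np

# Two hypothetical clusters of 1-D points.
A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 10.0])

# All pairwise distances between a point in A and a point in B.
pairwise = np.abs(A[:, None] - B[None, :])

print(pairwise.min())   # single linkage distance:   |3 - 4|  = 1.0
print(pairwise.max())   # complete linkage distance: |1 - 10| = 9.0
print(pairwise.mean())  # average linkage distance:  5.0
```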


3.5 Dimensionality Reduction

Dimensionality reduction is a technique used in machine learning and data analysis to reduce the number of features (dimensions) in a dataset. It involves transforming a high-dimensional dataset into a lower-dimensional representation while retaining the most relevant information. Dimensionality reduction is useful for several reasons:

• Curse of Dimensionality: High-dimensional data can suffer from the curse of dimensionality, where the data becomes sparse and noisy, making it challenging to analyze and model effectively. Reducing dimensionality can help mitigate this issue.

• Computation and Storage: High-dimensional data requires more computational resources and storage. Reducing dimensionality can make algorithms faster and more memory-efficient.

• Visualization: It is difficult to visualize data in more than three dimensions. Dimensionality reduction techniques can project data into a lower-dimensional space for visualization purposes.

• Feature Engineering: Dimensionality reduction can be a form of automated feature selection, helping to identify the most important features and reducing the risk of overfitting.

There are two main approaches to dimensionality reduction:

• Feature Selection: In this approach, you select a subset of the original features and discard the rest. This subset contains the most relevant features for the task at hand. Common methods for feature selection include correlation analysis, mutual information, and filter methods.

• Feature Extraction: In this approach, you create new features that are combinations or transformations of the original features. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are popular techniques for feature extraction. They find linear combinations of the original features that capture the most significant variation in the data.

Popular algorithms used for dimensionality reduction include principal component analysis (PCA). These algorithms seek to transform data from high-dimensional spaces to low-dimensional spaces without compromising meaningful properties in the original data. These techniques are typically deployed during exploratory data analysis (EDA) or data processing to prepare the data for modeling.
It's helpful to reduce the dimensionality of a dataset during EDA to help visualize data: this is because visualizing data in more than three dimensions is difficult. From a data processing perspective, reducing the dimensionality of the data simplifies the modeling problem.

3.6 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a widely used dimensionality reduction technique in the fields of statistics, machine learning, and data analysis. It aims to transform high-dimensional data into a lower-dimensional representation while retaining the most important information and patterns.
PCA achieves this by finding a set of orthogonal axes, known as principal components, in the high-dimensional data space. Each principal component is a linear combination of the original features.
Here's a step-by-step explanation of how PCA works (a short sketch in code follows the steps):

1. Data Standardization: Start with a dataset of high-dimensional data, where each column represents a feature or variable. Standardize the data by subtracting the mean and dividing by the standard deviation for each feature. This step ensures that all features have the same scale.

2. Covariance Matrix Calculation: Calculate the covariance matrix of the standardized data. The covariance matrix is a square matrix where each element represents the covariance between a pair of features.

3. Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix to extract eigenvalues and eigenvectors. The eigenvalues represent the amount of variance explained by each principal component. The eigenvectors represent the directions in the high-dimensional space that maximize the variance.

4. Sort Eigenvalues: Order the eigenvalues in descending order. The first eigenvalue corresponds to the most significant variance, the second eigenvalue to the second-most significant variance, and so on.

5. Select Principal Components: Decide how many principal components you want to retain in the lower-dimensional representation. This choice can be based on the proportion of explained variance or specific requirements for dimensionality reduction.

6. Projection: Use the selected principal components (eigenvectors) to transform the data. Each data point is projected onto the subspace defined by these principal components. This transformation results in a lower-dimensional representation of the data.

7. Variance Explained: Calculate the proportion of total variance explained by the retained principal components. This information can help you assess the quality of the dimensionality reduction.

8. Visualization and Analysis: Visualize and analyze the lower-dimensional data to gain insights, identify patterns, or facilitate further data analysis. Principal components can be interpreted to understand the relationships between features in the original data.

9. Inverse Transformation (Optional): If necessary, you can perform an inverse transformation to map the reduced-dimensional data back to the original high-dimensional space. However, some information may be lost in this process.

10. Application: Use the lower-dimensional data for various tasks, such as visualization, clustering, classification, or regression, with reduced computational complexity and noise.
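A minimal NumPy sketch of steps 1-6 on random illustrative data (in practice one would typically use sklearn.decomposition.PCA, which wraps the same computation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # illustrative data: 200 samples, 5 features

# Step 1: standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data (5 x 5).
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalue decomposition (eigh is used because C is symmetric).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: sort eigenvalues (and their eigenvectors) in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-6: keep the top k components and project the data onto them.
k = 2
X_reduced = Z @ eigvecs[:, :k]          # shape (200, 2)

# Step 7: proportion of total variance explained by the retained components.
print(eigvals[:k].sum() / eigvals.sum())
```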


PCA provides several benefits:

• Dimensionality Reduction: By selecting a subset of principal components, you can reduce the dimensionality of your data while retaining most of the variance. This is especially useful when dealing with high-dimensional data.

• Noise Reduction: PCA can help filter out noise in the data, leading to cleaner and more interpretable patterns.

• Visualization: PCA facilitates the visualization of data in lower dimensions, making it easier to understand and explore complex datasets.

• Feature Engineering: PCA can serve as a form of feature engineering, identifying the most important features for a given task.

References

• Lecture Notes in Machine Learning, by Dr V N Krishnachandran
• Machine Learning, by Amit Kumar Das, Saikat Dutt, and Subramanian Chandramouli
• https://alexjungaalto.github.io/MLBasicsBook.pdf
• Taeho Jo, Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning, Springer
• IIT Madras BS Degree Lectures and Notes
• NPTEL Lectures and Slides
• www.medium.com
• geeksforgeeks.org
• javatpoint.com
