ML Unsupervised Notes-New
Instructions:
• Kindly go through the lectures/videos on our website www.piyushwairale.com
• Read this study material carefully and make your own handwritten short notes. (Short notes must not be more than 5-6 pages.)
• Attempt the questions available on the portal.
• Revise this material at least 5 times, and once you have prepared your short notes, revise your short notes twice a week.
• If you are not able to understand any topic or need a detailed explanation, please mention it in our discussion forum on the website.
• Let me know if there are any typos or mistakes in the study materials. Mail me at piyushwairale100@gmail.com

• Unlike supervised learning, unsupervised machine learning models are given unlabeled data and allowed to discover patterns and insights without any explicit guidance or instruction.
• Unsupervised learning is a machine learning paradigm where training examples lack labels, and cluster prototypes are typically initialized randomly. The primary goal is to optimize these cluster prototypes based on similarities among the training examples.
• Unsupervised learning is a machine learning paradigm that deals with unlabeled data and aims to group similar data items into clusters. It differs from supervised learning, where labeled data is used for classification or regression tasks. Unsupervised learning has applications in text clustering and other domains and can be adapted for supervised learning when necessary.
• Unsupervised learning is the opposite of supervised learning. In supervised learning, training examples are labeled with output values, and the algorithms aim to minimize errors or misclassifications. In unsupervised learning, the focus is on maximizing similarities between cluster prototypes and data items.
• Unsupervised learning doesn't refer to a specific algorithm but rather a general framework. The process involves deciding on the number of clusters, initializing cluster prototypes, and iteratively assigning data items to clusters based on similarity. These clusters are then updated until convergence is achieved.
1.1 Working
As the name suggests, unsupervised learning uses self-learning algorithms—they learn with-
out any labels or prior training. Instead, the model is given raw, unlabeled data and has
to infer its own rules and structure the information based on similarities, differences, and
patterns without explicit instructions on how to work with each piece of data.
Unsupervised learning algorithms are better suited for more complex processing tasks,
such as organizing large datasets into clusters. They are useful for identifying previously
undetected patterns in data and can help identify features useful for categorizing data.
Imagine that you have a large dataset about weather. An unsupervised learning algo-
rithm will go through the data and identify patterns in the data points. For instance, it
might group data by temperature or similar weather patterns.
While the algorithm itself does not understand these patterns based on any previous infor-
mation you provided, you can then go through the data groupings and attempt to classify
them based on your understanding of the dataset. For instance, you might recognize that
the different temperature groups represent all four seasons or that the weather patterns are
separated into different types of weather, such as rain, sleet, or snow.
1. Clustering
Clustering is one of the most popular unsupervised machine learning approaches. There
are several types of unsupervised learning algorithms that are used for clustering, which
include exclusive, overlapping, hierarchical, and probabilistic.
2. Association
Association rule mining is a rule-based approach to reveal interesting relationships
between data points in large datasets. Unsupervised learning algorithms search for fre-
quent if-then associations—also called rules—to discover correlations and co-occurrences
within the data and the different connections between data objects.
It is most commonly used to analyze retail baskets or transactional datasets to represent
how often certain items are purchased together. These algorithms uncover customer
purchasing patterns and previously hidden relationships between products that help inform recommendation engines or other cross-selling opportunities. You might be most familiar with these rules from the "Frequently bought together" and "People who bought this item also bought" sections on your favorite online retail shop.
Association rules are also often used to organize medical datasets for clinical diagnoses. Using unsupervised machine learning and association rules can help doctors identify the probability of a specific diagnosis by comparing relationships between symptoms from past patient cases.
Typically, Apriori algorithms are the most widely used for association rule learning to identify related collections of items or sets of items. However, other types are used, such as the Eclat and FP-growth algorithms.
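To make the support and confidence idea behind such rules concrete, here is a small brute-force Python sketch (the toy baskets and the 0.5/0.6 thresholds are invented for illustration; this is not the Apriori pruning strategy itself, only the measures it relies on):

```python
from itertools import combinations

# Toy transaction database (hypothetical shopping baskets)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent pairs: support >= 0.5 (threshold chosen arbitrarily)
items = sorted(set().union(*transactions))
frequent_pairs = [p for p in combinations(items, 2) if support(p) >= 0.5]

# Rules A -> B with confidence = support(A and B) / support(A)
for a, b in frequent_pairs:
    conf = support({a, b}) / support({a})
    if conf >= 0.6:
        print(f"{a} -> {b}: support={support({a, b}):.2f}, confidence={conf:.2f}")
```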
3. Dimensionality reduction
Dimensionality reduction is an unsupervised learning technique that reduces the number of features, or dimensions, in a dataset. More data is generally better for machine learning, but it can also make it more challenging to visualize the data.
Dimensionality reduction extracts important features from the dataset, reducing the number of irrelevant or random features present. This method uses principal component analysis (PCA) and singular value decomposition (SVD) algorithms to reduce the number of data inputs without compromising the integrity of the properties in the original data.

1.3 Supervised learning vs. unsupervised learning
The main difference between supervised learning and unsupervised learning is the type of input data that you use. Unlike unsupervised machine learning algorithms, supervised learning relies on labeled training data to determine whether pattern recognition within a dataset is accurate.
The goals of supervised learning models are also predetermined, meaning that the type of output of a model is already known before the algorithms are applied. In other words, the input is mapped to the output based on the training data.

Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into classification and regression problems. | Unsupervised learning can be classified into clustering and association problems.
Supervised learning can be used for cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for cases where we have only input data and no corresponding output data.
A supervised learning model produces an accurate result. | An unsupervised learning model may give a less accurate result as compared to supervised learning.
Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point, and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.
It includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes algorithms such as Clustering, KNN, and the Apriori algorithm.
Hierarchical clustering is further subdivided into:
• Agglomerative clustering
• Divisive clustering
Partitioning clustering is further subdivided into:
• K-Means clustering
• Fuzzy C-Means clustering
There are many different fields where cluster analysis is used effectively, such as:
• Text data mining: this includes tasks such as text categorization, text clustering, document summarization, concept extraction, sentiment analysis, and entity relation modelling.
• Customer segmentation: creating clusters of customers on the basis of parameters such as demographics, financial conditions, buying habits, etc., which can be used by retailers and advertisers to promote their products to the correct segment.
• Anomaly checking: checking for anomalous behaviour such as fraudulent bank transactions, unauthorized computer intrusion, suspicious movements on a radar scanner, etc.
• Data mining: simplifying the data mining task by grouping a large number of features from an extremely large dataset to make the analysis manageable.

K-Means Clustering
• K-means aims to find cluster centers (centroids) and assign data points to the nearest centroid based on their similarity.
• K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
• Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes.
• A cluster refers to a collection of data points aggregated together because of certain similarities.
• K-means is a very simple to implement clustering algorithm that works by selecting k centroids initially, where k serves as the input to the algorithm and can be defined as the number of clusters required. The centroids serve as the center of each new cluster.
• We first assign each data point of the given data to the nearest centroid. Next, we calculate the mean of each cluster, and the means then serve as the new centroids. This step is repeated until the positions of the centroids do not change anymore.
• The goal of k-means is to minimize the sum of the squared distances between each data point and its respective cluster centroid:

J = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2

where K is the number of clusters, n_i is the number of data points assigned to cluster i, x_{ij} is the j-th data point of cluster i, and c_i is the centroid of cluster i.
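As a concrete illustration of this objective, here is a minimal NumPy sketch of the standard Lloyd iteration (random initialization from the data points, Euclidean distance, and the toy 2-D data are assumptions made only for this example):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm); empty-cluster handling is omitted."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    # Final assignment and the objective J defined above
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    J = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, J

# Tiny 2-D example with two well-separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids, J = kmeans(X, k=2)
print(labels, J)
```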
• The choice of the number of clusters (K) is a critical decision and often requires domain knowledge or experimentation. Various methods, such as the elbow method or silhouette score, can help in determining an optimal K value (a short sketch of both follows this list).
• K-Means is computationally efficient and works well for large datasets, but it may not
perform well on data with irregularly shaped or non-convex clusters.
• The algorithm may converge to a local minimum, and it’s not guaranteed to find the
global optimum.
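For the K-selection heuristics mentioned above, here is a rough scikit-learn sketch (the synthetic data and the candidate range 2 to 9 are illustrative assumptions, not part of the original notes):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three synthetic blobs in 2-D
X = np.vstack([rng.normal([0, 0], 1, (50, 2)),
               rng.normal([6, 6], 1, (50, 2)),
               rng.normal([0, 8], 1, (50, 2))])

for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: look for the k after which inertia stops dropping sharply.
    # Silhouette score: higher is better (maximum 1.0).
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```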
• Initialization: Choose the number of clusters (K) you want to create. Initialize K
cluster centroids randomly. These centroids can be selected from the data points or
with random values.
• Assignment Step: For each data point in the dataset, calculate the distance between
the data point and all K centroids. Assign the data point to the cluster associated with
the nearest centroid. This step groups data points into clusters based on similarity.
• Update Step: Recalculate the centroids for each cluster by taking the mean of all data points assigned to that cluster. The new centroids represent the center of their respective clusters.
• Results: The result of the K-Means algorithm is K clusters, each with its centroid. Data points are assigned to the nearest cluster, and you can use these clusters for various purposes, such as data analysis, segmentation, or pattern recognition.
Strength and Weakness of K-means (figure)
• The choice of linkage method and distance metric can significantly impact the results
and the structure of the dendrogram.
• Dendrograms are useful for visualizing the hierarchy and helping to decide how many
clusters are appropriate for a particular application.
3.1 Hierarchical Clustering Methods
There are two main hierarchical clustering methods: agglomerative clustering and divisive clustering.
Agglomerative clustering is a bottom-up technique which starts with individual objects as clusters and then iteratively merges them to form larger clusters. On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
In both these cases, it is important to select the split and merge points carefully, because the subsequent splits or mergers will use the results of the previous ones, and there is no option to swap objects between clusters or rectify decisions made in previous steps, which may result in poor clustering quality at the end.

3.1.1 Dendrogram
• Hierarchical clustering can be represented by a rooted binary tree. The nodes of the tree represent groups or clusters. The root node represents the entire data set. The terminal nodes each represent one of the individual observations (singleton clusters). Each nonterminal node has two daughter nodes.
• The distance between merged clusters is monotone increasing with the level of the merger. The height of each node above the level of the terminal nodes in the tree is proportional to the value of the distance between its two daughters (see Figure 13.9).
• The dendrogram may be drawn with the root node at the top and the branches growing vertically downwards (see Figure 13.8(a)).
• It may also be drawn with the root node at the left and the branches growing horizontally rightwards (see Figure 13.8(b)).
• Dendrograms are commonly used in computational biology to illustrate the clustering of genes or samples.

Example: Figure 13.7 is a dendrogram of the dataset {a, b, c, d, e}. Note that the root node represents the entire dataset and the terminal nodes represent the individual observations. However, dendrograms are usually presented in a simplified format in which only the terminal nodes (that is, the nodes representing the singleton clusters) are explicitly displayed. Figure 13.8 shows the simplified format of the dendrogram in Figure 13.7. Figure 13.9 shows the distances of the clusters at the various levels. Note that the clusters are at 4 levels. The distance between the clusters {a} and {b} is 15, between {c} and {d} is 7.5, between {c, d} and {e} is 15, and between {a, b} and {c, d, e} is 25.
Figure 13.9: A dendrogram of the dataset {a, b, c, d, e} showing the distances (heights) of the clusters at different levels
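The notes give only the merge heights, not the underlying pairwise distances, so the condensed distance matrix below is a hypothetical one chosen to reproduce those heights (7.5, 15, 15, 25) under single linkage; a rough SciPy sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["a", "b", "c", "d", "e"]
# Condensed distance vector in pdist order:
# (a,b) (a,c) (a,d) (a,e) (b,c) (b,d) (b,e) (c,d) (c,e) (d,e)
# Hypothetical values consistent with the merge heights quoted above.
d = np.array([15, 25, 25, 25, 25, 25, 25, 7.5, 15, 15])

Z = linkage(d, method="single")   # rows: (cluster_i, cluster_j, merge height, size)
print(Z[:, 2])                    # -> [ 7.5 15. 15. 25. ]

dendrogram(Z, labels=labels)      # root at the top, branches growing downwards
plt.show()
```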
3.1.2 Agglomerative Hierarchical Clustering (Bottom-Up)
The agglomerative hierarchical clustering method uses the bottom-up strategy. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the agglomerative method as shown in Figure 13.10. Each nonterminal node has two daughter nodes; the daughters represent the two groups that were merged to form the parent.
If there are N observations in the dataset, there will be N − 1 levels in the hierarchy. The pair chosen for merging consists of the two groups with the smallest "intergroup dissimilarity".
• Step 1: Initialization. Start with each data point as a separate cluster. If you have N data points, you initially have N clusters.
• Step 2: Cluster Distance. Calculate the pairwise distances between all clusters.
• Step 3: Merge Clusters. Merge the two closest clusters according to the chosen linkage criterion:
  – Single Linkage: Merge clusters based on the minimum distance between any pair of data points from the two clusters.
  – Complete Linkage: Merge clusters based on the maximum distance between any pair of data points from the two clusters.
  – Average Linkage: Merge clusters based on the average distance between data points in the two clusters.
  – Ward's Method: Minimize the increase in variance when merging clusters.
  Repeat this merging process iteratively until all data points are in a single cluster or until you reach the desired number of clusters.
• Step 4: Cutting the Dendrogram. To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain. The cut produces the final clusters at the chosen level of granularity.
• Step 5: Results. The resulting clusters are obtained based on the cut level. Each cluster contains a set of data points that are similar to each other according to the chosen linkage method.
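A rough scikit-learn sketch of the procedure above (the toy points and n_clusters=2 are illustrative; the linkage argument selects among the criteria listed in Step 3):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two small, well-separated groups of 2-D points (illustrative data)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

for method in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels = model.fit_predict(X)          # cluster index for each point
    print(method, labels)
```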
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the divisive method as shown in Figure 13.11. Each nonterminal node has two daughter nodes. The two daughters represent the two groups resulting from the split of the parent.
There are several ways in which the distance between two observations can be defined, and there are also several ways in which the distance between two groups of observations can be defined.

3.2.1 Measures of distance between data points
One of the core measures of proximity between clusters is the distance between them. There are four standard methods to measure the distance between clusters. Let Ci and Cj be two clusters with ni and nj points respectively, let pi and pj denote points in clusters Ci and Cj respectively, and let mi denote the mean (centroid) of cluster Ci.
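In this notation, the four standard measures are commonly defined as follows (these are the standard textbook definitions; the minimum/maximum/mean/average terminology matches the discussion that follows):

\begin{align*}
d_{\min}(C_i, C_j) &= \min_{p_i \in C_i,\; p_j \in C_j} \lVert p_i - p_j \rVert \\
d_{\max}(C_i, C_j) &= \max_{p_i \in C_i,\; p_j \in C_j} \lVert p_i - p_j \rVert \\
d_{\text{mean}}(C_i, C_j) &= \lVert m_i - m_j \rVert \\
d_{\text{avg}}(C_i, C_j) &= \frac{1}{n_i n_j} \sum_{p_i \in C_i} \sum_{p_j \in C_j} \lVert p_i - p_j \rVert
\end{align*}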
As the minimum and maximum measures provide two extreme options for measuring the distance between clusters, they are prone to outliers and noisy data. Instead, the use of the mean and average distance helps avoid such problems and provides more consistent results.

3.3 Single Linkage
Single linkage, also known as single-link clustering, is a hierarchical agglomerative clustering method used in unsupervised machine learning. It is one of the linkage methods used in agglomerative hierarchical clustering. Single linkage defines the distance between two clusters as the minimum distance between any two data points, one from each cluster.
Single linkage works as follows:
1. Initialization: Start with each data point as an individual cluster. If you have N data points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance between two clusters is defined as the minimum distance between any two data points, one from each cluster.
3. Merge Clusters: Merge the two clusters with the shortest distance, as defined by single linkage. This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single cluster, or you reach a specified number of clusters.
5. Dendrogram: During the process, create a dendrogram, which is a tree-like structure that represents the hierarchy of clusters. It records the sequence of cluster mergers.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain.
Single linkage has some characteristics:
• It is sensitive to outliers and noise because a single close pair of points from different clusters can cause a merger.
• It tends to create elongated clusters, as it connects clusters based on single, nearest neighbors.
• It is fast and can handle large datasets, making it computationally efficient.
• Single linkage is just one of several linkage methods used in hierarchical clustering, each with its own strengths and weaknesses. The choice of linkage method depends on the nature of the data and the desired clustering outcome.

3.4 Complete Linkage
Complete linkage, also known as maximum linkage, is a hierarchical clustering method used in agglomerative hierarchical clustering. It defines the distance between two clusters as the maximum distance between any two data points, one from each cluster. In contrast to single linkage, which uses the minimum distance, complete linkage merges the pair of clusters whose maximum pairwise distance is smallest, so it keeps the largest within-cluster distance small.
Complete linkage works as follows:
1. Initialization: Start with each data point as an individual cluster. If you have N data points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance between two clusters is defined as the maximum distance between any two data points, one from each cluster.
3. Merge Clusters: Merge the two clusters with the smallest complete-linkage distance (i.e., the smallest maximum pairwise distance). This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single cluster or you reach a specified number of clusters.
5. Dendrogram: Create a dendrogram to represent the hierarchy of cluster mergers. The dendrogram records the sequence of cluster merges.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain.
Complete linkage has some characteristics:
• It tends to produce compact, roughly spherical clusters, since it keeps the maximum within-cluster distance small.
• It is less sensitive to outliers than single linkage because it focuses on the maximum distance.
• It is less suited to elongated or irregularly shaped clusters, which single linkage handles better.
The choice of linkage method (single, complete, average, etc.) depends on the nature of the data and the desired clustering outcome. Different linkage methods can produce different cluster structures based on how distance is defined between clusters.
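To make the two definitions concrete, here is a small sketch that computes the single-, complete-, and average-linkage distances between two toy clusters directly (SciPy's cdist is assumed for the pairwise distance matrix):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters of 2-D points
Ci = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Cj = np.array([[4.0, 4.0], [5.0, 4.0], [9.0, 9.0]])

D = cdist(Ci, Cj)                     # all pairwise Euclidean distances
print("single   (min):", D.min())    # distance used by single linkage
print("complete (max):", D.max())    # distance used by complete linkage
print("average       :", D.mean())   # average-linkage distance
```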
• Feature Extraction: In this approach, you create new features that are combinations or transformations of the original features. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are popular techniques for feature extraction. They find linear combinations of the original features that capture the most significant variation in the data.
Popular algorithms used for dimensionality reduction include principal component analysis (PCA). These algorithms seek to transform data from high-dimensional spaces to low-dimensional spaces without compromising meaningful properties in the original data. These techniques are typically deployed during exploratory data analysis (EDA) or data processing to prepare the data for modeling.
It is helpful to reduce the dimensionality of a dataset during EDA to help visualize data: this is because visualizing data in more than three dimensions is difficult. From a data processing perspective, reducing the dimensionality of the data simplifies the modeling problem.

The main steps of PCA are as follows:
1. Standardize the Data: Center the features (and typically scale them to unit variance) so that each feature contributes comparably.
2. Covariance Matrix: Compute the covariance matrix of the standardized data to capture how the features vary together.
3. Eigendecomposition: Compute the eigenvectors and eigenvalues of the covariance matrix; the eigenvectors are the candidate principal components.
4. Rank Components: Sort the eigenvectors in decreasing order of their eigenvalues, so that the first components capture the most variance.
5. Select Principal Components: Decide how many principal components you want to retain in the lower-dimensional representation. This choice can be based on the proportion of explained variance or on specific requirements for dimensionality reduction.
6. Projection: Use the selected principal components (eigenvectors) to transform the data. Each data point is projected onto the subspace defined by these principal components. This transformation results in a lower-dimensional representation of the data.
7. Variance Explained: Calculate the proportion of total variance explained by the retained principal components. This information can help you assess the quality of the dimensionality reduction.
8. Visualization and Analysis: Visualize and analyze the lower-dimensional data to gain insights, identify patterns, or facilitate further data analysis. Principal components can be interpreted to understand the relationships between features in the original data.
9. Inverse Transformation (Optional): If necessary, you can perform an inverse transformation to map the reduced-dimensional data back to the original high-dimensional space. However, some information may be lost in this process.
10. Application: Use the lower-dimensional data for various tasks, such as visualization, clustering, classification, or regression, with reduced computational complexity and noise.

PCA provides several benefits:
• Dimensionality Reduction: By selecting a subset of principal components, you can reduce the dimensionality of your data while retaining most of the variance. This is especially useful when dealing with high-dimensional data.
• Noise Reduction: PCA can help filter out noise in the data, leading to cleaner and more interpretable patterns.
• Visualization: PCA facilitates the visualization of data in lower dimensions, making it easier to understand and explore complex datasets.
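A minimal NumPy sketch of the PCA pipeline described in the steps above (the toy data, the choice of two retained components, and the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # toy data: 100 samples, 5 features

# Steps 1-2: standardize, then form the covariance matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(Xs, rowvar=False)

# Steps 3-4: eigendecomposition, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)       # eigh: for a symmetric matrix
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-7: keep k components, project, report explained variance
k = 2
W = eigvecs[:, :k]
Z = Xs @ W                                 # lower-dimensional representation
explained = eigvals[:k].sum() / eigvals.sum()
print("explained variance ratio:", round(explained, 3))

# Step 9 (optional): approximate inverse transformation back to 5 dimensions
X_approx = Z @ W.T
```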
References
• Lecture Notes in Machine Learning, Dr V N Krishnachandran
• Machine Learning, Saikat Dutt, Subramanian Chandramouli, and Amit Kumar Das
• https://alexjungaalto.github.io/MLBasicsBook.pdf
• Taeho Jo, Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning, Springer
• IIT Madras BS Degree Lectures and Notes
• NPTEL Lectures and Slides
• www.medium.com
• geeksforgeeks.org
• javatpoint.com