Unsupervised Learning
• SYLLABUS: Unsupervised Learning: Introduction, Unsupervised vs Supervised
Learning, Application of Unsupervised Learning, Clustering K-Means, K-
Medoid, DBSCAN.
Unsupervised Learning: Introduction
• Unsupervised learning is a machine learning concept where the unlabelled and
unclassified information is analysed to discover hidden knowledge.
• The goal is to infer the natural structure present within a set of data points. Unlike
supervised learning, there are no predefined labels or outcomes associated with the
data.
• Unsupervised learning can be used to discover patterns, groupings, and associations
within the data.
• The objective of unsupervised learning is to observe only the features X1, X2, …, Xn; we are
not going to predict any outcome variable, but rather our intention is to find out the
associations between the features or their groupings to understand the nature of the
data.
Unsupervised Learning: Introduction
• To understand the concept of unsupervised learning, two methods need to be
explained - Clustering and Association Analysis.
• Clustering is a broad class of methods used for discovering unknown subgroups
in data, which is the most important concept in unsupervised learning.
• Association Analysis is a technique that captures the key patterns or
relationships (also called association rules) among features or items that tend to
occur together in the data.
Unsupervised Learning: Introduction
• Types of Unsupervised Learning
1. Clustering: Grouping data points into clusters based on their similarities.
2. Dimensionality Reduction: Reducing the number of features while
preserving the significant information.
3. Anomaly Detection: Identifying rare items or events which differ
significantly from the majority of the data.
4. Association: Discovering interesting relationships between variables in large
databases.
Unsupervised Learning: Introduction
• List of Unsupervised Learning Algorithms
1. Clustering Algorithms:
1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with
Noise)
4. Gaussian Mixture Models (GMM)
5. Mean Shift Clustering
Unsupervised Learning: Introduction
2. Dimensionality Reduction Algorithms:
1. Principal Component Analysis (PCA)
2. t-Distributed Stochastic Neighbour Embedding (t-SNE)
3. Linear Discriminant Analysis (LDA)
4. Independent Component Analysis (ICA)
5. Autoencoders
Unsupervised Learning: Introduction
3. Anomaly Detection Algorithms:
• Isolation Forest
• Local Outlier Factor (LOF)
• One-Class SVM
4. Association Rule Learning Algorithms:
1. Apriori Algorithm
2. Eclat Algorithm
APPLICATION OF UNSUPERVISED LEARNING
• There are many domains where unsupervised learning finds its
application.
• A few examples of such applications are as follows:
1. Segmentation of target consumer populations by an advertisement
consulting agency on the basis of few dimensions such as
demography, financial data, purchasing habits, etc.
2. Anomaly or fraud detection in the banking sector by identifying the
pattern of loan defaulters.
APPLICATION OF UNSUPERVISED LEARNING
3. Image processing and image segmentation such as face recognition,
expression identification, etc.
4. Grouping of important characteristics in genes to identify important
influencers in new areas of genetics.
5. Document Clustering: Grouping similar documents for organizing large
collections of text data.
• Chatbots, self-driving cars, and many other recent innovations are results
of the combination of unsupervised and supervised learning.
CLUSTERING
• Clustering is a method for sorting data into groups, called clusters, based
on their characteristics.
• The goal is to group items that are similar to each other within the same
cluster, while ensuring that items in different clusters are as different
from each other as possible.
• The effectiveness of clustering depends on how similar or related the
objects within a group are or how different or unrelated the objects in
different groups are from each other.
CLUSTERING
• For example, consider a set of customer data.
• Clustering might group customers with similar buying habits together.
• So, one cluster might be people who frequently buy sports equipment,
while another cluster might be people who mostly buy books.
• This helps in understanding distinct customer types and tailoring
marketing strategies accordingly.
Application areas of cluster analysis
• Customer segmentation: creating clusters of customers on the basis of
parameters such as demographics, financial conditions, buying habits, etc.
• Data mining: extracting useful information from large datasets, where similar
features or records are often grouped together.
• Text data mining: this includes tasks such as text categorization, text clustering,
document summarization, concept extraction, sentiment analysis, and entity
relation modelling.
• Anomaly detection: checking for anomalous behaviours such as fraudulent bank
transactions, unauthorized computer intrusions, suspicious movements on a radar
scanner, etc.
Application areas of cluster analysis
• The methods related to the machine learning task of clustering:
1. How clustering tasks differ from classification tasks and how clustering
defines groups.
2. A classic and easy-to-understand clustering algorithm, namely k-means,
which is used for clustering along with the k-medoids algorithm.
1. Clustering as a machine learning task
• Clustering is primarily motivated by discovery rather than by prediction.
• So, clustering is defined as an unsupervised machine learning task that
automatically divides the data into clusters or groups of similar items.
• In the clustering task the data inside a cluster should be very similar to
each other but very different from those outside the cluster.
• Using this principle, whenever a large set of diverse and varied data is
presented for analysis, clustering enables us to represent the data in a
smaller number of groups.
1. Clustering as a machine learning task
• Hence, the effectiveness of clustering task is measured by the
homogeneity within a group as well as the difference between distinct
groups.
1. Clustering as a machine learning task
• How do clustering tasks differ from classification tasks?
1. Labeled vs. Unlabeled Data:
Clustering is an unsupervised learning method. It works with unlabeled
data—there’s no predefined category or label assigned to the data points.
Classification is a supervised learning method. It works with labeled data—
each data point has a known category or label.
2. Objective:
• Clustering aims to discover natural groupings or patterns in the data without any
prior knowledge.
Classification aims to assign predefined labels to data points based on learned
characteristics.
1. Clustering as a machine learning task
3. Output:
• Clustering produces clusters, or groups of similar data points. The clusters
are often labeled arbitrarily (like Cluster 1, Cluster 2, etc.), as there’s no
predefined label.
• Classification produces specific class labels for each data point.
4. Evaluation:
• Clustering is harder to evaluate because there’s no ground truth; metrics like
silhouette score or intra-cluster distance are used.
• Classification can be evaluated with accuracy metrics (e.g., accuracy,
precision, recall, F1-score).
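• As an illustration of this difference in evaluation, the short Python sketch below (assuming scikit-learn is installed; the toy data set and parameter values are illustrative, not from the slides) computes the silhouette score of a k-means clustering:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data; the true labels are discarded because clustering is unsupervised
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Silhouette score ranges from -1 to 1; higher values mean tighter, better-separated clusters
print(silhouette_score(X, labels))
```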
Different types of clustering techniques
• The major clustering techniques are :
1. Partitioning methods,
2. Hierarchical methods and,
3. Density-based methods.
• The overall approach to creating the clusters is the same. However, the way the
quality of the clusters is measured, and where each method applies, differ.
• Table below summarizes the main characteristics of each method.
Different types of clustering techniques
Partitioning methods
• Two of the most important algorithms for partitioning-based clustering are k-
means and k-medoid.
• In the k-means algorithm, the prototype of each cluster is its centroid, which is
normally the mean of the group of points.
• Similarly, the k-medoid algorithm identifies the medoid which is the most
representative point for a group of points.
• 1. K-means - A centroid-based technique
• The principle of the k-means algorithm is to assign each of the ‘n’ data points to
one of the K clusters, where ‘K’ is a user-defined parameter specifying the number of
clusters desired.
Partitioning methods
• The objective is to maximize the homogeneity within the clusters and also
to maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the distance
between the objects or points in the data set.
• The simple K-means algorithm is given below:
• Step 1: Select K points in the data space and mark them as initial
centroids
K-means - A centroid-based technique
• loop
• Step 2: Assign each point in the data space to the nearest centroid to form K
clusters.
• Step 3: Measure the distance of each point in the cluster from the centroid.
• Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the
clusters.
• Step 5: Identify the new centroid of each cluster as the mean of the points assigned
to it.
• Step 6: Repeat Steps 2 to 5 to refine until centroids do not change.
• end loop
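• The following minimal NumPy sketch of the steps above is for illustration only; the function name, the random initialization, and the convergence check are assumptions, not part of the slides:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal K-means; X is an (n, d) array and k is the desired number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points from the data set as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 2-3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: Sum of Squared Error (SSE) of the current clustering
        sse = ((X - centroids[labels]) ** 2).sum()
        # Step 5: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels, sse

# Example usage on a tiny illustrative data set
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centroids, labels, sse = k_means(X, k=2)
```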
K-means - A centroid-based technique
• Strengths and Weaknesses of K-means algorithm
K-means - A centroid-based technique
• Types of K-Means
1. Standard K-Means: The basic form of the algorithm described above.
2. K-Means++: An improved initialization technique that spreads out the
initial centroids, reducing the chances of poor clustering results.
3. Mini-Batch K-Means: A variation that processes small random batches of
the dataset, which improves efficiency and scalability for large datasets.
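• A brief scikit-learn sketch contrasting these variants (assuming scikit-learn is available; the toy data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(1000, 2))  # toy data, illustrative only

# Standard K-Means with purely random initial centroids
km_random = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)

# K-Means++ initialization spreads out the initial centroids (scikit-learn's default)
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

# Mini-Batch K-Means fits on small random batches, trading a little accuracy for speed
km_mb = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=10, random_state=0).fit(X)

# inertia_ is the within-cluster sum of squared distances of each fitted model
print(km_random.inertia_, km_pp.inertia_, km_mb.inertia_)
```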
K-means - A centroid-based technique
• Example
• Consider a set of data points as shown in
figure.
• Apply k-means algorithm to find out the
clusters generated from the given data set.
• Let us fix K = 4, implying that we want to
create four clusters out of this data set.
• As the first step, we assign four random
points from the data set as the centroids, as
represented by the * signs.
K-means - A centroid-based technique
• Assign the data points to the nearest centroid
to create four clusters.
• In the second step, on the basis of the distance
of the points from the corresponding centroids,
the centroids are updated and points are
reassigned to the updated centroids.
• After three iterations, we find that the centroids are no longer moving, so there
is no further scope for refinement, and thus the k-means algorithm terminates.
K-means - A centroid-based technique
• How to select the correct number of clusters?
• The performance of the K-means clustering algorithm depends on the quality of the
clusters it forms.
• But choosing the optimal number of clusters is a big task.
• There are different ways to find the optimal number of clusters:
• 1. For a small data set, sometimes a rule of thumb that is followed is:
• K is set as the square root of n/2 for a data set of n examples.
• But unfortunately, this thumb rule does not work well for large data sets.
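• For instance, for a small data set of n = 200 examples, the rule of thumb gives K = √(200/2) = √100 = 10.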
K-means - A centroid-based technique
• 2. Elbow method
• This method measures the homogeneity or heterogeneity within the clusters for
various values of ‘K’ and helps in arriving at the optimal ‘K’.
• This method uses the concept of the WCSS value. WCSS stands for Within
Cluster Sum of Squares, which defines the total variation within a cluster.
• The formula to calculate the value of WCSS (for 3 clusters) is given below:
• WCSS = ∑(Pi in Cluster 1) dist(Pi, C1)² + ∑(Pi in Cluster 2) dist(Pi, C2)² + ∑(Pi in Cluster 3) dist(Pi, C3)²
K-means - A centroid-based technique
• It is the sum of the square of the distances between each data point and
its centroid within a cluster.
• To measure the distance between data points and centroid, we can use
Euclidean distance formula.
• To find the optimal value of clusters, the elbow method follows the below
steps:
• It executes K-means clustering on a given dataset for different K
values (ranging from 1 to 10).
• For each value of K, calculate the WCSS value.
K-means - A centroid-based technique
• A curve is plotted between the calculated WCSS values and the number of clusters K.
• The sharp point of bend in the plot, where the curve looks like an arm, is
considered the best value of K.
• The graph shows the sharp bend,
which looks like an elbow, hence it is
known as the elbow method.
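• A possible Python sketch of the elbow method using scikit-learn and matplotlib (the toy data and the K range are illustrative; `inertia_` holds the WCSS of a fitted model):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data, illustrative only

wcss = []
k_values = range(1, 11)  # K values ranging from 1 to 10, as described above
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares (WCSS)

plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()  # the sharp bend ('elbow') of this curve suggests the best K
```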
K-means - A centroid-based technique
• Example :
• Suppose we have two variables M1
and M2. The x-y axis scatter plot of
these two variables is given in
Figure.
• The number of clusters is fixed at K = 2, i.e., these data points need to be
grouped into two different clusters, as shown in the figure.
K-means - A centroid-based technique
Real-Time Applications of K-means algorithm
• Customer Segmentation
• Image Compression
• Document Clustering
• Market Segmentation
• Anomaly Detection
K-Medoids: a representative object-based technique
• In the k-means algorithm, it is a common practice to run the algorithm
multiple times with different initial cluster centers to identify the optimal clusters.
• The necessity to set the initial ‘K’ values is perceived as a disadvantage of the k-
means algorithm.
• The k-means algorithm is also sensitive to outliers in the data set and may
inadvertently produce skewed clusters.
• k-medoids provides a solution to the above problems.
• Instead of considering the mean of the data points in the cluster, k-medoids
considers k representative data points from the existing points in the data set
as the centers of the clusters.
K-Medoids: a representative object-based technique
• It then assigns the data points according to their distance from these centers to
form k clusters.
• A medoid is defined as the object of a cluster, whose average dissimilarity to all
the objects in the cluster is minimal.
• K-Medoid Algorithm:
1. Initialization:
• Choose k initial medoids randomly from the dataset.
2. Assign Points to Medoids:
• Assign each data point to the nearest medoid based on a chosen distance
metric (e.g., Euclidean distance, Manhattan distance).
K-Medoids: a representative object-based technique
3. Update Medoids:
• For each cluster, find the data point within the cluster that minimizes the total distance to
all other points in that cluster. This point becomes the new medoid.
4. Reassign Points:
• Reassign points to the nearest medoid based on the updated medoids.
5. Repeat:
• Repeat steps 3 and 4 until the medoids stabilize (no changes) or a stopping criterion is met
(e.g., maximum iterations).
6. Output:
• The final medoids and their respective clusters.
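• A minimal Python sketch of these steps (a Voronoi-style iteration, not the full PAM swap procedure); the function names, the Manhattan distance choice, and the random initialization are assumptions for illustration:

```python
import numpy as np

def manhattan(a, b):
    """Manhattan distance |x1 - x2| + |y1 - y2| between two points."""
    return np.abs(np.asarray(a) - np.asarray(b)).sum()

def k_medoids(X, k, max_iters=100, seed=0):
    """Minimal k-medoids following the steps above; X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)  # step 1: random initial medoids
    for _ in range(max_iters):
        # steps 2 and 4: assign every point to its nearest medoid
        dists = np.array([[manhattan(x, X[m]) for m in medoid_idx] for x in X])
        labels = dists.argmin(axis=1)
        # step 3: in each cluster, pick the member minimizing total distance to the other members
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            costs = [sum(manhattan(X[i], X[m]) for m in members) for i in members]
            new_idx[j] = members[int(np.argmin(costs))]
        # step 5: stop once the medoids no longer change
        if set(new_idx) == set(medoid_idx):
            break
        medoid_idx = new_idx
    return X[medoid_idx], labels
```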
K-Medoids: a representative object-based technique
• Example K-Medoids Clustering Algorithm
1. Take an example data set (shown below)
2. Apply the K-Medoids clustering algorithm to form two clusters
3. Use Manhattan distance to find the distance between each data point and the medoids
I X Y
X1 2 6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6
Example K-Medoids Clustering Algorithm
• STEP 1:
1. Select two medoids randomly
2. C1 = (3, 4), C2 = (7, 4)
3. (x1, y1) and (x2, y2) are data points
4. Manhattan distance formula = |x1-x2| + |y1-y2|
5. Manhattan distance C1 [(2,6), (3,4)] = |2-3| + |6-4| = 3
6. Manhattan distance C2 [(2,6), (7,4)] = |2-7| + |6-4| = 7
7. Since C1 < C2, the data point (2, 6) will be assigned to cluster C1.
8. Similarly, we compute the distances for all points up to X10.
Example K-Medoids Clustering Algorithm
I X Y C1 C2 CLUSTER
X1 2 6 3 7 C1
X2 3 4 0 4 C1
X3 3 8 4 8 C1
X4 4 7 4 6 C1
X5 6 2 5 3 C2
X6 6 4 3 1 C2
X7 7 3 5 1 C2
X8 7 4 4 0 C2
X9 8 5 6 2 C2
X10 7 6 6 2 C2
The distances are calculated with respect to the Cluster 1 medoid (3,4) and the Cluster 2
medoid (7,4); each point is assigned to the cluster whose medoid gives the minimum distance.
Example K-Medoids Clustering Algorithm
Step 9. Clusters are
a. C1 = {(2,6), (3,4), (3,8), (4,7)}
b. C2 = {(6,2), (6,4), (7,3), (7,4), (8,5), (7,6)}
c. Calculate the Cost (C, X) = ∑i |ci − xi|
d. Total Cost = Cost((3,4), (2,6)) + Cost((3,4), (3,4)) + Cost((3,4), (3,8)) +
Cost((3,4), (4,7)) + Cost((7,4), (6,2)) + Cost((7,4), (6,4)) + Cost((7,4), (7,3))
+ Cost((7,4), (7,4)) + Cost((7,4), (8,5)) + Cost((7,4), (7,6))
Total cost = 3 + 0 + 4 + 4 + 3 + 1 + 1 + 0 + 2 + 2 = 20
Example K-Medoids Clustering Algorithm
• Step 10
1. Swap one of the medoids with a non-medoid point
2. New medoids are C1 = (3, 4), P = (7, 3)
a. Here the new medoids are C1 and P
b. The C1 values are the same; only the second medoid has changed
3. Previous medoids were C1 = (3, 4), C2 = (7, 4)
4. (x1, y1) and (x2, y2) are data points
5. Manhattan distance formula = |x1-x2| + |y1-y2|
a. For example, with x1 = 2, y1 = 6, x2 = 7, y2 = 3; similarly, each data point is
substituted into the formula.
6. Manhattan distance [(2,6), (7,3)] = |2-7| + |6-3| = 8
a. Manhattan distance [(3,4), (7,3)] = |3-7| + |4-3| = 5
b. Manhattan distance [(3,8), (7,3)] = |3-7| + |8-3| = 9
c. Similarly, we compute the distances for all points up to X10.
Example K-Medoids Clustering Algorithm
I X Y C1 P CLUSTER
X1 2 6 3 8 C1
X2 3 4 0 5 C1
X3 3 8 4 9 C1
X4 4 7 4 7 C1
X5 6 2 5 2 P
X6 6 4 3 2 P
X7 7 3 5 0 P
X8 7 4 4 1 P
X9 8 5 6 3 P
X10 7 6 6 3 P
Example K-Medoids Clustering Algorithm
Step 11. New clusters are
a. C1 = {(2,6), (3,4), (3,8), (4,7)}
b. P = {(6,2), (6,4), (7,3), (7,4), (8,5), (7,6)}
c. Calculate the Cost (C, X) = ∑i |ci − xi|
d. Total Cost = Cost((3,4), (2,6)) + Cost((3,4), (3,4)) + Cost((3,4), (3,8)) +
Cost((3,4), (4,7)) + Cost((7,3), (6,2)) + Cost((7,3), (6,4)) + Cost((7,3), (7,3))
+ Cost((7,3), (7,4)) + Cost((7,3), (8,5)) + Cost((7,3), (7,6))
Current Total Cost = 3 + 0 + 4 + 4 + 2 + 2 + 0 + 1 + 3 + 3 = 22
Example K-Medoids Clustering Algorithm
• STEP 12:
1. Cost of swapping medoid C2 with P
2. S = Current Total Cost − Previous Total Cost = 22 − 20 = 2 > 0
• In the k-medoids clustering algorithm, we find the Current Total Cost = 22 and the
Previous Total Cost = 20.
• Since the current total cost is greater than the previous total cost (S > 0),
replacing C2 with P is not a good idea, so the final medoids remain
C1 = (3, 4) and C2 = (7, 4).
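• The small Python snippet below reproduces this cost comparison for the data set above; it is a sketch for verification only:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def total_cost(medoids):
    # each point contributes its Manhattan distance to the nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost([(3, 4), (7, 4)]))  # previous medoids C1, C2 -> 20
print(total_cost([(3, 4), (7, 3)]))  # after swapping C2 with P -> 22, so the swap is rejected
```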
DBSCAN clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular
clustering algorithm designed to identify clusters of varying shapes and sizes within
a dataset.
• It relies on density criteria to form clusters, distinguishing itself by effectively
handling noise (outliers).
• How It Works
1. Parameters:
o Epsilon (ε): The maximum distance between two points for them to be
considered neighbors.
o MinPts: The minimum number of points required to form a dense region (a
cluster).
DBSCAN clustering
2. Core Points, Border Points, and Noise:
o Core Point: A point is a core point if it has at least MinPts points within
its ε-neighborhood.
o Border Point: A point is a border point if it has fewer than MinPts points
within its ε-neighborhood but is within the ε-neighborhood of a core point.
o Noise: Points that are neither core points nor border points are classified as
noise.
DBSCAN clustering
Algorithm Steps:
1. Randomly select a point.
2. If the point is a core point, create a new cluster. Expand the cluster by adding
all points that are density-reachable from the core point (directly or indirectly
connected via other core points).
3. If the point is a border point and not part of any cluster, it is assigned to the
nearest cluster.
4. Continue until all points are processed.
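• A short scikit-learn sketch of DBSCAN (assuming scikit-learn is installed; the two-moons data set and the ε/MinPts values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape k-means handles poorly but DBSCAN separates well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps = ε radius, min_samples = MinPts
labels = db.labels_                         # cluster index per point; -1 marks noise

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
```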
DBSCAN clustering
• Advantages
1. Ability to Find Arbitrarily Shaped Clusters: Unlike k-means, DBSCAN can
find clusters of arbitrary shapes and sizes, making it suitable for more complex
datasets.
2. No Need to Specify Number of Clusters: Unlike k-means, DBSCAN does not
require the number of clusters to be specified beforehand.
3. Robust to Noise: Effectively identifies outliers as noise points, improving the
quality of clustering results.
4. Minimal Parameters: Only requires two parameters (ε and MinPts),
simplifying the model tuning process.
DBSCAN clustering
• Limitations
1. Difficulty in Parameter Selection:
o Choosing appropriate values for ε and MinPts can be challenging and may
require domain knowledge or experimentation.
2. Inefficiency with Varying Density:
• Struggles with datasets containing clusters of varying densities, as the same
ε and MinPts values may not work well for all clusters.
DBSCAN clustering
3. Computational Complexity:
o DBSCAN has a time complexity of O(n log n) with spatial indexing
structures, but without such optimizations it can be O(n²), making
it less efficient for very large datasets.
4. High-Dimensional Data:
o The performance of DBSCAN degrades with high-dimensional
data because distance metrics become less meaningful in high-
dimensional spaces, leading to poor clustering results.
DBSCAN clustering
• Real-Time Applications
1. Geospatial Data Analysis:
o Application: Identifying geographical clusters, such as hotspots of disease
outbreaks or crime areas.
o Example: Analyzing spatial distribution of earthquakes to detect regions
with high seismic activity.
2. Market Segmentation:
o Application: Clustering customers based on purchasing behavior to identify
distinct market segments.
o Example: Segmenting customers of an e-commerce platform based on their
browsing and purchasing patterns.
DBSCAN clustering
3. Anomaly Detection:
o Application: Detecting anomalies in data, such as fraud detection in
financial transactions or identifying network intrusions.
o Example: Identifying unusual patterns in network traffic that may
indicate security breaches.
4. Image Processing:
o Application: Clustering similar images or regions within images for tasks
like image segmentation.
o Example: Grouping pixels in satellite images to identify different land cover
types.