What is Clustering?
Clustering is an unsupervised machine learning technique that groups similar data points
together based on a chosen similarity or distance measure. Unlike classification, clustering
does not require labeled data. It is commonly used in pattern recognition, anomaly
detection, and data segmentation.
Key Characteristics of Clustering
1. Unsupervised Learning: No predefined labels or categories.
2. Similarity-Based Grouping: Groups data points based on similarity measures like
Euclidean distance, Cosine similarity, or Jaccard similarity.
3. Cluster Formation: Each group (cluster) should have high intra-cluster similarity and
low inter-cluster similarity.
4. Applications: Market segmentation, anomaly detection, image recognition,
document clustering, etc.
Types of Clustering Algorithms
1. Partition-Based Clustering
• Divides data into K clusters by optimizing a criterion (e.g., minimizing intra-cluster
distance).
• Example Algorithm: K-Means Clustering
• Strengths: Simple, efficient for large datasets.
• Limitations: Assumes spherical clusters, sensitive to initial cluster selection.
2. Hierarchical Clustering
• Builds a tree-like structure (dendrogram) to show how data points are grouped.
• Types:
o Agglomerative (Bottom-Up): Starts with individual points and merges them.
o Divisive (Top-Down): Starts with all points in one cluster and splits them
iteratively.
• Strengths: No need to predefine the number of clusters.
• Limitations: Computationally expensive for large datasets.
3. Density-Based Clustering
• Forms clusters based on data density, useful for irregularly shaped clusters.
• Example Algorithm: DBSCAN (Density-Based Spatial Clustering of Applications with
Noise)
• Strengths: Can detect clusters of arbitrary shape, robust to noise.
• Limitations: Struggles with varying density in data.
4. Grid-Based Clustering
• Divides the data space into grids and clusters based on density within each grid.
• Example Algorithm: STING (Statistical Information Grid-based Clustering)
• Strengths: Efficient for large datasets.
• Limitations: Grid size affects accuracy.
5. Model-Based Clustering
• Assumes data is generated by a mixture of probabilistic distributions.
• Example Algorithm: Gaussian Mixture Models (GMM)
• Strengths: Works well for complex data distributions.
• Limitations: Requires assumptions about data distribution.
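As a minimal illustration of model-based clustering, the sketch below fits a Gaussian Mixture Model with scikit-learn; the synthetic blob data and the choice of three components are assumptions made only for this example.
```python
# Minimal sketch of model-based clustering with a Gaussian Mixture Model.
# The synthetic blobs and n_components=3 are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
labels = gmm.fit_predict(X)      # fit the mixture, then return hard cluster labels
probs = gmm.predict_proba(X)     # soft (probabilistic) cluster memberships

print(labels[:10])
print(np.round(probs[:3], 3))
```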
Comparison of Clustering Methods
Clustering Method | Strengths | Limitations
K-Means | Fast, scalable | Sensitive to outliers, assumes spherical clusters
Hierarchical Clustering | No need to predefine clusters | Computationally expensive
DBSCAN | Detects arbitrarily shaped clusters | Struggles with varying density
Grid-Based | Efficient for large data | Accuracy depends on grid size
GMM | Handles complex distributions | Requires probabilistic assumptions
Applications of Clustering
1. Customer Segmentation: Grouping customers based on buying behavior.
2. Anomaly Detection: Identifying fraud in financial transactions.
3. Image Segmentation: Grouping similar pixels for object detection.
4. Document Clustering: Organizing news articles or research papers.
K-Means Clustering
1. Overview
K-Means is a partition-based clustering algorithm that groups data points into K clusters by
minimizing intra-cluster variance. It is widely used due to its simplicity, speed, and
effectiveness in clustering large datasets.
2. How K-Means Works
Step-by-Step Process
1. Select K: Choose the number of clusters K.
2. Initialize Centroids: Randomly select K initial cluster centers (centroids).
3. Assign Data Points: Each data point is assigned to the nearest centroid using
Euclidean distance or other distance metrics.
4. Update Centroids: Recalculate the centroids by taking the mean of all points in each
cluster.
5. Repeat: Steps 3 and 4 are repeated until the centroids no longer change
(convergence) or a stopping condition is met.
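A minimal NumPy sketch of these steps, using toy two-dimensional data and K = 3 (both assumptions chosen only for illustration):
```python
# Minimal NumPy sketch of the K-Means (Lloyd's algorithm) steps above.
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data: three shifted groups (an assumption for this illustration).
X = rng.normal(size=(300, 2)) + rng.choice([-6.0, 0.0, 6.0], size=(300, 1))
K = 3

# Step 2: initialize centroids by picking K distinct random points.
centroids = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):                                   # Step 5: iterate
    # Step 3: assign each point to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its assigned points
    # (keep the old centroid if a cluster happens to become empty).
    new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    if np.allclose(new_centroids, centroids):          # convergence check
        break
    centroids = new_centroids

print(centroids)
```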
3. Key Features
• Unsupervised Learning: Does not require labeled data.
• Iterative Optimization: Uses Lloyd's algorithm to iteratively minimize within-cluster variance.
• Sensitive to Initial Centroid Selection: Different initializations may result in different
cluster formations.
• Fast and Scalable: Efficient for large datasets.
4. Distance Metric Used
The most common metric is Euclidean Distance:
d(p, c) = \sqrt{\sum_{i=1}^{n} (p_i - c_i)^2}
where:
• p = data point
• c = cluster centroid
Other metrics like Manhattan Distance or Cosine Similarity can also be used based on the
problem type.
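A short sketch of these metrics using SciPy; the point and centroid values are made up purely to show the calls:
```python
# Illustrative distance computations between a point p and a centroid c.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

p = np.array([1.0, 2.0, 3.0])   # hypothetical data point
c = np.array([2.0, 0.0, 3.0])   # hypothetical cluster centroid

print(euclidean(p, c))    # Euclidean: sqrt(sum((p_i - c_i)^2))
print(cityblock(p, c))    # Manhattan: sum(|p_i - c_i|)
print(1 - cosine(p, c))   # cosine similarity (SciPy returns cosine *distance*)
```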
5. Choosing the Optimal Number of Clusters (K)
• Elbow Method: Plots the sum of squared errors (SSE) for different values of K. The
"elbow" point indicates the optimal K.
• Silhouette Score: Measures how well a data point fits within its cluster.
• Gap Statistic: Compares clustering performance to random distributions.
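A sketch of the Elbow Method and Silhouette Score with scikit-learn; the synthetic data and the candidate range 2 to 9 are assumptions for illustration:
```python
# Sketch: compare SSE (elbow) and silhouette scores across candidate K values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse = km.inertia_                        # sum of squared errors (SSE / WCSS)
    sil = silhouette_score(X, km.labels_)    # higher is better
    print(f"K={k}: SSE={sse:.1f}, silhouette={sil:.3f}")
# Look for the "elbow" in SSE and the peak silhouette score.
```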
6. Advantages & Disadvantages
Advantages | Disadvantages
Simple and easy to implement | Requires predefining K
Fast and scalable for large datasets | Sensitive to outliers
Works well with well-separated clusters | Struggles with non-spherical clusters
7. Applications of K-Means
1. Customer Segmentation (e.g., grouping online shoppers based on behavior).
2. Image Compression (reducing colors in an image).
3. Anomaly Detection (detecting fraudulent transactions).
4. Document Clustering (categorizing news articles).
Agglomerative Hierarchical Clustering (AHC)
1. Overview
Agglomerative Hierarchical Clustering (AHC) is a bottom-up clustering method where each
data point starts as its own cluster and gradually merges with others until a single cluster is
formed. It builds a tree-like structure called a dendrogram, which helps visualize the
hierarchy of clusters.
2. How Agglomerative Clustering Works
Step-by-Step Process
1. Each Data Point as a Cluster: Initially, each data point is treated as an individual
cluster.
2. Compute Distances: Calculate pairwise distances between all clusters using a
distance metric (e.g., Euclidean, Manhattan, or Cosine similarity).
3. Merge Closest Clusters: The two clusters with the smallest distance are merged into
a new cluster.
4. Repeat: Steps 2 and 3 continue until all data points are merged into a single cluster.
5. Dendrogram Analysis: The final output is a dendrogram, which can be cut at a
specific level to define the desired number of clusters.
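A minimal SciPy sketch of this process; the toy data, the 'average' linkage, and the cut into 3 clusters are assumptions chosen only for illustration:
```python
# Sketch of agglomerative clustering: build the merge history, cut the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Toy data: three well-separated 2-D groups (illustrative assumption).
X = np.vstack([rng.normal(loc=m, size=(20, 2)) for m in (0, 5, 10)])

Z = linkage(X, method="average", metric="euclidean")   # pairwise merges (steps 2-4)
labels = fcluster(Z, t=3, criterion="maxclust")        # step 5: cut into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) can be plotted to inspect the hierarchy.
```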
3. Linkage Criteria (Cluster Merging Strategies)
The way clusters are merged depends on the linkage criterion, which determines the
distance between clusters:
Linkage Type | Description | Best For
Single Linkage | Merges clusters based on the smallest distance between points in two clusters. | Long, chain-like clusters
Complete Linkage | Merges clusters based on the largest distance between points in two clusters. | Compact, well-separated clusters
Average Linkage | Uses the average distance between all points in two clusters. | Balanced clustering
Centroid Linkage | Uses the distance between centroids of two clusters. | Clusters of similar size
4. Distance Metrics Used
• Euclidean Distance: Measures straight-line distance.
• Manhattan Distance: Measures distance along axes.
• Cosine Similarity: Measures angle between vectors (useful for text data).
5. Advantages & Disadvantages
Advantages | Disadvantages
No need to specify the number of clusters | Computationally expensive for large datasets
Produces a dendrogram for better visualization | Sensitive to noisy data and outliers
Works well with non-spherical clusters | Difficult to decide the best level to cut the dendrogram
6. Applications of Agglomerative Clustering
• Biology: Genetic and protein sequence analysis.
• Marketing: Customer segmentation for targeted advertising.
• Social Networks: Identifying groups of similar users.
• Image Processing: Grouping similar objects in images.
Cluster Evaluation: Overview
1. What is Cluster Evaluation?
Cluster evaluation measures the quality and effectiveness of a clustering algorithm. Since
clustering is an unsupervised learning technique, evaluation helps determine how well the
clusters represent the data structure.
2. Types of Cluster Evaluation Methods
A. Internal Evaluation Metrics (No Ground Truth Required)
These metrics assess the clustering quality without requiring labeled data.
Metric | Description | Best For
Silhouette Score | Measures how well each point fits within its cluster vs. the nearest cluster. | Compact, well-separated clusters
Dunn Index | Ratio of the smallest inter-cluster distance to the largest intra-cluster distance. | Identifying well-separated clusters
Davies-Bouldin Index (DBI) | Measures the similarity between clusters (lower is better). | Evaluating compact and distinct clusters
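A sketch of internal evaluation in scikit-learn; the blob data and K = 3 are assumptions for illustration (the Dunn Index has no scikit-learn implementation and would need custom code):
```python
# Internal evaluation sketch: no ground-truth labels required.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("Silhouette:", silhouette_score(X, labels))            # higher is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))    # lower is better
```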
B. External Evaluation Metrics (Ground Truth Required)
Used when true labels (actual cluster assignments) are available for comparison.
Metric | Description | Best For
Rand Index (RI) | Compares predicted vs. true clusters based on agreements and disagreements. | Measuring clustering accuracy
Adjusted Rand Index (ARI) | Adjusts RI for chance grouping of points. | Balanced cluster evaluation
Normalized Mutual Information (NMI) | Measures mutual dependence between predicted and true clusters. | Text and document clustering
Fowlkes-Mallows Index (FMI) | Measures precision and recall between clusters. | Evaluating true positives in clustering
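A sketch of external evaluation with scikit-learn; the two label vectors below are made up purely to demonstrate the metric calls:
```python
# External evaluation sketch: compares predicted labels against known labels.
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             fowlkes_mallows_score)

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical ground truth
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # hypothetical clustering output

print("ARI:", adjusted_rand_score(true_labels, pred_labels))
print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
print("FMI:", fowlkes_mallows_score(true_labels, pred_labels))
```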
C. Relative Evaluation Metrics (Comparing Different Models)
These metrics compare different clustering results on the same dataset.
Method | Description
Elbow Method | Uses the Sum of Squared Errors (SSE) to find the optimal number of clusters.
Gap Statistic | Compares clustering performance to a random reference model.
Silhouette Analysis | Helps determine the optimal number of clusters by checking cluster separation.
3. Challenges in Cluster Evaluation
• Defining "Good" Clusters: No single metric defines perfect clustering.
• Handling Noise & Outliers: Outliers can affect evaluation scores.
• Choosing the Right Metric: Different algorithms and datasets require different
evaluation methods.
4. Applications of Cluster Evaluation
• Customer Segmentation: Ensuring clusters accurately represent customer groups.
• Anomaly Detection: Identifying true vs. false anomalies.
• Genomic Data Analysis: Evaluating gene expression clusters.
Unsupervised Cluster Evaluation Using Cohesion and Separation
1. What is Cluster Evaluation in Unsupervised Learning?
Since clustering is an unsupervised learning method, there are no predefined labels to
validate the results. To assess the quality of clusters, we use cohesion and separation
measures. These metrics ensure that points within a cluster are similar (cohesion) and
clusters are well-separated from each other (separation).
2. Key Metrics for Cohesion and Separation
A. Cohesion (Intra-cluster Similarity)
Cohesion measures how tightly packed the points are within a cluster.
• A lower cohesion value indicates that points within a cluster are close to each other,
meaning a compact cluster.
• Formula:
Cohesion = \sum_{i \in C_k} d(i, c_k)
where:
o d(i, c_k) is the distance between a point i and the cluster centroid c_k.
o C_k is the set of points in cluster k.
Common Metrics for Measuring Cohesion
1. Sum of Squared Errors (SSE): Measures the sum of squared distances between each
point and its cluster center.
2. Within-Cluster Sum of Squares (WCSS): Used in the Elbow Method to find the
optimal number of clusters.
3. Compactness Measure: Smaller distances within a cluster indicate better cohesion.
B. Separation (Inter-cluster Difference)
Separation measures how far apart different clusters are from each other.
• A higher separation value indicates that clusters are well-distinguished.
• Formula:
Separation = \sum_{k=1}^{K} d(c_k, c_m), \quad k \neq m
where d(c_k, c_m) is the distance between the centroids of different clusters k and m.
Common Metrics for Measuring Separation
1. Between-Cluster Sum of Squares (BCSS): Measures how far apart cluster centroids
are.
2. Dunn Index: Ratio of the smallest inter-cluster distance to the largest intra-cluster
distance.
3. Davies-Bouldin Index (DBI): Measures the ratio of intra-cluster distances to inter-
cluster distances (lower is better).
3. Combining Cohesion and Separation
Silhouette Score
The Silhouette Score combines both cohesion and separation into a single metric. It is
calculated as:
S = \frac{b - a}{\max(a, b)}
where:
• a = average distance of a point to other points within the same cluster (cohesion).
• b = average distance of a point to points in the nearest neighboring cluster (separation).
• S ranges from -1 to 1:
o S ≈ 1 → well-clustered data.
o S ≈ 0 → overlapping clusters.
o S ≈ -1 → misclassified points.
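A hand computation of the silhouette value for a single point, mirroring the formula above; the tiny two-cluster dataset is an assumption made only for illustration:
```python
# Manual silhouette value for one point: a = cohesion term, b = separation term.
import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9]])   # the point's own cluster
cluster_b = np.array([[5.0, 5.0], [5.5, 4.8]])               # nearest other cluster
point = cluster_a[0]

a = np.mean([np.linalg.norm(point - q) for q in cluster_a[1:]])  # intra-cluster distance
b = np.mean([np.linalg.norm(point - q) for q in cluster_b])      # distance to nearest cluster
s = (b - a) / max(a, b)
print(a, b, s)   # s close to 1 means the point is well clustered
```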
4. Practical Applications of Cohesion & Separation
• Customer Segmentation: Ensuring compact, well-separated customer groups.
• Image Segmentation: Ensuring objects in an image are properly clustered.
• Anomaly Detection: Detecting unusual patterns in financial transactions.
Unsupervised Cluster Evaluation Using Cohesion and Separation with a Proximity Matrix
1. What is a Proximity Matrix?
A proximity matrix is a square matrix that stores the pairwise distances (or similarities)
between data points. It is essential for evaluating cohesion (intra-cluster similarity) and
separation (inter-cluster dissimilarity) in clustering.
• Distance-based Proximity Matrix → Uses Euclidean, Manhattan, or other distance
metrics.
• Similarity-based Proximity Matrix → Uses Cosine similarity or correlation.
2. Cohesion (Intra-cluster Similarity) Using a Proximity Matrix
Cohesion measures how close the points within a cluster are to each other.
Calculation Steps
1. Extract intra-cluster distances from the proximity matrix.
2. Compute the average intra-cluster distance for each cluster.
3. Sum over all clusters to get overall cohesion.
Formula
Cohesion = \frac{1}{|C_k|} \sum_{i, j \in C_k} P(i, j)
where:
• P(i, j) is the proximity value (distance or similarity) between points i and j.
• C_k is cluster k.
Example (Proximity Matrix for a 4-Point Cluster)
P1 P2 P3 P4
P1 0 2 3 5
P2 2 0 1 4
P3 3 1 0 2
P4 5 4 2 0
• The sum of the intra-cluster pairwise distances is 2 + 3 + 5 + 1 + 4 + 2 = 17.
• With |C_k| = 4 points, the cohesion is 17 / 4 = 4.25.
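The same cohesion value can be reproduced directly from the proximity matrix; a minimal NumPy sketch following the formula above:
```python
# Cohesion of the 4-point cluster: sum of pairwise distances divided by |C_k|.
import numpy as np

P = np.array([[0, 2, 3, 5],
              [2, 0, 1, 4],
              [3, 1, 0, 2],
              [5, 4, 2, 0]], dtype=float)

pair_sum = P[np.triu_indices_from(P, k=1)].sum()   # 2+3+5+1+4+2 = 17
cohesion = pair_sum / P.shape[0]                   # 17 / 4 = 4.25
print(pair_sum, cohesion)
```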
3. Separation (Inter-cluster Dissimilarity) Using a Proximity Matrix
Separation measures how far apart different clusters are.
Calculation Steps
1. Extract inter-cluster distances from the proximity matrix.
2. Compute the average distance between centroids of different clusters.
3. Sum over all cluster pairs.
Formula
Separation = \frac{1}{K(K-1)} \sum_{k \neq m} d(C_k, C_m)
where d(C_k, C_m) is the average distance between clusters k and m.
Example (Inter-cluster Proximity Matrix)
C1 C2 C3
C1 0 8 10
C2 8 0 12
C3 10 12 0
• Average separation = (8 + 10 + 12) / 3 = 10
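The same average can be checked from the inter-cluster matrix above with a few lines of NumPy:
```python
# Average separation: mean of the distinct cluster-pair distances (8, 10, 12).
import numpy as np

D = np.array([[0, 8, 10],
              [8, 0, 12],
              [10, 12, 0]], dtype=float)

pair_dists = D[np.triu_indices_from(D, k=1)]   # [8, 10, 12]
print(pair_dists.mean())                       # 10.0
```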
4. Combining Cohesion and Separation for Evaluation
A good clustering result has:
• Low cohesion → clusters are compact.
• High separation → clusters are well-separated.
Silhouette Score Using Proximity Matrix
S = \frac{b - a}{\max(a, b)}
where:
• a = average intra-cluster proximity (cohesion).
• b = average nearest inter-cluster proximity (separation).
If S > 0.5, the clustering is considered good.
5. Applications of Proximity Matrix-Based Cluster Evaluation
• Customer Segmentation: Finding compact, well-separated groups.
• Fraud Detection: Distinguishing normal and fraudulent transactions.
• Document Clustering: Grouping similar documents together.
Scalable Clustering Algorithms
1. What is Scalable Clustering?
Scalable clustering algorithms are designed to handle large datasets efficiently by reducing
computational complexity and memory usage. These algorithms must:
• Work with big data (millions or billions of points).
• Handle high-dimensional spaces efficiently.
• Minimize memory and time complexity.
• Support parallel or distributed computing for faster execution.
2. Types of Scalable Clustering Algorithms
A. Partitioning-Based Algorithms (Fast & Efficient)
These methods divide data into k clusters and refine them iteratively.
1. K-Means++ (Improved K-Means)
• Selects better initial centroids to avoid poor clustering.
• Time Complexity: O(nk) (linear in the number of points).
• Scalability: Efficient for millions of points but struggles with very high-dimensional
data.
2. Mini-Batch K-Means
• Uses random mini-batches instead of the full dataset.
• Time Complexity: O(n) (faster than standard K-Means).
• Scalability: Suitable for streaming data and large-scale clustering.
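A sketch of Mini-Batch K-Means on a larger synthetic dataset; the sample size, K = 10, and batch size are assumptions chosen only for illustration:
```python
# Mini-Batch K-Means sketch: fits on random mini-batches instead of the full data.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=10, random_state=0)

mbk = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)      # mini-batch updates happen internally
print(mbk.inertia_)              # final SSE of the fitted model
```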
B. Density-Based Algorithms (Detects Arbitrary-Shaped Clusters)
Focus on areas of high density and separate noise.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Finds clusters of varying shapes based on density.
• Time Complexity: O(n log n).
• Scalability: Limited for very large datasets but great for outlier detection.
4. HDBSCAN (Hierarchical DBSCAN)
• An improved version of DBSCAN that handles variable density clusters.
• Time Complexity: O(n log n).
• Scalability: Works well for large-scale clustering in complex datasets.
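A density-based sketch using scikit-learn's DBSCAN (HDBSCAN requires a separate package or a recent scikit-learn version); the moon-shaped data, eps, and min_samples are illustrative assumptions that normally need tuning per dataset:
```python
# DBSCAN sketch: finds arbitrarily shaped clusters and marks noise as -1.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                             # label -1 = noise point
print(np.unique(labels, return_counts=True))    # clusters found and their sizes
```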
C. Hierarchical Clustering Algorithms (Scalable with Approximation)
5. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
• Uses a tree structure (CF Tree) for clustering large datasets efficiently.
• Time Complexity: O(n) (linear).
• Scalability: Can process millions of data points efficiently.
6. CURE (Clustering Using Representatives)
• Uses a subset of representative points to define clusters.
• Time Complexity: O(n log n).
• Scalability: Handles large datasets with non-spherical clusters well.
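A sketch of BIRCH on a larger synthetic dataset (CURE has no scikit-learn implementation); the threshold and number of clusters are assumptions for illustration:
```python
# BIRCH sketch: compresses the data into a CF Tree, then clusters the tree nodes.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50_000, centers=5, random_state=3)

birch = Birch(threshold=0.5, n_clusters=5)   # threshold controls CF subcluster size
labels = birch.fit_predict(X)
print(labels[:10])
```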
D. Grid-Based Clustering (Divides Data into a Grid)
7. STING (Statistical Information Grid)
• Uses a grid structure instead of raw data points.
• Time Complexity: O(n) (fast).
• Scalability: Good for high-dimensional and large datasets.
8. CLIQUE (Clustering in High-Dimensional Space)
• Works well in high-dimensional datasets (e.g., gene expression).
• Time Complexity: O(n).
• Scalability: Handles big data in high dimensions.
E. Streaming & Distributed Clustering
9. Streaming K-Means (Online K-Means)
• Continuously updates clusters with new data points.
• Time Complexity: O(n) (linear).
• Scalability: Best for real-time clustering (e.g., IoT, finance).
10. Apache Spark MLlib Clustering
• Supports parallel clustering for big data.
• Algorithms include K-Means, Bisecting K-Means, and Gaussian Mixture Models (GMM).
• Scalability: Used for terabyte-scale clustering in cloud computing.
3. Choosing the Right Scalable Clustering Algorithm
Algorithm | Data Size | Shape of Clusters | Noise Handling | Speed
K-Means++ | Large | Spherical | No | Fast
Mini-Batch K-Means | Very Large | Spherical | No | Very Fast
DBSCAN | Medium | Arbitrary | Yes | Medium
HDBSCAN | Large | Arbitrary | Yes | Medium
BIRCH | Large | Hierarchical | No | Fast
CURE | Large | Arbitrary | Yes | Medium
STING | Very Large | Grid | No | Very Fast
CLIQUE | Very Large | Grid & High-Dimensional | Yes | Fast
Streaming K-Means | Continuous Data | Spherical | No | Fast
Spark MLlib | Extremely Large | Any | Yes | Very Fast
4. Applications of Scalable Clustering
• Customer Segmentation (Millions of customers → Mini-Batch K-Means)
• Anomaly Detection (Cybersecurity logs → DBSCAN)
• Genomic Data Analysis (High-dimensional gene data → CLIQUE)
• Real-time Traffic Analysis (Streaming IoT data → Streaming K-Means)
• Big Data Analytics (Cloud computing → Spark MLlib)