Introduction to Machine Learning
- Prof. Balaraman Ravindran | IIT Madras
Problem Solving Session (Week-10)
Shreya Bansal
PMRF PhD Scholar
IIT Ropar
Week-10 Contents
1. Partitional Clustering (K-means)
2. Hierarchical Clustering (Agglomerative)
3. BIRCH Algorithm
4. CURE Algorithm
5. Density-Based Clustering (DBSCAN)
What is Clustering?
● Definition: Grouping similar data points together.
● Goal: Maximize intra-cluster similarity.
Minimize inter-cluster similarity.
● Formal Problem: Partition n data points into k clusters.
● Challenge: Number of possible partitions is huge
(combinatorial).
Applications of Clustering
● Data Mining: Categorization of unlabeled data.
● Pre-processing: Reduce large datasets to representative
points (e.g., 10M → 10k clusters).
● Visualization: Understand structure in high-dimensional
feature spaces.
● Anomaly Detection: Identify outliers (e.g., fraud detection).
● Example:
● Customer segmentation: Group users by behavior.
Clustering Algorithms Overview
● Three Main Approaches:
○ Partitional (e.g., K-means, K-medoids).
○ Hierarchical (e.g., Agglomerative).
○ Density-Based (e.g., DBSCAN).
● Key Difference:
● Partitional methods search directly for the final k-partition; hierarchical methods build a nested hierarchy of clusters; density-based methods grow clusters from dense regions.
K-Means Clustering
Steps:
1. Initialize k centroids (randomly or heuristically).
2. Assign each point to the nearest centroid.
3. Recompute centroids as cluster means.
4. Repeat until convergence.
Pros: Simple, fast.
Cons: Sensitive to initialization; assumes spherical clusters.
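As a concrete illustration of the four steps above, here is a minimal NumPy sketch of Lloyd's K-means iteration (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means (Lloyd's algorithm) on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```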
K-Means Initialization Matters!
● Problem: Random seeds can lead to poor local optima.
● Example: Bad initialization → uneven clusters.
● Solution: K-means++ (smart seeding).
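A minimal sketch of the K-means++ seeding idea mentioned above: each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far (names are illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-means++ seeding: spread the initial centroids out over the data."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                 # first seed: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest already-chosen seed
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                   axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()                             # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```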
K-Medoids and PAM
● K-Medoids:
○ Uses actual data points as centroids (medoids).
○ Robust to outliers (median-like behavior).
● PAM (Partitioning Around Medoids):
○ Swap medoids with non-medoids.
○ Keep swaps that improve cluster quality.
● Trade-off: Computationally expensive (O(n²)).
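A rough sketch of one PAM swap pass, assuming a precomputed pairwise distance matrix D; the O(n²) cost comes from evaluating many medoid/non-medoid swaps (function names are illustrative):

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam_swap_pass(D, medoids):
    """Try swapping each medoid with each non-medoid; keep swaps that lower the cost."""
    medoids = list(medoids)
    best_cost = total_cost(D, medoids)
    for i in range(len(medoids)):
        for h in range(D.shape[0]):
            if h in medoids:
                continue
            candidate = medoids.copy()
            candidate[i] = h                      # tentative swap: medoid i -> point h
            cost = total_cost(D, candidate)
            if cost < best_cost:                  # keep only improving swaps
                best_cost, medoids = cost, candidate
    return medoids, best_cost
```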
How to Choose k?
● Methods:
● Domain Knowledge: Predefined k (e.g., 5
customer segments).
● Elbow Method: Plot k vs. distortion (sum
of squared distances).
● Pick k at the "elbow" of the curve.
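One common way to draw the elbow plot, sketched here with scikit-learn's KMeans and its inertia_ attribute (the sum of squared distances); the data is a random placeholder:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))       # placeholder data

ks = range(1, 11)
distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in ks]                               # inertia_ = sum of squared distances

plt.plot(list(ks), distortions, marker="o")
plt.xlabel("k")
plt.ylabel("distortion (SSE)")
plt.title("Elbow method: pick k where the curve bends")
plt.show()
```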
Cluster Evaluation Metrics
● Diameter: Max/Avg pairwise distance in a cluster.
● Radius: Avg distance to centroid.
● Purity: % of dominant class in a cluster (if labels exist).
● Rand Index: Agreement with reference clustering.
● Formula:
● Purity = (1/N) · Σ_clusters ( max_class |cluster ∩ class| )
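A small sketch of the purity computation above (the example reproduces Question 1 from the assignment section below):

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Purity = (1/N) * sum over clusters of the size of the dominant class."""
    N = len(class_labels)
    total = 0
    for c in set(cluster_labels):
        members = [class_labels[i] for i, lab in enumerate(cluster_labels) if lab == c]
        total += Counter(members).most_common(1)[0][1]    # count of the dominant class
    return total / N

# One cluster with 30 of class A, 15 of class B, 5 of class C -> purity = 30/50 = 0.6
print(purity([0] * 50, ["A"] * 30 + ["B"] * 15 + ["C"] * 5))
```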
Limitations of Clustering
● Ill-Posed Problem: No unique "correct" clustering.
● Distance Metrics: Euclidean fails for categorical data.
● Scalability: PAM is slow for large n.
● Workarounds:
● Use Jaccard similarity for categorical data.
● Sample data before clustering.
Example - K-Means
● Given the following 2D data points, perform K-means
clustering with K=2. Use Euclidean distance and show the first
two iterations.
● Data Points: A(1,2), B(1.5,1.8), C(5,8), D(8,8), E(1,0.5), F(9,11)
● Initial Centroids (chosen randomly):
● μ1 =(1,2) (Point A)
● μ2 =(5,8) (Point C)
Example - K-Means
● Iteration 1
● Step 1: Assign points to the nearest centroid.
● Calculate the Euclidean distance (d) from each point to both centroids:
Example - K-Means
● Cluster Assignments: Cluster 1: {A, B, E}; Cluster 2: {C, D, F}
● Step 2: Update Centroids
● Recalculate centroids as the mean of assigned points: μ1 = (1.17, 1.43), μ2 = (7.33, 9.00)
Example - K-Means
● Iteration 2
● Step 1: Reassign points to the new centroids.
Example - K-Means
Cluster Assignments:
Cluster 1: A, B, E
Cluster 2: C, D, F
Step 2: Check Convergence
Centroids remain unchanged (no reassignments).
The algorithm has converged.
Example - K-Means
Final Result
Clusters:
Cluster 1 (Red): A(1,2), B(1.5,1.8), E(1,0.5)
Cluster 2 (Blue): C(5,8), D(8,8), F(9,11)
Final Centroids:
μ1 =(1.17,1.43)
μ2 =(7.33,9.00)
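The worked example can be checked with scikit-learn by fixing the same initial centroids; this is a verification sketch, not part of the original slides:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.5], [9, 11]])   # A, B, C, D, E, F
init = np.array([[1, 2], [5, 8]])                  # mu1 = A, mu2 = C, as above

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # A, B, E in one cluster; C, D, F in the other
print(km.cluster_centers_)   # approx (1.17, 1.43) and (7.33, 9.00)
```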
Hierarchical Clustering
● Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters.
● It starts with each data point as an individual cluster and successively merges the closest clusters until a single cluster remains.
Steps in Hierarchical Clustering
1. Start with each data point as a separate cluster.
2. Compute distances between clusters (initially between
individual points).
3. Merge the closest clusters.
4. Repeat until only one cluster remains.
5. Generate a dendrogram to visualize the merging process.
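The same steps can be run with SciPy's agglomerative routines; a minimal sketch on placeholder data (the "single" method corresponds to the single-link distance discussed next):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))   # placeholder data

# Steps 1-4: agglomerative merging with single-link (nearest-pair) distances
Z = linkage(X, method="single", metric="euclidean")

# Step 5: dendrogram showing the merge order and merge distances
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()
```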
Distance Measures
● There are multiple ways to measure the distance between clusters:
● Single Link Clustering:
○ Distance is defined by the closest pair of points between
clusters.
○ May result in long, chain-like clusters.
● Complete Link Clustering:
○ Distance is defined by the farthest pair of points between
clusters.
○ Tends to produce compact, well-separated clusters.
● Average Link Clustering:
○ Distance is the average of all pairwise distances between
points in two clusters.
Distance Measures
Measuring Distance Between Clusters
● Centroid-Based Distance:
○ Distance is measured between the centroids of
two clusters.
● Radius-Based Distance:
○ Clusters are merged based on the radius of the
combined cluster.
● Diameter-Based Distance:
○ Clusters are merged based on the diameter of the
combined cluster.
Distance Metrics for Data Points
● The distance measure between individual data points
depends on the type of data and can be:
● Euclidean Distance
● Manhattan Distance
● Jaccard Similarity
● Cosine Similarity
● Other domain-specific distance measures
Dendrograms
● A dendrogram is a tree-like diagram that illustrates the order
and levels at which clusters were merged.
● The height at which two clusters are joined represents the
distance between them.
● To determine the final clusters, the dendrogram is cut at a
chosen threshold.
Pros and Cons of Hierarchical Clustering
● Advantages of Hierarchical Clustering
● ✅ No need to specify the number of clusters K in advance.
● ✅ Provides a complete hierarchical decomposition of the
dataset.
● ✅ Can visualize the clustering process using a dendrogram.
● Disadvantages of Hierarchical Clustering
● ❌ Computationally expensive for large datasets.
● ❌ Sensitive to noise and outliers.
● ❌ Merging decisions are irreversible.
Choosing the Number of Clusters
● The knee method or thresholding the dendrogram can help
determine the optimal number of clusters.
● Look for a large jump in merging distances to decide where
to cut the dendrogram.
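A sketch of that idea: inspect the merge distances recorded by SciPy's linkage, find the largest jump, and cut there with fcluster (placeholder data; a heuristic, not a definitive rule):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # placeholder data
Z = linkage(X, method="single")

merge_dists = Z[:, 2]                               # distance at which each merge happened
jumps = np.diff(merge_dists)
threshold = merge_dists[np.argmax(jumps)]           # cut just before the largest jump

labels = fcluster(Z, t=threshold, criterion="distance")
print(labels)                                       # cluster id for each point
```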
Example : Hierarchical Clustering
● Given the following dataset with five points:
● Perform hierarchical clustering using the Euclidean distance
and single linkage.
Example : Hierarchical Clustering
● Solution Steps
● Compute the pairwise Euclidean distances:
● Example: distance between A and B = √((x_A − x_B)² + (y_A − y_B)²) = 1.41 (used in Step 1 below).
● Create an initial distance matrix (distances between all points are computed).
● Find the closest pair: A and B are closest, so merge them into one cluster.
● Recompute the distance matrix:
● Using single linkage, the distance between clusters is the minimum of all
pairwise distances.
● Repeat until only one cluster remains.
Example : Hierarchical Clustering
Step 1: d(A, B) = 1.41 is the smallest distance → merge A and B into cluster C1.
Step 2: Update the distance matrix.
Example : Hierarchical Clustering
Step 3: d(D, E) = 2.24 is now the smallest distance → merge D and E into cluster C2.
Step 4: Update the distance matrix.
Example : Hierarchical Clustering
Step 5: d(C1, C) = 3.16 is now the smallest distance → merge C1 and C into cluster C3.
Step 6: Update the distance matrix.
Step 7: Merge the remaining clusters C3 and C2 into the root cluster C4.
Final hierarchy (dendrogram):
          (C4)
         /    \
     (C3)      (C2)
     /  \      /  \
  (C1)   C    D    E
  /  \
 A    B
BIRCH Clustering for Large Datasets
● Balanced Iterative Reducing and
Clustering using Hierarchies
● Scalable clustering for massive
data
● Two-phase approach: CF-Tree
(Clustering Features) +
Refinement
● Memory-efficient hierarchical
clustering
Why BIRCH?
● Challenges with Large Data:
○ K-means requires multiple passes (expensive for disk/network I/O).
○ Hierarchical methods are O(n²) in memory.
● BIRCH Solution:
○ Phase 1: Build a CF-Tree (summarize data into tight subclusters).
○ Phase 2: Refine clusters using the CF-Tree’s summaries.
● Key Idea: Reduce n to k representatives (e.g., 1M → 10K).
Clustering Feature (CF) Vector
● Definition: A tuple summarizing a cluster: CF=(N, LS ,SS)
○ N: Number of points in the cluster.
○ LS : Linear Sum of points (vector).
○ SS: Sum of Squares (scalar).
● Example:
● For a cluster with points (1,2) and (3,4):
○ CF = (2, (1+3, 2+4), (1² + 2² + 3² + 4²)) = (2, (4, 6), 30)
● Properties:
● CFs are additive (merge clusters by adding CFs).
● Enable computing the centroid (LS/N), radius, and diameter of a cluster.
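A small sketch of a CF vector with the additive merge and the derived centroid/radius, reproducing the (1,2), (3,4) example above (class and method names are illustrative):

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS); CFs are additive, so clusters merge by adding fields."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), float(p @ p)

    def merge(self, other):
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS
        return self

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # sqrt of the average squared distance of the points from the centroid
        c = self.centroid()
        return np.sqrt(max(self.SS / self.N - c @ c, 0.0))

cf = CF((1, 2)).merge(CF((3, 4)))
print(cf.N, cf.LS, cf.SS)     # 2 [4. 6.] 30.0
print(cf.centroid())          # [2. 3.]
```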
CF-Tree Construction
● Structure:
○ Leaf Nodes: Store CF entries (subclusters).
○ Non-Leaf Nodes: Store CFs summarizing child nodes.
● Algorithm Steps:
○ Insert a point into the closest CF in the leaf (based on
centroid/diameter threshold).
○ If leaf exceeds max entries (e.g., 10), split leaf and propagate CFs
upward.
○ Repeat until all points are processed.
Phase 1 Example (CF-Tree Build)
● Data Points: (1,2),(1.5,1.8),(5,8),(8,8),(9,11)
● Threshold: Max diameter = 3.0
● Key Insight: Points are grouped into tight subclusters
dynamically.
Phase 2: Global Clustering
● Input: CF-Tree leaf entries (subcluster centroids).
● Process:
○ Run any clustering algorithm (e.g., K-means, hierarchical) on the centroids.
○ Assign original data points to the nearest final centroid (1 additional pass).
● Advantages:
○ Scalable: Works with summaries, not raw data.
○ Flexible: Choose any clustering method for refinement.
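Scikit-learn's Birch follows the same two-phase idea; a minimal sketch on synthetic data (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch, KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(500, 2))
               for c in [(0, 0), (5, 5), (9, 0)]])        # synthetic data, 3 blobs

# Phase 1: build the CF-tree; 'threshold' bounds the size of each subcluster
birch = Birch(threshold=0.8, branching_factor=50, n_clusters=None).fit(X)
print(len(birch.subcluster_centers_), "subclusters summarize", len(X), "points")

# Phase 2: refine by clustering the subcluster centroids (here with K-means)
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(birch.subcluster_centers_)
labels = final.predict(X)     # one extra pass assigns the original points
```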
BIRCH vs. K-means
Limitations of BIRCH
● Order Sensitivity: Early points influence CF-Tree structure.
● Threshold Tuning: Diameter threshold impacts cluster granularity.
● Non-Spherical Clusters: Struggles with arbitrary shapes (like
K-means).
● Workaround:
● Use BIRCH for initial reduction, then DBSCAN for refinement.
CURE: Clustering Using Representatives
1. Handles non-convex clusters
2. Sampling-based approach for scalability
3. Shrinkage-based outlier robustness
Why CURE?
● Limitations of Traditional Methods:
● K-means: Only convex clusters (spherical shapes).
● Hierarchical Methods: O(n²) complexity → infeasible for large n.
● CURE’s Solution:
● Sampling: Work with a memory-friendly subset of data.
● Representative Points: Capture cluster boundaries (not just
centroids).
● Shrinkage: Mitigate outlier influence.
Key Steps of CURE
● Sampling: Randomly partition data into subsets (fit in memory).
● Initial Clustering: Run hierarchical clustering on each subset.
● Representative Points:
● For each cluster, pick m farthest points from centroid.
● Shrink them toward centroid by factor α (e.g., α=0.3).
● Reassignment: Assign points to the closest representative.
● Merge: Combine subsets’ representatives and recluster.
Representative Points Selection
● Process:
○ Compute centroid μ of a cluster.
○ Find farthest point p1 from μ.
○ Find farthest point p2 from p1 .
○ Repeat for m points.
○ Shrink: Move each pi toward μ by α×d(pi,μ)
● Example:
○ Cluster points: (1,1), (1,2), (5,5), (6,6).
○ Centroid μ = (3.25, 3.5).
○ Farthest point: p1 = (6,6).
○ Shrunk point (α = 0.2):
○ p1′ = (6 − 0.2×(6 − 3.25), 6 − 0.2×(6 − 3.5)) ≈ (5.45, 5.5)
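A sketch of the farthest-first selection and shrinkage above; with m = 1 and α = 0.2 it reproduces p1′ ≈ (5.45, 5.5) (function name is illustrative):

```python
import numpy as np

def cure_representatives(points, m, alpha):
    """Pick m well-scattered points (farthest-first), then shrink them toward the centroid."""
    P = np.asarray(points, dtype=float)
    mu = P.mean(axis=0)
    reps = [P[np.argmax(np.linalg.norm(P - mu, axis=1))]]     # p1: farthest from centroid
    while len(reps) < m:
        d = np.min([np.linalg.norm(P - r, axis=1) for r in reps], axis=0)
        reps.append(P[np.argmax(d)])                          # next: farthest from chosen reps
    reps = np.array(reps)
    return reps + alpha * (mu - reps)                         # shrink toward the centroid

pts = [(1, 1), (1, 2), (5, 5), (6, 6)]
print(cure_representatives(pts, m=1, alpha=0.2))              # [[5.45 5.5 ]]
```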
Parallelization in CURE
● Scalability Trick:
○ Split data into k random partitions.
○ Process each partition independently (parallelizable).
○ Merge results by clustering all representatives.
● Example:
○ 1M points → 10 partitions of 100K each.
○ Each partition → 100 clusters × 4 reps = 400 points.
○ Final merge: 4K points → manageable clustering.
○ Advantage: Avoids full O(n²) computations.
CURE vs. K-means vs. BIRCH
Parameters in CURE
● Number of Representatives (m): Typically 5–20.
● Shrinkage (α): 0.2–0.5 (balances outlier robustness and boundary
accuracy).
● Sample Size: As large as memory allows.
● Trade-offs:
○ Larger m: Better boundary detection but higher overhead.
○ Smaller α: More outlier resistance but less precise
boundaries.
Limitations of CURE
● Parameter Sensitivity: Performance depends on m, α, and
sample size.
● Order Dependency: Initial sampling affects results.
● Overhead: Representative selection adds computation.
● Workaround:
● Use multiple samples and aggregate (e.g., ensemble clustering).
DBSCAN: Density-Based Clustering
Handling Arbitrary Cluster Shapes and Noise
Key concepts: Core points, density-reachability, noise
Parameters: ε (epsilon) and minPts
Advantages over K-means and hierarchical clustering
Why DBSCAN?
Limitations of Traditional Methods:
K-means: Only convex clusters (fails on non-spherical shapes).
Hierarchical: Computationally expensive (O(n²)).
DBSCAN’s Solution:
Density-based: Clusters are dense regions separated by low-density
areas.
Noise Handling: Automatically identifies outliers.
Key Definitions
ε (epsilon): Radius of neighborhood around a point.
minPts: Minimum points to define a dense region.
Core Point: A point with ≥ minPts within ε.
Border Point: A point with fewer than minPts neighbors within ε, but lying within ε of a core point.
Noise Point: Neither a core point nor a border point (not reachable from any core point).
Density-Reachability and Connectivity
Density-Reachable: Point p is reachable from q if there’s a path
of core points within ε.
Density-Connected: Points p and q are density-connected if both are density-reachable from a common core point.
Cluster Definition:
A cluster is a set of density-connected points.
DBSCAN Algorithm Steps
● Random Start: Pick an unvisited point.
● Core Check: Count neighbors in ε-radius.
● If ≥ minPts, mark as core and expand cluster.
● Expand Cluster: Recursively add density-reachable points.
● Repeat: Until all points are visited.
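These steps can be run with scikit-learn's DBSCAN; the sketch below uses the 10-point dataset from the example and Question 2 later in this deck (ε = 1.5, minPts = 3):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same 10 points as the DBSCAN example / Question 2 below
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 3],
              [8, 8], [8, 9], [9, 8], [9, 9], [10, 10]])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(db.labels_)                  # cluster id per point; -1 would mark noise
print(db.core_sample_indices_)     # indices of the 8 core points
```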
Parameter Sensitivity
● Effects of ε and minPts:
● Large ε / small minPts: Fewer, larger clusters (may merge true clusters).
● Small ε / large minPts: More clusters, more noise.
● Rule of Thumb:
● minPts: Start with minPts ≥ dimensions + 1.
● ε: Use the k-distance plot (find the "knee" for the optimal ε).
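A sketch of the k-distance plot heuristic on placeholder data: sort every point's distance to its minPts-th nearest neighbor and look for the knee:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))    # placeholder data
min_pts = 3                                           # e.g., dimensions + 1

nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
kth_dist = np.sort(nn.kneighbors(X)[0][:, -1])        # distance to the min_pts-th neighbor

plt.plot(kth_dist)
plt.ylabel(f"distance to {min_pts}-th nearest neighbor")
plt.title("k-distance plot: choose eps near the knee")
plt.show()
```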
Advantages & Limitations
● Pros:
● Handles arbitrary shapes and noise.
● No need to specify cluster count (unlike K-means).
● Cons:
● Sensitive to ε and minPts.
● Struggles with varying densities.
● Alternatives:
● OPTICS: Handles varying ε (hierarchical density).
● HDBSCAN: Automates parameter selection.
DBSCAN vs. K-means vs. Hierarchical
Example: DBSCAN
● Pairwise Euclidean distances between the 10 points (✅ = within ε = 1.5, ❌ = outside ε):
Point (1,1) (1,2) (2,1) (2,2) (3,3) (8,8) (8,9) (9,8) (9,9) (10,10)
(1,1) 0 1✅ 1✅ 1.41✅ 2.83❌ 9.90❌ 10.63❌ 10.63❌ 11.31❌ 12.73❌
(1,2) 1✅ 0 1.41✅ 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌
(2,1) 1✅ 1.41✅ 0 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌
(2,2) 1.41✅ 1✅ 1✅ 0 1.41✅ 8.48❌ 9.22❌ 9.22❌ 9.90❌ 11.31❌
(3,3) 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0 7.07❌ 7.81❌ 7.81❌ 8.48❌ 9.90❌
(8,8) 9.90❌ 9.22❌ 9.22❌ 8.48❌ 7.07❌ 0 1✅ 1✅ 1.41✅ 2.83❌
(8,9) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 0 1.41✅ 1✅ 2.23❌
(9,8) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 1.41✅ 0 1✅ 2.23❌
(9,9) 11.31❌ 10.63❌ 10.63❌ 9.90❌ 8.48❌ 1.41✅ 1✅ 1✅ 0 1.41✅
(10,10) 12.73❌ 12.08❌ 12.08❌ 11.31❌ 9.90❌ 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0
Assignment-10 (Cs-101- 2024) (Week-10)
Source
Question-1
In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong to class
A, 15 to class B, and 5 to class C. What is the purity of this cluster?
a) 0.5
b) 0.6
c) 0.7
d) 0.8
Question-1- Correct answer
In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong
to class A, 15 to class B, and 5 to class C. What is the purity of this cluster?
a) 0.5
b) 0.6
c) 0.7
d) 0.8
Correct options: (b). Purity = (number of data points in the most frequent class) / (total number of data points) = 30/50 = 0.6.
Question-2
Consider the following 2D dataset with 10 points:
(1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)
Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?
a) 4
b) 5
c) 8
d) 10
Question-2-Explanation
Point (1,1) (1,2) (2,1) (2,2) (3,3) (8,8) (8,9) (9,8) (9,9) (10,10)
(1,1) 0 1✅ 1✅ 1.41✅ 2.83❌ 9.90❌ 10.63❌ 10.63❌ 11.31❌ 12.73❌
(1,2) 1✅ 0 1.41✅ 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌
(2,1) 1✅ 1.41✅ 0 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌
(2,2) 1.41✅ 1✅ 1✅ 0 1.41✅ 8.48❌ 9.22❌ 9.22❌ 9.90❌ 11.31❌
(3,3) 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0 7.07❌ 7.81❌ 7.81❌ 8.48❌ 9.90❌
(8,8) 9.90❌ 9.22❌ 9.22❌ 8.48❌ 7.07❌ 0 1✅ 1✅ 1.41✅ 2.83❌
(8,9) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 0 1.41✅ 1✅ 2.23❌
(9,8) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 1.41✅ 0 1✅ 2.23❌
(9,9) 11.31❌ 10.63❌ 10.63❌ 9.90❌ 8.48❌ 1.41✅ 1✅ 1✅ 0 1.41✅
(10,10) 12.73❌ 12.08❌ 12.08❌ 11.31❌ 9.90❌ 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0
Question-2- Correct answer
Consider the following 2D dataset with 10 points (1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)
Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?
a) 4
b) 5
c) 8
d) 10
Correct options: (c) To be a core point, it needs at least 3 points (including itself) within ϵ = 1.5
distance. There are 8 core points: (1,1), (1,2), (2,1), (2,2) from first group and (8,8), (8,9), (9,8), (9,9)
from second group.
Question-3
Question-3 - Correct answer
Correct options: (c)
Question-4
Which of the following properties are TRUE?
a) Using the CURE algorithm can lead to non-convex clusters.
b) K-means scales better than CURE for large datasets.
c) CURE is a simplification of K-means and hence scales better than
k-means for large datasets.
d) K-means being more expensive to run on large datasets, can give
non-convex clusters too.
Question-4 - Correct answer
Which of the following properties are TRUE?
a) Using the CURE algorithm can lead to non-convex clusters.
b) K-means scales better than CURE for large datasets.
c) CURE is a simplification of K-means and hence scales better than k-means for large
datasets.
d) K-means being more expensive to run on large datasets, can give non-convex clusters too.
Correct options: (a)
Question-5
The pairwise distance between 6 points is given below. Which of the option
shows the hierarchy of clusters created by single link clustering algorithm?
Question-5 - Explanation
Step 1: Connect closest pair of points. Closest pairs are:
[C1] d(P3, P4) = 1 , [C2] d(P5, P6) = 2, [C3] d(P1, P2) = 3
Step 2: Connect clusters with single link. The pair with the smallest single-link distance (marked [C4]) is combined:
d(C3, C1) = 8, [C4] d(C3, C2) = 4, d(C2, C1) = 6
Step 3: Connect the final 2 clusters
Question-5 - Correct answer
The pairwise distance between 6 points is given below. Which of the option
shows the hierarchy of clusters created by single link clustering algorithm?
Correct options: (b)
Question-6
For the pairwise distance matrix given in the previous question, which of the
following shows the hierarchy of clusters created by the complete link clustering
algorithm.
Question-6 - Explanation
Step 1: Connect closest pair of points. Closest pairs are:
[C1] d(P3, P4) = 1 , [C2] d(P5, P6) = 2, [C3] d(P1, P2) = 3
Step 2: Connect clusters with complete link. The pair with the smallest complete-link distance (marked [C4]) is combined:
d(C3, C1) = 9, d(C3, C2) = 10, [C4] d(C2, C1) = 8
Step 3: Connect the final 2 clusters
Question-6 - Correct answer
For the pairwise distance matrix given in the previous question, which of
the following shows the hierarchy of clusters created by the complete link
clustering algorithm.
Correct options: (d)
Suggestions and Feedback
Next Session:
Tuesday:
08-Apr-2025
6:00 - 8:00 PM