Clustering
What is Clustering?
• Clustering is the process of grouping data points into clusters based
on their similarities.
• Unsupervised learning: No predefined labels for data points.
• Similarity measure: Quantifies how similar data points are (e.g.,
Euclidean distance).
Clustering Algorithms:
• K-Means Clustering (Centroid-Based):
• Hierarchical Clustering:
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN):
K-Means Clustering (Centroid-Based):
• A popular and intuitive algorithm that partitions data points into a
pre-defined number of clusters (k).
• It iteratively assigns each data point to the closest cluster center (centroid)
and then recomputes each centroid as the mean of its assigned points (see the
sketch after this list).
• This process repeats until a stopping criterion (e.g., minimal
centroid movement) is met.
• Strengths: Simple to understand and implement, efficient for large
datasets.
• Weaknesses: Requires specifying the number of clusters (k)
beforehand, may not work well for non-spherical clusters.
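A minimal sketch of the idea using scikit-learn's KMeans on synthetic blob data; the dataset, k = 3, and all parameter values are illustrative assumptions rather than a recommended setup.

```python
# Minimal K-Means sketch (synthetic data; k=3 is an illustrative choice).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with three roughly spherical clusters.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

# n_clusters must be chosen beforehand; n_init restarts from several random
# initializations and keeps the run with the lowest inertia (within-cluster
# sum of squared distances).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroids after convergence
print(kmeans.inertia_)          # sum of squared distances to closest centroid
```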
Hierarchical Clustering:
• This family of algorithms builds a hierarchy of clusters, either in a top-down
(divisive) or bottom-up (agglomerative) fashion.
• Divisive clustering starts with all data points in one cluster and iteratively
divides them into smaller clusters based on a similarity measure.
• Agglomerative clustering starts with individual data points as separate
clusters and repeatedly merges the most similar pair of clusters until all points
form a single cluster (or a desired number of clusters is reached); a minimal
sketch follows this list.
• Strengths: Doesn't require pre-defining the number of clusters, good for
visualizing data relationships.
• Weaknesses: Can be computationally expensive for large datasets, the
resulting hierarchy might not be easily interpretable.
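A minimal sketch of the bottom-up (agglomerative) variant with scikit-learn; the synthetic data and the choice of three clusters are illustrative assumptions.

```python
# Sketch of agglomerative (bottom-up) clustering on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# 'ward' linkage merges, at each step, the pair of clusters whose merge
# least increases the total within-cluster variance.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])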
DBSCAN:
• This algorithm identifies clusters as areas of high density (many data
points close together) separated by areas of low density, and it treats isolated
points as outliers (noise); a minimal sketch follows this list.
• Strengths: Doesn't require specifying the number of clusters or
assuming specific cluster shapes, can handle noise.
• Weaknesses: May not work well for high-dimensional data,
parameter tuning can be challenging.
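A minimal DBSCAN sketch with scikit-learn; the synthetic data and the eps / min_samples values are illustrative assumptions that would normally need tuning.

```python
# Minimal DBSCAN sketch; eps and min_samples are illustrative values.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=1)

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 are treated as noise (outliers), not forced into a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", np.sum(labels == -1))
```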
Choosing the Right Algorithm:
The choice of clustering algorithm depends on various factors including:
• Data type (numerical vs. categorical)
• Expected number of clusters
• Presence of noise or outliers
• Computational efficiency needs
Use K-Means When:
• You have a good idea of the number of clusters in your data: K-Means
requires pre-specifying the number of clusters (k). If you have domain
knowledge or insights from visualizations that suggest the number of
clusters, K-Means can be a good choice.
• Dealing with large datasets: K-Means is computationally efficient because each
iteration only compares points against k centroids, so it scales well to large
datasets.
• Spherical-shaped clusters: K-Means works best when the clusters in your
data are roughly spherical and of similar size and density.
• Fast prototyping and initial exploration: Due to its simplicity, K-Means is a
good starting point for initial exploration and understanding the data
distribution before potentially using more complex algorithms.
Avoid K-Means When:
• Number of clusters is unknown: If you don't have a good idea of the
number of clusters, K-Means might not be ideal. Running it with the
wrong value of k can lead to suboptimal groupings.
• Non-spherical clusters: K-Means struggles with data where clusters
have irregular shapes (elongated, crescent-shaped).
• Presence of outliers: Outliers can significantly pull the centroids
(cluster centers) in K-Means, leading to skewed results (illustrated in the
sketch after this list).
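A toy illustration of that last point, assuming made-up data and fixed starting centroids: a single extreme point pulls one K-Means centroid noticeably away from the bulk of its cluster.

```python
# Toy illustration: one extreme outlier pulls a K-Means centroid (made-up data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
init = np.array([[0.0, 0.0], [5.0, 5.0]])  # fixed starting centroids for a fair comparison

clean = np.vstack([cluster_a, cluster_b])
with_outlier = np.vstack([clean, [[30.0, 30.0]]])  # one far-away point

for name, X in [("clean", clean), ("with outlier", with_outlier)]:
    km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
    # In the second run, the centroid that started at (5, 5) drifts toward the outlier.
    print(name, np.round(km.cluster_centers_, 2))
```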
Use Hierarchical Clustering When:
• Number of clusters is unknown: Hierarchical clustering doesn't
require pre-defining the number of clusters. You can explore the
resulting hierarchy and choose a cut-off point based on your needs.
• Data has complex cluster shapes: Hierarchical clustering can handle
clusters of irregular shapes (elongated, intertwined) better than K-
Means.
• Visualization of data relationships: The hierarchical structure (dendrogram)
created by hierarchical clustering is helpful for visualizing how data
points are related and grouped at different levels of granularity, as sketched
after this list.
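A sketch of exploring the hierarchy with SciPy and cutting it at a distance threshold instead of fixing the number of clusters up front; the synthetic data and the threshold value are illustrative assumptions.

```python
# Sketch: build the full hierarchy, plot it, then cut it at a chosen level.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=60, centers=3, random_state=7)

Z = linkage(X, method="ward")   # bottom-up merge history
dendrogram(Z)                   # visualize how points group at each level
plt.show()

# Cut the tree at a distance threshold read off the dendrogram
# (the threshold value 10.0 is an illustrative choice).
labels = fcluster(Z, t=10.0, criterion="distance")
print(labels)
```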
Avoid Hierarchical Clustering When:
• Dealing with very large datasets: Hierarchical clustering can be
computationally expensive for massive datasets because it starts from pairwise
distances between all points and repeatedly updates them as clusters merge.
• Need for highly efficient clustering: If speed is a critical factor, K-
Means might be a better choice.
• Interpretability of individual clusters: While the hierarchy provides
insights into data relationships, understanding the characteristics of
individual clusters within the hierarchy might require further analysis.
Use DBSCAN When:
• Uncertain Number of Clusters: Unlike K-Means, which requires specifying the
number of clusters beforehand, DBSCAN discovers clusters automatically based on
density. This makes it ideal when the true number of clusters in your data is unknown.
• Non-Spherical Clusters: K-Means struggles with data where clusters have irregular
shapes. DBSCAN excels in such cases, as it identifies clusters based on density rather
than assuming specific shapes (compare the sketch after this list).
• Presence of Noise or Outliers: Outliers can significantly impact centroid-based
algorithms like K-Means. DBSCAN is robust to outliers, effectively treating them as
noise and focusing on dense regions for cluster formation.
• Higher-Dimensional Data (with care): Many clustering algorithms struggle as the
number of features grows. DBSCAN can still be applied when distances between points
remain meaningful, but because its density estimates are computed from those
distances, very high-dimensional data remains a weakness (see above).
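A sketch contrasting K-Means and DBSCAN on crescent-shaped data (scikit-learn's make_moons); the eps and min_samples values are illustrative and would need tuning on real data.

```python
# Sketch comparing K-Means and DBSCAN on crescent-shaped (non-spherical) data.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-Means separates the two centroids with a straight boundary that cuts across
# the moons; DBSCAN follows the dense crescents and marks any stray points as -1.
print("K-Means labels:", set(km_labels))
print("DBSCAN labels: ", set(db_labels))
```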
Avoid DBSCAN when:
• Computational Cost: For very large datasets, DBSCAN can be
computationally expensive compared to K-Means, especially in high
dimensions. Consider data size and processing limitations.
• Parameter Tuning: DBSCAN relies on two key parameters: eps (the maximum
distance between two points for them to be considered neighbours) and minPts (the
minimum number of neighbouring points required for a point to count as a core
point). Choosing the right values is crucial for good results and usually requires
experimentation.
• Data Pre-processing: DBSCAN's density estimates are distance-based, so features
on very different scales can distort neighbourhoods. If your data has significant
scaling differences between features, normalization or standardization is usually
necessary for effective results, as sketched below.
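A minimal pre-processing sketch, assuming hypothetical data where one feature is on a much larger scale than the other; StandardScaler puts both features on comparable scales before DBSCAN's distance-based density calculation.

```python
# Sketch: standardize features before DBSCAN so one large-scale feature does not
# dominate the distance (and hence the density) computation.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical data: feature 0 in the 0-1 range, feature 1 in the thousands.
X = np.column_stack([rng.random(200), rng.normal(5000, 1000, 200)])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print(set(labels))
```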