
Clustering

What is Clustering?
• Clustering is the process of grouping data points into clusters based
on their similarities.
• Unsupervised learning: No predefined labels for data points.
• Similarity measure: Quantifies how similar data points are (e.g.,
Euclidean distance).
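For example, a minimal NumPy sketch of Euclidean distance as a similarity measure (the two points are made-up illustrative values):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.linalg.norm(a - b)
print(distance)  # 5.0 -- smaller distances mean more similar points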
Clustering Algorithms:
• K-Means Clustering (Centroid-Based):
• Hierarchical Clustering:
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN):
K-Means Clustering (Centroid-Based):
• A popular and intuitive algorithm that partitions data points into a
pre-defined number of clusters (k).
• It iteratively assigns data points to the closest cluster center (centroid)
and then recalculates the centroid based on the assigned points.
• This process continues until a stopping criterion (e.g., minimal
centroid movement) is met.
• Strengths: Simple to understand and implement, efficient for large
datasets.
• Weaknesses: Requires specifying the number of clusters (k)
beforehand, may not work well for non-spherical clusters.
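A minimal sketch of the K-Means workflow described above, assuming scikit-learn is available; make_blobs only generates synthetic spherical clusters for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters (k) must be chosen beforehand; n_init controls how many
# random centroid initializations are tried before keeping the best one
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster index assigned to the first 10 points
print(kmeans.cluster_centers_)  # final centroid positions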
Hierarchical Clustering:
• This family of algorithms builds a hierarchy of clusters, either in a top-down
(divisive) or bottom-up (agglomerative) fashion.
• Divisive clustering starts with all data points in one cluster and iteratively
divides them into smaller clusters based on a similarity measure.
• Agglomerative clustering starts with individual data points as separate
clusters and merges them based on similarity until a desired number of
clusters is reached.
• Strengths: Doesn't require pre-defining the number of clusters, good for
visualizing data relationships.
• Weaknesses: Can be computationally expensive for large datasets, the
resulting hierarchy might not be easily interpretable.
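A minimal agglomerative (bottom-up) sketch using SciPy, which also exposes the full merge hierarchy for visualization; the data here is synthetic:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

# Bottom-up merging using Ward linkage; Z records every merge step
Z = linkage(X, method="ward")

# Cut the hierarchy into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) can be plotted with matplotlib
# to visualize how data points are grouped at different levels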
DBSCAN:
• This algorithm identifies clusters based on areas of high density (many
data points close together) separated by areas of low density. It can
also handle outliers (noise points).
• Strengths: Doesn't require specifying the number of clusters or
assuming specific cluster shapes, can handle noise.
• Weaknesses: May not work well for high-dimensional data,
parameter tuning can be challenging.
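A minimal DBSCAN sketch using scikit-learn on synthetic non-spherical data (two interleaving half-moons); points DBSCAN treats as noise receive the label -1:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps: neighbourhood radius; min_samples: points needed to form a dense region
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))  # e.g. {0, 1}, plus -1 for any points treated as noise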
Choosing the Right Algorithm:
The choice of clustering algorithm depends on various factors including:
• Data type (numerical vs. categorical)
• Expected number of clusters
• Presence of noise or outliers
• Computational efficiency needs
Use K-Means When:
• You have a good idea of the number of clusters in your data: K-Means
requires pre-specifying the number of clusters (k). If you have domain
knowledge or insights from visualizations that suggest the number of
clusters, K-Means can be a good choice.
• Dealing with large datasets: K-Means is computationally efficient and
scales well to large datasets, since each iteration only requires computing
the distance from every point to the k centroids.
• Spherical-shaped clusters: K-Means works best when the clusters in your
data are roughly spherical or circular in shape.
• Fast prototyping and initial exploration: Due to its simplicity, K-Means is a
good starting point for initial exploration and understanding the data
distribution before potentially using more complex algorithms.
Avoid K-Means When:
• Number of clusters is unknown: If you don't have a good idea of the
number of clusters, K-Means might not be ideal. Initializing with the
wrong number of clusters can lead to suboptimal results.
• Non-spherical clusters: K-Means struggles with data where clusters
have irregular shapes (elongated, crescent-shaped).
• Presence of outliers: Outliers can significantly impact the centroids
(cluster centers) in K-Means, leading to skewed results.
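When the number of clusters is uncertain, one common heuristic (not covered on the slides above) is the elbow method: run K-Means for several values of k and look for the point where inertia (the within-cluster sum of squares) stops dropping sharply. A hedged sketch on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
# The "elbow" in the printed inertia values suggests a reasonable choice of k.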
Use Hierarchical Clustering When:
• Number of clusters is unknown: Hierarchical clustering doesn't
require pre-defining the number of clusters. You can explore the
resulting hierarchy and choose a cut-off point based on your needs.
• Data has complex cluster shapes: Hierarchical clustering can handle
clusters of irregular shapes (elongated, intertwined) better than K-
Means.
• Visualization of data relationships: The hierarchical structure created
by hierarchical clustering can be helpful for visualizing how data
points are related and grouped at different levels of granularity.
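A minimal sketch of choosing a cut-off point: instead of fixing the cluster count, the hierarchy can be cut at a distance threshold, and the number of clusters falls out of that choice (synthetic, well-separated data):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (15, 2)),
               rng.normal(2, 0.3, (15, 2)),
               rng.normal(5, 0.3, (15, 2))])

Z = linkage(X, method="average")

# criterion="distance": merges whose height exceeds the threshold t are not
# joined, so the cut-off distance determines how many clusters emerge
labels = fcluster(Z, t=1.0, criterion="distance")
print(np.unique(labels))  # likely three clusters for this well-separated data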
Avoid Hierarchical Clustering When:
• Dealing with very large datasets: Hierarchical clustering can be
computationally expensive for massive datasets due to the need to
compare all data points at each step.
• Need for highly efficient clustering: If speed is a critical factor, K-
Means might be a better choice.
• Interpretability of individual clusters: While the hierarchy provides
insights into data relationships, understanding the characteristics of
individual clusters within the hierarchy might require further analysis.
Use DBSCAN When:
• Uncertain Number of Clusters: Unlike K-Means, which requires specifying the
number of clusters beforehand, DBSCAN automatically discovers clusters based on
density. This makes it ideal when the true number of clusters is unknown in your data.
• Non-Spherical Clusters: K-Means struggles with data where clusters have irregular
shapes. DBSCAN excels in such cases, as it identifies clusters based on density rather
than assuming specific shapes.
• Presence of Noise or Outliers: Outliers can significantly impact centroid-based
algorithms like K-Means. DBSCAN is robust to outliers, effectively treating them as
noise and focusing on dense regions for cluster formation.
• Moderately High-Dimensional Data: DBSCAN can be applied to data with many
features, but note that density estimates become less meaningful as the number
of dimensions grows, so for very high-dimensional data it may not work well
(see the weaknesses noted above).
Avoid DBSCAN when:
• Computational Cost: For very large datasets, DBSCAN can be
computationally expensive compared to K-Means, especially in high
dimensions. Consider data size and processing limitations.
• Parameter Tuning: DBSCAN relies on two key parameters: eps (the maximum
distance between two points for them to be considered neighbours) and minPts
(the minimum number of neighbouring points required for a point to be a core
point). Choosing the right values is crucial for good results and usually
requires experimentation.
• Data Pre-processing: DBSCAN assumes some level of similarity within the
data for density-based clustering. If your data has significant scaling
differences between features, normalization or scaling might be necessary
for effective results.
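A minimal sketch of feature scaling before DBSCAN, assuming features on very different scales; StandardScaler puts them on comparable scales so the eps radius means the same thing in every dimension (the data below is made up):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Made-up data: one feature in the 0-1 range, another in the thousands
X = np.column_stack([rng.random(200), rng.random(200) * 1000])

X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print(set(labels))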
