Cluster Analysis:
Cluster analysis is a technique used in data mining and machine learning for
grouping a set of objects or observations into clusters (or groups), where
objects within the same cluster are more similar to each other than to those in
other clusters. It is often used to discover patterns or inherent groupings in
data. Clustering is a form of unsupervised learning: it does not rely on labeled
data but instead seeks natural groupings based on the inherent
structure of the data.
1. Introduction to Cluster Analysis
Cluster analysis aims to partition data into distinct groups such that:
• Objects within a cluster are as similar as possible to each other.
• Objects in different clusters are as dissimilar as possible from each
other.
Clustering can be used for a variety of purposes:
• Data exploration: To identify patterns or segments in data (e.g., market
segmentation in business).
• Data reduction: By grouping similar items and representing each group by
a summary (such as its centroid), you can reduce complexity and simplify
further analysis.
• Outlier detection: Outliers may be identified as data points that do not
fit into any cluster well.
2. Types of Clustering
Clustering methods can be broadly classified into several types, based on the
approach they take to form clusters.
a) Partitioning Clustering:
• This method divides the data into non-overlapping clusters where each
data point belongs to one and only one cluster.
• The goal is to partition the data such that the similarity within clusters is
maximized and the similarity between clusters is minimized.
Example: K-means clustering and K-medoids.
b) Hierarchical Clustering:
• This method builds a tree-like structure (dendrogram) to represent
clusters at different levels.
• It can either be agglomerative (bottom-up approach) or divisive
(top-down approach). In agglomerative clustering, each data point starts in its
own cluster, and the closest pair of clusters is merged at each step. In divisive
clustering, all data points start in one cluster, which is split recursively.
Example: Agglomerative Hierarchical Clustering (AHC).
c) Density-based Clustering:
• This method groups together points that are densely packed, and marks
points in low-density regions as outliers.
• It is especially useful when clusters are not spherical or have irregular
shapes.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with
Noise).
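The density idea above can be sketched in a few lines. This is a minimal,
unoptimized illustration of DBSCAN, not a reference implementation: eps (the
neighborhood radius) and min_pts (the minimum number of neighbors for a point
to count as dense) are the algorithm's two usual parameters, and the sample
points and parameter values are made up for the example.

```python
import math

NOISE = -1  # label used for points in low-density regions


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [
        j for j, q in enumerate(points)
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], q))) <= eps
    ]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is dense, so keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical data: two tight groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
```

Note that the number of clusters is not passed in; it falls out of the density
parameters, and the isolated point is labeled as noise rather than forced into
a cluster.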
d) Model-based Clustering:
• In this approach, the data are assumed to be generated by a statistical
model, typically a mixture of probability distributions with one
component per cluster (e.g., Gaussian mixture models). The algorithm fits
the model's parameters to the data and assigns each data point to a
cluster based on the probability that it belongs to each component.
Example: Gaussian Mixture Models (GMM).
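As a rough illustration of the probability-based assignment described above,
the sketch below computes the membership probabilities of a point under a
one-dimensional, two-component Gaussian mixture. The mixture parameters
(weights, means, standard deviations) are hypothetical and assumed to be
already known; estimating them from data (commonly done with the
expectation-maximization algorithm) is not shown.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, components):
    """Probability that x belongs to each component.

    components is a list of (weight, mean, std) tuples.
    """
    weighted = [w * normal_pdf(x, mu, sd) for w, mu, sd in components]
    total = sum(weighted)
    return [v / total for v in weighted]


# Hypothetical mixture: two equally weighted components centered at 0 and 5.
mixture = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]

for x in (0.0, 2.5, 5.0):
    print(x, responsibilities(x, mixture))
```

A point halfway between the two means gets equal probability for both
components, while a point sitting on one mean is assigned almost entirely to
that component.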
e) Overlapping Clustering:
• In overlapping clustering, a data point can belong to more than one
cluster.
• The membership of each data point in each cluster is represented by a
membership degree or probability.
Example: Fuzzy C-means clustering.
3. Correlation and Distances in Clustering
a) Correlation:
• Correlation measures the strength of the linear relationship between two
variables or two observation profiles. In clustering, it can be used to
determine how similarly two profiles rise and fall across their features.
• Most clustering algorithms use distance-based metrics, but
correlation-based dissimilarity (for example, 1 minus the correlation
coefficient) is useful when you are more concerned with the pattern of
relationships than with absolute values, because correlation is
unaffected by shifting or rescaling the data.
Example: Pearson correlation or Spearman rank correlation.
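A small sketch of a correlation-based dissimilarity, assuming the common
convention of 1 minus the Pearson correlation. The three profiles are made-up
values chosen to show that two profiles with the same shape but very
different magnitudes are treated as close.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Close to 0 for matching patterns, close to 2 for opposite patterns."""
    return 1.0 - pearson(x, y)


# Hypothetical profiles: same rising pattern at different magnitudes,
# and one falling pattern.
low = [1.0, 2.0, 3.0, 4.0]
high = [10.0, 20.0, 30.0, 40.0]
falling = [4.0, 3.0, 2.0, 1.0]

print(correlation_distance(low, high))     # same pattern
print(correlation_distance(low, falling))  # opposite pattern
```

Under Euclidean distance, low and high would be far apart; under correlation
distance they are essentially identical.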
b) Distance Measures:
• In most clustering algorithms, distance metrics are used to quantify the
similarity or dissimilarity between two data points. Common distance
measures include:
o Euclidean Distance: The straight-line distance between two points
in a multidimensional space.
o Manhattan Distance: The sum of absolute differences between
coordinates.
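o Cosine Similarity: Measures the cosine of the angle between two
vectors, used when the magnitude of vectors is less important
than their direction. Because it is a similarity rather than a
distance, it is usually converted to a dissimilarity as 1 minus
the cosine similarity.
o Minkowski Distance: A generalization of both Euclidean and
Manhattan distances, controlled by an order parameter p
(p = 1 gives Manhattan distance, p = 2 gives Euclidean distance).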
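The measures listed above can be written compactly as follows. This is a
plain-Python sketch for illustration; the function names and sample vectors
are chosen for the example, and p denotes the Minkowski order parameter
(p = 1 for Manhattan, p = 2 for Euclidean).

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)


def manhattan(x, y):
    return minkowski(x, y, 1)


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))   # straight-line distance
print(manhattan(a, b))   # sum of absolute coordinate differences

# Same direction, different magnitude: similarity is (numerically) 1.
u, v = (1.0, 2.0), (2.0, 4.0)
print(cosine_similarity(u, v))
```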
4. Clustering by Partitioning Methods
a) K-means Clustering:
• One of the most widely used partitioning clustering methods.
• The K-means algorithm aims to partition the data into K clusters by
minimizing the sum of squared distances between each point and the
centroid (mean) of the cluster to which it belongs.
Steps:
1. Choose the number of clusters K.
2. Randomly initialize K centroids.
3. Assign each data point to the nearest centroid.
4. Recompute the centroids as the mean of the points assigned to
each cluster.
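5. Repeat steps 3 and 4 until convergence, that is, until the cluster
assignments no longer change (or a maximum number of iterations is
reached).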
• Pros:
o Simple, fast, and scalable for large datasets.
o Works well when clusters are spherical and equally sized.
• Cons:
o Sensitive to initialization of centroids; different starting points
can lead to different final clusters.
o Requires the number of clusters (K) to be specified in advance.
o Struggles with non-spherical clusters.
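The five K-means steps listed above can be sketched as follows. This is a
minimal illustration rather than a production implementation: it assumes
points are given as tuples, initializes centroids by picking K data points at
random (one common way to carry out step 2), and uses made-up sample data.
The max_iter and seed parameters are conveniences added for the example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) for the given points and K (step 1)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: random initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids


# Hypothetical data: two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only
the grouping itself is meaningful, not the label numbers.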
b) K-medoids Clustering:
• Similar to K-means, but instead of using the mean of the points in a
cluster to define the centroid, it uses an actual data point (medoid) as
the cluster center. This makes it more robust to outliers than K-means,
because a medoid is not pulled toward extreme values the way a mean is.
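To see why a medoid is less affected by outliers than a mean, the sketch
below compares the two on a single one-dimensional cluster containing one
extreme value. The data are made up, and the medoid is taken to be the member
with the smallest total distance to all other members, which is the usual
definition.

```python
def medoid(points, dist):
    """The member of points with the smallest total distance to the others."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


def absolute_difference(a, b):
    return abs(a - b)


# Hypothetical cluster with one outlier (100.0).
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = sum(cluster) / len(cluster)
center_medoid = medoid(cluster, absolute_difference)

print(center_mean)    # dragged toward the outlier
print(center_medoid)  # stays among the bulk of the points
```

Only the cluster-center computation is shown here; a full K-medoids
algorithm would also alternate between assigning points and updating medoids.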
5. Hierarchical Clustering
Hierarchical clustering creates a tree of clusters (also called a dendrogram) that
can be used to visualize the relationships between the data points at various
levels of granularity. It can be agglomerative or divisive:
a) Agglomerative (Bottom-Up):
• Start with each data point as its own cluster.
• Merge the closest clusters at each step until all points belong to a single
cluster.
Linkage Methods (how the distance between two clusters is measured):
o Single linkage: The shortest distance between any two points, one from
each cluster.
o Complete linkage: The longest distance between any two points, one from
each cluster.
o Average linkage: The average distance over all pairs of points, one from
each cluster.
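The three linkage definitions above differ only in how they summarize the
pairwise distances between two clusters. A minimal sketch, assuming Euclidean
distance and two small made-up clusters:

```python
import math
from itertools import product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise_distances(cluster_a, cluster_b):
    """Distances between every point in cluster_a and every point in cluster_b."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Hypothetical clusters lying on the x-axis.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(left, right))
print(complete_linkage(left, right))
print(average_linkage(left, right))
```

In agglomerative clustering, the chosen linkage is evaluated for every pair
of current clusters, and the pair with the smallest value is merged.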
b) Divisive (Top-Down):
• Start with all points in one cluster and recursively split the cluster into
two until each point is its own cluster.
6. Overlapping Clustering
In overlapping clustering, a data point can belong to more than one cluster
with different membership degrees.
Fuzzy clustering is a common approach where each data point is associated
with a degree of membership to each cluster.
Fuzzy C-means Clustering:
• It is an extension of K-means where each data point has a membership
value between 0 and 1 for each cluster, instead of being fully assigned to
just one cluster.
• The membership values are updated based on the distance of data
points from the cluster centers.
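The distance-based membership update can be sketched as below. This assumes
the standard Fuzzy C-means formulation, which introduces a fuzziness
parameter m > 1 that is not discussed in these notes (m = 2 is a common
default). The cluster centers and query points are made up, and only the
membership step is shown, not the full iterative algorithm.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def memberships(point, centers, m=2.0):
    """Membership degree of point in each cluster; the degrees sum to 1.

    m is the fuzziness parameter of standard Fuzzy C-means (assumed here).
    """
    dists = [euclidean(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers: split full
        # membership among those centers and give the rest zero.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
        for d_j in dists
    ]


# Hypothetical cluster centers.
centers = [(0.0, 0.0), (4.0, 0.0)]

print(memberships((1.0, 0.0), centers))  # closer to the first center
print(memberships((2.0, 0.0), centers))  # equidistant from both
```

Closer centers receive higher membership, and a point equidistant from both
centers is shared equally between them.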
7. K-means Clustering
As discussed earlier, K-means is a partitioning method used for clustering. The
K-means algorithm attempts to minimize the intra-cluster variance by
optimizing the centroids of the clusters.
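Written out, the quantity K-means minimizes is the within-cluster sum of
squared distances. The notation here is an assumption made for illustration:
C_k denotes the set of points assigned to cluster k, mu_k its centroid, and
x_i a data point.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

The assignment step lowers J by moving each point to its nearest centroid,
and the update step lowers J by replacing each centroid with the mean of its
points, so J never increases from one iteration to the next.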
Challenges:
• The choice of K (number of clusters) is critical, and the algorithm is
sensitive to the initial placement of centroids.
• K-means assumes spherical clusters and may not work well with
non-convex shapes.
8. Profiling and Interpreting Clusters
Once the clusters are formed, profiling is the process of describing and
interpreting the characteristics of each cluster. This helps to understand the
distinctive features of each cluster.
Profiling Clusters:
• Analyze the mean or median values of the variables in each cluster.
• Use descriptive statistics (e.g., frequency counts, averages) to
summarize cluster characteristics.
• Compare clusters across the variables used for clustering to identify the
main features that differentiate them.
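A simple way to carry out the profiling steps above is to group observations
by cluster label and summarize each variable. The sketch below is
illustrative only: the variable names (spend, visits), the customer records,
and the cluster labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def profile_clusters(rows, labels, variables):
    """Size, mean, and median of each variable, per cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)

    profile = {}
    for label, members in sorted(groups.items()):
        profile[label] = {"size": len(members)}
        for var in variables:
            values = [member[var] for member in members]
            profile[label][var] = {
                "mean": mean(values),
                "median": median(values),
            }
    return profile


# Hypothetical customers and the cluster label assigned to each one.
customers = [
    {"spend": 100, "visits": 2},
    {"spend": 120, "visits": 4},
    {"spend": 900, "visits": 20},
    {"spend": 1100, "visits": 30},
]
cluster_labels = [0, 0, 1, 1]

summary = profile_clusters(customers, cluster_labels, ["spend", "visits"])
for label, stats in summary.items():
    print(label, stats)
```

Reading the summaries side by side shows which variables separate the
clusters most strongly, which is the starting point for interpretation.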
Interpreting Clusters:
• The interpretation depends on the context of the data.
• For example, in customer segmentation, a cluster may represent a group
of customers who are price-sensitive, while another cluster might
represent a group who prioritize product quality.
• Visualizations, like box plots, scatter plots, and radar charts, can help
compare and interpret the cluster profiles.
Conclusion
Cluster analysis is a powerful tool for grouping similar data points
into clusters based on distance or similarity measures. The different
methods of clustering, including partitioning methods (like K-means),
hierarchical clustering, density-based clustering, model-based
clustering, and overlapping clustering, offer different approaches
depending on the type of data and the underlying structure of the
problem. The key challenges in clustering lie in selecting the right
method, determining the number of clusters, and properly interpreting
the results.