Clustering Analysis
Clustering is an unsupervised learning technique used to group similar data points into clusters based
on their characteristics or patterns. It helps in identifying hidden structures or relationships within data
without using predefined labels.
Each cluster is formed such that:
Data points within the same cluster are more similar to each other than to points in other clusters.
Data points in different clusters are clearly dissimilar.
Applications of Clustering:
Customer segmentation in marketing.
Image compression and pattern recognition.
Anomaly detection in network security.
Grouping genes in biological data analysis.
Common clustering techniques include K-Means, Hierarchical Clustering, and Density-Based Clustering.
Steps in Clustering:
1. Data Preprocessing:
o Normalize data to ensure all features are on the same scale.
o Handle missing values and outliers.
2. Choose a Clustering Algorithm:
o Select an appropriate method based on data characteristics (e.g., K-Means for spherical
clusters, DBSCAN for arbitrary shapes).
3. Define a Similarity/Dissimilarity Metric:
o Common metrics: Euclidean distance, Manhattan distance.
4. Run the Algorithm:
o Execute the clustering process iteratively.
5. Evaluate Results:
o Use metrics like silhouette score, Dunn index, or Davies-Bouldin index to assess cluster
quality.
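As a quick illustration of these five steps, here is a minimal end-to-end sketch assuming scikit-learn; the synthetic blob data and all parameter values are placeholders chosen for the example, not part of the original notes.

# End-to-end sketch: preprocess, cluster, evaluate (parameters are illustrative).
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# 1. Preprocessing: generate toy data, then put every feature on the same scale.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

# 2-4. Choose an algorithm (K-Means with Euclidean distance) and run it.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# 5. Evaluate cluster quality.
print("Silhouette score:", silhouette_score(X, labels))          # closer to 1 is better
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better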
K-Means Clustering
Introduction
K-Means clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into K distinct clusters.
How K-Means Clustering Works
1. Initialization
Select the number of clusters, K (the Elbow Method or the silhouette score can help choose a suitable K).
Randomly initialize K cluster centroids, e.g., by picking K data points at random.
2. Assignment Step
Assign each data point to the cluster whose centroid is closest to it.
The closeness is usually measured using the Euclidean distance or other distance metrics.
3. Update Step
Recalculate the centroids of each cluster by taking the mean of all data points assigned to that
cluster.
4. Repeat
Repeat the Assignment and Update steps until:
o The cluster assignments no longer change, or
o The centroids stabilize, or
o A predefined maximum number of iterations is reached.
5. Output
The algorithm outputs K clusters and their centroids. Each data point belongs to exactly one cluster.
Key Points:
K-Means aims to minimize the within-cluster sum of squares (WCSS), which represents the total
variance within clusters.
It works best with spherical, well-separated clusters and requires the user to predefine K (a bare-bones sketch of the algorithm follows below).
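To make the Assignment and Update steps concrete, here is a bare-bones NumPy sketch of the loop. It is an illustration only, assuming the data fits in memory and that no cluster ever becomes empty during the iterations; in practice a library implementation (e.g., scikit-learn's KMeans) would be used.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids stabilize (or max_iter is reached).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centroids, wcss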
The Elbow Method
The Elbow Method is a technique used to determine the optimal number of clusters (K) for K-Means clustering. It helps find the K that best balances the trade-off between minimizing the within-cluster variance and avoiding overfitting with too many clusters.
Steps to Apply the Elbow Method:
1. Run K-Means for a Range of K:
o Perform K-Means clustering for a range of values of K (e.g., K = 1 to K = 10).
2. Calculate the WCSS (Within-Cluster Sum of Squares):
o For each K, calculate the WCSS, which measures the variance within each cluster.
3. Plot K vs. WCSS:
o Create a line plot with K on the x-axis and WCSS on the y-axis.
4. Identify the "Elbow Point":
o Look for the point on the curve where the reduction in WCSS slows down significantly.
This point resembles an "elbow" in the plot.
o The K corresponding to the elbow is considered the optimal number of clusters.
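A minimal sketch of these steps, assuming scikit-learn and matplotlib with synthetic blob data (K-Means' inertia_ attribute is exactly the WCSS):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# 1-2. Run K-Means for K = 1 to 10 and record the WCSS for each K.
k_values = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in k_values]

# 3-4. Plot K vs. WCSS and look for the "elbow" where the curve flattens.
plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()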
Hierarchical Clustering:
Hierarchical Clustering is an unsupervised machine learning technique used to group data into a
hierarchy of clusters. Unlike K-Means, it does not require the user to specify the number of clusters
upfront.
There are two main approaches:
1. Agglomerative (Bottom-Up)
o Start with each data point as its own cluster.
o Iteratively merge the closest clusters until all points are grouped into a single cluster or a
desired number of clusters is achieved.
2. Divisive (Top-Down)
o Start with all data points in a single cluster.
o Iteratively split clusters until each data point is its own cluster or a desired number of
clusters is reached.
How it Works
1. Compute a Distance Matrix:
o Calculate the distances between all pairs of data points using metrics like Euclidean,
Manhattan, or Cosine distance.
2. Linkage Criteria (for Agglomerative):
o Single Linkage: Distance between the nearest points in two clusters.
o Complete Linkage: Distance between the farthest points in two clusters.
o Average Linkage: Average distance between all points in two clusters.
o Centroid Linkage: Distance between centroids of two clusters.
o Ward Linkage: At each step, merges the two clusters whose merger gives the smallest increase in total within-cluster variance.
3. Build a Dendrogram:
o A dendrogram is a tree-like structure that shows the sequence of merges or splits.
o The y-axis represents the distance or dissimilarity between clusters.
4. Decide the Number of Clusters:
o Cut the dendrogram at a chosen distance to form clusters.
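The sketch below illustrates agglomerative clustering with SciPy, assuming Ward linkage and toy blob data; cutting the dendrogram into three clusters is an arbitrary choice made for the example.

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Agglomerative clustering with Ward linkage (merges that minimize within-cluster variance).
Z = linkage(X, method="ward")

# Dendrogram: the y-axis is the distance at which clusters are merged.
dendrogram(Z)
plt.ylabel("Distance")
plt.show()

# "Cut" the dendrogram to obtain a flat partition with 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")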
Advantages
Does not require the number of clusters to be pre-specified.
Produces a hierarchy, offering insights into cluster relationships.
Disadvantages
Computationally expensive for large datasets (time complexity O(n³)).
Sensitive to noise and outliers.
Density-Based Clustering
Density-Based Clustering is a class of clustering algorithms that groups data points based on their
density in the feature space. It is particularly useful for identifying clusters of arbitrary shape and
handling noise in the data.
Key Concept:
Clusters are formed by regions of high data point density, separated by regions of low density.
Unlike K-Means, which assumes spherical clusters, density-based methods can find clusters of
any shape, making them more flexible for real-world data.
Popular Density-Based Clustering Algorithm: DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
How DBSCAN Works:
1. Core Points, Border Points, and Noise Points:
o Core Point: A point that has at least a specified minimum number of points (MinPts) within a given radius (ϵ).
o Border Point: A point that is not a core point, but lies within the ϵ distance from a core
point.
o Noise Point (Outlier): A point that is neither a core point nor a border point. It does not
belong to any cluster.
2. Steps:
o Initialization: Start with a random point and check if it’s a core point.
o Expansion: If the point is a core point, form a cluster by including all points that are directly reachable from it (within ϵ).
o Iteration: Recursively expand the cluster by adding neighboring points, which may also
become core points.
o Cluster Formation: Once a core point’s neighbors are expanded, continue with the next
unvisited point and repeat the process.
3. Parameters:
o ϵ: The radius (neighborhood) within which points are considered neighbors.
o MinPts: The minimum number of points required to form a dense region (core point).
4. Cluster Output:
o Points that are part of the same dense region form a cluster.
o Points that are outliers (noise) do not belong to any cluster.
Advantages:
Arbitrary Shape Clusters: DBSCAN can find clusters of arbitrary shapes, unlike K-Means, which
assumes spherical clusters.
Noise Handling: It naturally identifies outliers as noise points.
No Need to Specify Number of Clusters: Unlike K-Means, DBSCAN does not require the user to
specify the number of clusters in advance.
Disadvantages:
Sensitive to Parameters: The choice of ϵ and MinPts can significantly impact the
clustering result.
Not Efficient for Large Datasets: DBSCAN can be computationally expensive for large datasets.
Example:
ϵ = 0.5 and MinPts = 4: If a point has at least 4 other points within a distance of 0.5, it becomes a core point, and a cluster is formed by expanding from that core.
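A minimal DBSCAN sketch with scikit-learn on two interleaving half-moons (a shape K-Means handles poorly). Note that ϵ is data-dependent, so the radius below is tuned to this toy dataset rather than taken from the example above; also, scikit-learn counts the point itself toward min_samples.

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=4).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise points (outliers)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(n_clusters, "clusters,", n_noise, "noise points")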
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction
while preserving as much variance (information) in the data as possible. PCA transforms the original
features into a new set of orthogonal (uncorrelated) features, called principal components, which
are ordered by the amount of variance they capture from the data.
How PCA Works:
1. Standardize the Data:
o Since PCA is affected by the scale of the features, the data is often standardized (e.g.,
using z-scores) so that each feature has a mean of 0 and a standard deviation of 1. This
ensures that features with larger scales don't dominate the analysis.
2. Compute the Covariance Matrix:
o The covariance matrix represents how each pair of features in the dataset is related to
each other. A large covariance (in magnitude) means the features vary together, while a covariance
near zero means the features are largely uncorrelated.
3. Calculate Eigenvalues and Eigenvectors:
o The covariance matrix is decomposed into eigenvalues and eigenvectors.
Eigenvectors: Directions of maximum variance in the data. These represent the
principal components.
Eigenvalues: The magnitude of variance along each principal component. The
larger the eigenvalue, the more variance is explained by the corresponding
eigenvector.
4. Sort Eigenvalues and Eigenvectors:
o The eigenvectors are sorted in descending order of their eigenvalues. The eigenvector
with the highest eigenvalue becomes the first principal component (PC1), the second-
highest becomes the second principal component (PC2), and so on.
5. Select the Top k Principal Components:
o Based on the eigenvalues, select the top k eigenvectors that explain the most variance
in the data. This step reduces the dimensionality of the data.
6. Project the Data:
o The original data is projected onto the selected eigenvectors (principal components).
This step reduces the number of dimensions while retaining most of the variance.
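The NumPy sketch below mirrors steps 1-6 directly. It is a from-scratch illustration that assumes no feature has zero variance; in practice a library implementation (e.g., scikit-learn's PCA) would be used.

import numpy as np

def pca_manual(X, k):
    # Step 1: standardize each feature (mean 0, standard deviation 1).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigenvalues (variance per component) and eigenvectors (directions).
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
    # Step 4: sort both in descending order of eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Steps 5-6: keep the top k eigenvectors and project the data onto them.
    return X_std @ eigvecs[:, :k], eigvals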
Example:
Suppose you have a dataset with 3 features. After applying PCA, you could reduce it to 2
dimensions (2 principal components) while retaining most of the information from the original 3
features.
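For instance, with scikit-learn (the three correlated features below are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy dataset: 150 samples, 3 features, two of them strongly correlated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=150)
X = np.column_stack([x1, 0.9 * x1 + 0.1 * rng.normal(size=150), rng.normal(size=150)])

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                # (150, 2): 3 features reduced to 2 components
print(pca.explained_variance_ratio_)  # share of variance captured by PC1 and PC2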
Key Concepts:
Explained Variance: Measures how much information (variance) is captured by each principal
component. The first principal component captures the highest variance, the second captures
the next highest, and so on.
Dimensionality Reduction: PCA helps to reduce the number of features (dimensions) while
preserving the data’s variance, making it easier to visualize and analyze high-dimensional data.
Advantages:
Noise Reduction: By keeping only the top principal components, PCA can reduce noise and
irrelevant features.
Visualization: Reducing dimensions to 2 or 3 allows for easy visualization of data.
Improved Model Performance: Reducing the feature space may improve the performance of
machine learning models by removing correlated features.
Disadvantages:
Interpretability: The new components (PCs) are linear combinations of the original features,
which may not be easily interpretable.
Assumes Linear Relationships: PCA assumes linear relationships between features, which may
not capture complex nonlinear patterns in the data.