Clustering Analysis
Clustering is an unsupervised learning technique used to group similar data points into clusters based
on their characteristics or patterns. It helps in identifying hidden structures or relationships within data
without using predefined labels.
Each cluster is formed such that:
Data points within the same cluster are more similar to each other than to points in other clusters.
Data points in different clusters are clearly dissimilar.
Applications of Clustering:
Customer segmentation in marketing.
Image compression and pattern recognition.
Anomaly detection in network security.
Grouping genes in biological data analysis.
Common clustering techniques include K-Means, Hierarchical Clustering, and Density-Based Clustering.
Steps in Clustering:
1. Data Preprocessing:
o Normalize data to ensure all features are on the same scale.
o Handle missing values and outliers.
2. Choose a Clustering Algorithm:
o Select an appropriate method based on data characteristics (e.g., K-Means for spherical
clusters, DBSCAN for arbitrary shapes).
3. Define a Similarity/Dissimilarity Metric:
o Common metrics: Euclidean distance, Manhattan distance.
4. Run the Algorithm:
o Execute the clustering process iteratively.
5. Evaluate Results:
o Use metrics like silhouette score, Dunn index, or Davies-Bouldin index to assess cluster
quality.
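As a quick illustration of these five steps, here is a minimal end-to-end sketch assuming scikit-learn; the synthetic blob data and all parameter values are placeholders chosen for the example, not part of the original notes.

# End-to-end sketch: preprocess, cluster, evaluate (parameters are illustrative).
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# 1. Preprocessing: generate toy data, then put every feature on the same scale.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

# 2-4. Choose an algorithm (K-Means with Euclidean distance) and run it.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# 5. Evaluate cluster quality.
print("Silhouette score:", silhouette_score(X, labels))          # closer to 1 is better
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better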
K-Means Clustering
Introduction
K-Means clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into K distinct clusters.
How K-Means Clustering Works
1. Initialization
Select the number of clusters, K (the Elbow Method or the silhouette score can help choose a suitable K).
Randomly initialize K cluster centroids, e.g., by picking K data points at random.
2. Assignment Step
Assign each data point to the cluster whose centroid is closest to it.
The closeness is usually measured using the Euclidean distance or other distance metrics.
3. Update Step
Recalculate the centroids of each cluster by taking the mean of all data points assigned to that
cluster.
4. Repeat
Repeat the Assignment and Update steps until:
o The cluster assignments no longer change, or
o The centroids stabilize, or
o A predefined maximum number of iterations is reached.
5. Output
The algorithm outputs K clusters and their centroids. Each data point belongs to exactly one cluster.
Key Points:
K-Means aims to minimize the within-cluster sum of squares (WCSS), which represents the total
variance within clusters.
It works best with spherical, well-separated clusters and requires the user to predefine K (a bare-bones sketch of the algorithm follows below).
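To make the Assignment and Update steps concrete, here is a bare-bones NumPy sketch of the loop. It is an illustration only, assuming the data fits in memory and that no cluster ever becomes empty during the iterations; in practice a library implementation (e.g., scikit-learn's KMeans) would be used.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids stabilize (or max_iter is reached).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centroids, wcss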
The Elbow Method
The Elbow Method is a technique used to determine the optimal number of clusters (K) for K-Means clustering. It helps find the K that best balances the trade-off between minimizing the within-cluster variance and avoiding overfitting with too many clusters.
Steps to Apply the Elbow Method:
1. Run K-Means for a Range of K:
o Perform K-Means clustering for a range of values of K (e.g., K = 1 to K = 10).
2. Calculate the WCSS (Within-Cluster Sum of Squares):
o For each K, calculate the WCSS, which measures the variance within each cluster.
3. Plot K vs. WCSS:
o Create a line plot with K on the x-axis and WCSS on the y-axis.
4. Identify the "Elbow Point":
o Look for the point on the curve where the reduction in WCSS slows down significantly.
This point resembles an "elbow" in the plot.
o The K corresponding to the elbow is considered the optimal number of clusters.
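A minimal sketch of these steps, assuming scikit-learn and matplotlib with synthetic blob data (K-Means' inertia_ attribute is exactly the WCSS):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# 1-2. Run K-Means for K = 1 to 10 and record the WCSS for each K.
k_values = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in k_values]

# 3-4. Plot K vs. WCSS and look for the "elbow" where the curve flattens.
plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()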
Hierarchical Clustering:
Hierarchical Clustering is an unsupervised machine learning technique used to group data into a
hierarchy of clusters. Unlike K-Means, it does not require the user to specify the number of clusters
upfront.
There are two main approaches:
1. Agglomerative (Bottom-Up)
o Start with each data point as its own cluster.
o Iteratively merge the closest clusters until all points are grouped into a single cluster or a
desired number of clusters is achieved.
2. Divisive (Top-Down)
o Start with all data points in a single cluster.
o Iteratively split clusters until each data point is its own cluster or a desired number of
clusters is reached.
How it Works
1. Compute a Distance Matrix:
o Calculate the distances between all pairs of data points using metrics like Euclidean,
Manhattan, or Cosine distance.
2. Linkage Criteria (for Agglomerative):
o Single Linkage: Distance between the nearest points in two clusters.
o Complete Linkage: Distance between the farthest points in two clusters.
o Average Linkage: Average distance between all points in two clusters.
o Centroid Linkage: Distance between centroids of two clusters.
o Ward Linkage: At each step, merges the two clusters whose merger gives the smallest increase in total within-cluster variance.
3. Build a Dendrogram:
o A dendrogram is a tree-like structure that shows the sequence of merges or splits.
o The y-axis represents the distance or dissimilarity between clusters.
4. Decide the Number of Clusters:
o Cut the dendrogram at a chosen distance to form clusters.
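The sketch below illustrates agglomerative clustering with SciPy, assuming Ward linkage and toy blob data; cutting the dendrogram into three clusters is an arbitrary choice made for the example.

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Agglomerative clustering with Ward linkage (merges that minimize within-cluster variance).
Z = linkage(X, method="ward")

# Dendrogram: the y-axis is the distance at which clusters are merged.
dendrogram(Z)
plt.ylabel("Distance")
plt.show()

# "Cut" the dendrogram to obtain a flat partition with 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")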
Advantages
Does not require the number of clusters to be pre-specified.
Produces a hierarchy, offering insights into cluster relationships.
Disadvantages
Computationally expensive for large datasets (time complexity O(n³)).
Sensitive to noise and outliers.
Density-Based Clustering
Density-Based Clustering is a class of clustering algorithms that groups data points based on their
density in the feature space. It is particularly useful for identifying clusters of arbitrary shape and
handling noise in the data.
Key Concept:
Clusters are formed by regions of high data point density, separated by regions of low density.
Unlike K-Means, which assumes spherical clusters, density-based methods can find clusters of
any shape, making them more flexible for real-world data.
Popular Density-Based Clustering Algorithm: DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
How DBSCAN Works:
1. Core Points, Border Points, and Noise Points:
o Core Point: A point that has at least a specified minimum number of points (MinPts) within a given radius (ϵ).
o Border Point: A point that is not a core point, but lies within the ϵ distance from a core
point.
o Noise Point (Outlier): A point that is neither a core point nor a border point. It does not
belong to any cluster.
2. Steps:
o Initialization: Start with a random point and check if it’s a core point.
o Expansion: If the point is a core point, form a cluster by including all points that are directly reachable from it (within ϵ).
o Iteration: Recursively expand the cluster by adding neighboring points, which may also
become core points.
o Cluster Formation: Once a core point’s neighbors are expanded, continue with the next
unvisited point and repeat the process.
3. Parameters:
o ϵ: The radius (neighborhood) within which points are considered neighbors.
o MinPts: The minimum number of points required to form a dense region (core point).
4. Cluster Output:
o Points that are part of the same dense region form a cluster.
o Points that are outliers (noise) do not belong to any cluster.
Advantages:
Arbitrary Shape Clusters: DBSCAN can find clusters of arbitrary shapes, unlike K-Means, which
assumes spherical clusters.
Noise Handling: It naturally identifies outliers as noise points.
No Need to Specify Number of Clusters: Unlike K-Means, DBSCAN does not require the user to
specify the number of clusters in advance.
Disadvantages:
Sensitive to Parameters: The choice of ϵ and MinPts can significantly impact the
clustering result.
Not Efficient for Large Datasets: DBSCAN can be computationally expensive for large datasets.
Example:
ϵ = 0.5 and MinPts = 4: If a point has at least 4 other points within a distance of 0.5, it becomes a core point, and a cluster is formed by expanding from that core.
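A minimal DBSCAN sketch with scikit-learn on two interleaving half-moons (a shape K-Means handles poorly). Note that ϵ is data-dependent, so the radius below is tuned to this toy dataset rather than taken from the example above; also, scikit-learn counts the point itself toward min_samples.

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=4).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise points (outliers)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(n_clusters, "clusters,", n_noise, "noise points")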
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction
while preserving as much variance (information) in the data as possible. PCA transforms the original
features into a new set of orthogonal (uncorrelated) features, called principal components, which
are ordered by the amount of variance they capture from the data.
How PCA Works:
1. Standardize the Data:
o Since PCA is affected by the scale of the features, the data is often standardized (e.g.,
using z-scores) so that each feature has a mean of 0 and a standard deviation of 1. This
ensures that features with larger scales don't dominate the analysis.
2. Compute the Covariance Matrix:
o The covariance matrix represents how each pair of features in the dataset is related to
each other. A large covariance (in magnitude) means the features vary together, while a covariance
near zero means the features are largely uncorrelated.
3. Calculate Eigenvalues and Eigenvectors:
o The covariance matrix is decomposed into eigenvalues and eigenvectors.
Eigenvectors: Directions of maximum variance in the data. These represent the
principal components.
Eigenvalues: The magnitude of variance along each principal component. The
larger the eigenvalue, the more variance is explained by the corresponding
eigenvector.
4. Sort Eigenvalues and Eigenvectors:
o The eigenvectors are sorted in descending order of their eigenvalues. The eigenvector
with the highest eigenvalue becomes the first principal component (PC1), the second-
highest becomes the second principal component (PC2), and so on.
5. Select the Top k Principal Components:
o Based on the eigenvalues, select the top k eigenvectors that explain the most variance
in the data. This step reduces the dimensionality of the data.
6. Project the Data:
o The original data is projected onto the selected eigenvectors (principal components).
This step reduces the number of dimensions while retaining most of the variance.
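The NumPy sketch below mirrors steps 1-6 directly. It is a from-scratch illustration that assumes no feature has zero variance; in practice a library implementation (e.g., scikit-learn's PCA) would be used.

import numpy as np

def pca_manual(X, k):
    # Step 1: standardize each feature (mean 0, standard deviation 1).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigenvalues (variance per component) and eigenvectors (directions).
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
    # Step 4: sort both in descending order of eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Steps 5-6: keep the top k eigenvectors and project the data onto them.
    return X_std @ eigvecs[:, :k], eigvals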
Example:
Suppose you have a dataset with 3 features. After applying PCA, you could reduce it to 2
dimensions (2 principal components) while retaining most of the information from the original 3
features.
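For instance, with scikit-learn (the three correlated features below are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy dataset: 150 samples, 3 features, two of them strongly correlated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=150)
X = np.column_stack([x1, 0.9 * x1 + 0.1 * rng.normal(size=150), rng.normal(size=150)])

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                # (150, 2): 3 features reduced to 2 components
print(pca.explained_variance_ratio_)  # share of variance captured by PC1 and PC2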
Key Concepts:
Explained Variance: Measures how much information (variance) is captured by each principal
component. The first principal component captures the highest variance, the second captures
the next highest, and so on.
Dimensionality Reduction: PCA helps to reduce the number of features (dimensions) while
preserving the data’s variance, making it easier to visualize and analyze high-dimensional data.
Advantages:
Noise Reduction: By keeping only the top principal components, PCA can reduce noise and
irrelevant features.
Visualization: Reducing dimensions to 2 or 3 allows for easy visualization of data.
Improved Model Performance: Reducing the feature space may improve the performance of
machine learning models by removing correlated features.
Disadvantages:
Interpretability: The new components (PCs) are linear combinations of the original features,
which may not be easily interpretable.
Assumes Linear Relationships: PCA assumes linear relationships between features, which may
not capture complex nonlinear patterns in the data.