Unit-4: Cluster Analysis
Lecture #1: Cluster Analysis: Partitioning Methods and Hierarchical Methods
Cluster: A group of similar data objects that are distinct from objects in other groups.
Cluster Analysis: Identifies similarities in data and groups similar objects together.
Unsupervised Learning: No predefined classes; learning is based on patterns in data
rather than labeled examples.
Applications:
● Used for understanding data distribution.
● Serves as a preprocessing step for other algorithms.
Cluster Analysis Example:
1. Scenario: A retail company segments customers based on annual income and
spending behavior.
2. Dataset: Includes customer income and spending scores.
3. Choose a clustering method and the number of clusters: K-Means with K = 3 (see the sketch after this example).
i. Assign initial cluster centers.
ii. Group customers based on proximity to centroids.
iii. Recalculate centroids and repeat until stable clusters form.
4. Result:
○ Cluster 1: High-income, low-spenders.
○ Cluster 2: Low-income, high-spenders.
○ Cluster 3: Middle-income, moderate-spenders.
5. Application: Helps businesses target marketing campaigns effectively.
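The sketch below illustrates this segmentation workflow with scikit-learn's KMeans. The (income, spending score) values are made up for illustration; any small table of such pairs would work the same way.

# Sketch of the customer-segmentation example with scikit-learn's KMeans.
# The (income, spending score) values are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Columns: [annual income in $1000s, spending score 0-100]
customers = np.array([
    [85, 20], [90, 15], [78, 25],   # high income, low spending
    [20, 80], [25, 90], [18, 75],   # low income, high spending
    [50, 50], [55, 45], [48, 55],   # middle income, moderate spending
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("Cluster labels :", kmeans.labels_)
print("Cluster centers:\n", kmeans.cluster_centers_)

Each centroid summarizes one customer segment, which is what a marketing campaign would target.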
Partitioning vs. Hierarchical Methods
Cluster analysis is a technique used to group similar objects into clusters. Two primary
methods for clustering are partitioning methods and hierarchical methods.
1. Partitioning Methods: These methods divide the dataset into a predefined number
of clusters. Example: k-means clustering.
2. Hierarchical Methods: These build a tree-like structure of clusters, which can be
agglomerative (bottom-up) or divisive (top-down).
Partitioning Clustering Methods
K-Means
K-means is a popular clustering algorithm used to partition a dataset into K clusters. It
works by grouping similar data points together and minimizing the variance within each
cluster.
Example:
Let’s say we have the following 2D data points: (2,3),(3,3),(6,5),(8,8),(3,2),(5,7)
We want to cluster them into K=2 clusters.
# CHECK HOW MANY ITERATIONS IT TAKES TO CONVERGE
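To check how many iterations K-means needs on these points, the sketch below runs the assign/update loop directly. The initial centroids (2,3) and (8,8) are an arbitrary choice, since the notes do not specify them.

# Plain NumPy K-means on the six points above (K = 2).
# Initial centroids are an arbitrary choice; the notes do not fix them.
import numpy as np

X = np.array([(2, 3), (3, 3), (6, 5), (8, 8), (3, 2), (5, 7)], dtype=float)
centroids = X[[0, 3]].copy()              # start from (2,3) and (8,8)

for iteration in range(1, 101):
    # assign every point to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recompute each centroid as the mean of its assigned points
    # (assumes no cluster ends up empty, which holds for this data)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):    # converged: centroids stopped moving
        break
    centroids = new_centroids

print("Iterations to converge:", iteration)
print("Cluster labels:", labels)
print("Final centroids:\n", centroids)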
K-Medoids
K-medoids is a clustering algorithm similar to K-means, but instead of using the mean of the points in a cluster as the centroid, it uses an actual point from the dataset as the "medoid" of the cluster. The medoid is the point that minimizes the sum of distances to all other points in the cluster.
Step 1: Choose initial medoids
Select k actual data points from the dataset as the initial medoids.
Step 2: Assign each data point to the nearest medoid
Now, we assign each point to the cluster associated with the nearest medoid by calculating the distance between the data points and the medoids. We'll use the Manhattan distance for simplicity, which is the sum of the absolute differences of their coordinates.
# CHECK HOW MANY ITERATIONS IT TAKES TO CONVERGE
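As a concrete sketch of these two steps, the code below assigns points to medoids with Manhattan distance and then updates each medoid. The six points and the two initial medoids are illustrative choices (reused from the K-means example), not necessarily the values used in the worked example.

# Sketch of the K-medoids assignment and update steps with Manhattan distance.
# The data points and initial medoids are illustrative choices.
import numpy as np

X = np.array([(2, 3), (3, 3), (6, 5), (8, 8), (3, 2), (5, 7)], dtype=float)
medoid_idx = [0, 3]                       # assume (2,3) and (8,8) as initial medoids

def manhattan(a, b):
    return np.abs(a - b).sum()            # sum of absolute coordinate differences

# assignment step: attach every point to its nearest medoid
labels = np.array([min(range(len(medoid_idx)),
                       key=lambda k: manhattan(x, X[medoid_idx[k]]))
                   for x in X])

# update step: within each cluster, the point with the smallest total
# distance to the other members becomes the new medoid
for k in range(len(medoid_idx)):
    members = np.where(labels == k)[0]
    costs = [sum(manhattan(X[i], X[j]) for j in members) for i in members]
    medoid_idx[k] = int(members[np.argmin(costs)])

print("Assignments:", labels)
print("Updated medoids:", [tuple(map(float, X[i])) for i in medoid_idx])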
CLARA (Clustering Large Applications) Algorithm:
CLARA (Clustering Large Applications) is an extension of PAM (Partitioning Around
Medoids) designed to handle large datasets efficiently by working on samples instead of
the entire dataset.
Dataset: seven 2D points, P1–P7, to be grouped into k = 2 clusters.
Step 1: Initialize Parameters
● k = 2 (number of clusters).
● Sample size = 5 (choose 5 points randomly from 7).
● Number of samples = 2 (CLARA repeats the process multiple times).
Step 2: Draw a Random Sample of Points: Sample 1: {P1, P2, P4, P6, P7}
Apply PAM (Partitioning Around Medoids) on this sample to find the best k=2 medoids.
Step 3: Apply PAM on the Sample
Choose Initial Medoids
Let’s select P2 (5,4) and P6 (7,5) as initial medoids.
Step 4: Compute the Total Cost (Sum of Distances)
Cost = Sum of distances of each non-medoid to its medoid.
Step 5: Repeat for Another Sample
Since CLARA works with multiple random samples, we now select a second sample and
repeat the process.
New Sample (Sample 2)
Let's randomly pick a new subset of 5 points:
Sample 2 = {P1, P3, P5, P6, P7}
Now, we apply PAM (Partitioning Around Medoids) on this sample.
Step 6: Choose New Medoids for Sample 2: say, P3 (9,6) and P5 (8,1) as initial medoids.
Step 7: Compute the Total Cost for Sample 2
Step 8: Compare the total costs of the two samples and keep the medoids of the sample with the lower cost.
Step 9: Assign All Points to Final Medoids
Final clusters:
● Cluster 1 (P2 medoid): {P1, P2, P4}
● Cluster 2 (P6 medoid): {P3, P5, P6, P7}
❖ CLARA efficiently selects the best medoids by working with multiple samples
instead of processing the full dataset.
❖ The final clustering is determined based on the lowest cost among all samples.
❖ This approach makes CLARA scalable for large datasets compared to the costly
PAM algorithm.
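The sketch below mirrors this procedure: draw random samples, run a brute-force PAM on each sample, and keep whichever medoids give the lowest total Manhattan cost over all points. Only P2 (5,4), P3 (9,6), P5 (8,1) and P6 (7,5) are given above; the coordinates used for P1, P4 and P7 are assumptions.

# CLARA sketch: PAM on random samples, keep the lowest-cost medoids overall.
# P1, P4 and P7 coordinates are assumed; P2, P3, P5, P6 match the notes.
import itertools, random
import numpy as np

points = np.array([(1, 2),    # P1 (assumed)
                   (5, 4),    # P2
                   (9, 6),    # P3
                   (4, 5),    # P4 (assumed)
                   (8, 1),    # P5
                   (7, 5),    # P6
                   (10, 7)],  # P7 (assumed)
                  dtype=float)
k, sample_size, n_samples = 2, 5, 2

def total_cost(medoids, data):
    # sum of Manhattan distances from each point to its nearest medoid
    return sum(min(np.abs(p - m).sum() for m in medoids) for p in data)

def pam_on_sample(sample):
    # brute-force PAM: try every pair of sample points as candidate medoids
    return min(itertools.combinations(sample, k),
               key=lambda ms: total_cost(ms, sample))

random.seed(0)
best_medoids, best_cost = None, float("inf")
for _ in range(n_samples):
    idx = random.sample(range(len(points)), sample_size)
    medoids = pam_on_sample(points[idx])
    cost = total_cost(medoids, points)    # evaluate the medoids on the FULL dataset
    if cost < best_cost:
        best_medoids, best_cost = medoids, cost

print("Best medoids:", [tuple(map(float, m)) for m in best_medoids])
print("Total cost:", best_cost)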
Drawbacks of CLARA (Clustering Large Applications)
● If the sample is not representative of the entire dataset, the resulting clusters may
be suboptimal.
● Important outliers or dense regions in the full dataset may be missed in the
sampling process.
● While CLARA is more efficient than PAM, it still applies PAM to multiple samples,
making it computationally expensive.
Hierarchical Clustering:
Hierarchical clustering is a clustering method that builds a hierarchy of clusters. Unlike
k-means or k-medoids, it does not require the number of clusters (k) to be predefined.
It produces a tree-like structure called a dendrogram, which helps visualize how
clusters are merged or split.
Types of Hierarchical Clustering
1. Agglomerative (Bottom-Up)
○ Starts with each data point as its own cluster.
○ Iteratively merges the closest clusters until only one cluster remains.
○ Most commonly used.
2. Divisive (Top-Down)
○ Starts with all data points in a single cluster.
○ Iteratively splits clusters until each point is its own cluster.
○ Less common and computationally expensive.
Steps in Agglomerative Hierarchical Clustering
1. Compute Distance Matrix
○ Calculate pairwise distances between all points (e.g., using Euclidean
distance).
2. Merge Closest Clusters
○ Find the two closest clusters and merge them into one.
3. Update Distance Matrix
○ Recalculate distances between the new cluster and remaining clusters.
4. Repeat Until One Cluster Remains
○ Continue merging until all points are in a single cluster.
5. Create Dendrogram and Choose Final Clusters
○ Cut the dendrogram at a chosen threshold to determine the final number of
clusters.
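These five steps can be traced with a small script; the sketch below uses single linkage and five made-up 2D points, printing each merge and the distance at which it happens.

# Trace of the agglomerative steps (single linkage, Euclidean distance).
# The five 2D points are made up for illustration.
import numpy as np

pts = np.array([(1, 1), (2, 1), (5, 4), (6, 5), (9, 9)], dtype=float)
clusters = {i: [i] for i in range(len(pts))}     # start: every point is its own cluster

def single_linkage(a, b):
    # minimum pairwise distance between two clusters
    return min(np.linalg.norm(pts[i] - pts[j]) for i in a for j in b)

while len(clusters) > 1:
    # find the two closest clusters ...
    ka, kb = min(((x, y) for x in clusters for y in clusters if x < y),
                 key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]))
    print(f"merge {clusters[ka]} + {clusters[kb]} "
          f"at distance {single_linkage(clusters[ka], clusters[kb]):.2f}")
    # ... merge them, update, and repeat until one cluster remains
    clusters[ka] = clusters[ka] + clusters[kb]
    del clusters[kb]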
Example
Step 1: Compute the Distance Matrix
● Compute the pairwise Euclidean distances between the points P1–P5.
Step 2: Merge the Closest Clusters
● The smallest distance = 3.16 (P1 ↔ P2 and P2 ↔ P4).
● Merge P2 and P4 into a new cluster C1.
Step 3: Recalculate Distance Matrix
Use linkage methods to calculate the new cluster distance:
1. Single Linkage (minimum distance)
2. Complete Linkage (maximum distance)
3. Average Linkage (average distance)
4. Centroid Linkage (distance between centroids)
Step 4: Repeat Until All Clusters Merge
● Merge P1 and C1 (smallest distance = 3.16).
● Then merge P3 and P5 (smallest distance = 5.10).
● Continue merging until all points form a single cluster.
Dendrogram and Choosing Clusters
A dendrogram visualizes the clustering process. The height of the merge represents the
distance between clusters.
To determine the final clusters:
● Cut the dendrogram at a chosen height.
● Example: If we cut at height 4.5, we get two clusters:
○ Cluster 1: {P1, P2, P4}
○ Cluster 2: {P3, P5}
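The same workflow runs in a few lines with SciPy. The coordinates assumed below for P1–P5 are illustrative (only a few pairwise distances appear in the notes), so the resulting labels need not match the clusters above exactly; the point is the fcluster cut at height 4.5.

# SciPy sketch: linkage, dendrogram, and a cut at height 4.5.
# The P1-P5 coordinates are assumed for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

P = np.array([(1, 2), (2, 5), (8, 3), (4, 4), (9, 8)], dtype=float)   # P1..P5 (assumed)
Z = linkage(P, method='single', metric='euclidean')

labels = fcluster(Z, t=4.5, criterion='distance')   # cut the dendrogram at height 4.5
print("Cluster labels for P1..P5:", labels)

dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5'])
plt.show()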
Advantages of Hierarchical Clustering
● No need to predefine k
● Produces a dendrogram
● Works for non-spherical clusters
● Can be applied to small datasets
Drawbacks
● Computationally expensive
● Not suitable for large datasets
● Sensitive to noise and outliers
Example of Complete Linkage Clustering
# TRY: Build the single-linkage dendrogram for the same distance matrix.
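As a starting point for this exercise, the sketch below builds both the complete-linkage and single-linkage dendrograms from a distance matrix. The matrix values here are placeholders; substitute the actual distance matrix from the example above.

# Complete vs. single linkage from a precomputed distance matrix.
# The matrix below is a placeholder; replace it with the one from the example.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

D = np.array([[ 0,  2,  6, 10],
              [ 2,  0,  5,  9],
              [ 6,  5,  0,  4],
              [10,  9,  4,  0]], dtype=float)

condensed = squareform(D)            # SciPy expects the condensed upper-triangle form
for method in ('complete', 'single'):
    Z = linkage(condensed, method=method)
    plt.figure()
    plt.title(f'{method} linkage')
    dendrogram(Z)
plt.show()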
Divisive Clustering
Divisive clustering is a type of hierarchical clustering where the algorithm starts with the entire dataset as a single cluster and recursively splits it into smaller clusters until each cluster contains only one element. This process contrasts with agglomerative clustering, where the algorithm begins with individual data points as clusters and merges them together.
Example
Consider the following 6 data points, represented as 1-dimensional values: {10, 20, 30, 40, 50, 60}
Step 1: Start with the entire dataset as a single cluster.
● Initially, all the data points { 10, 20, 30, 40, 50, 60 } are grouped together in one
cluster.
Step 2: Calculate the "best" way to split the cluster.
In this case, let’s split by using the median value of the data.
For the dataset { 10, 20, 30, 40, 50, 60 }, the median value is 35, which divides the
data into:
● Cluster 1: { 10, 20, 30 }
● Cluster 2: { 40, 50, 60 }
Step 3: Recursively divide the subclusters.
Let’s first split { 10, 20, 30 }:
● The median of { 10, 20, 30 } is 20.
● This results in two subclusters:
○ Cluster 1A: { 10 }
○ Cluster 1B: { 20, 30 }
Now, split { 40, 50, 60 }:
● The median of { 40, 50, 60 } is 50.
● This results in two subclusters:
○ Cluster 2A: { 40 }
○ Cluster 2B: { 50, 60 }
Step 4: Keep splitting until each cluster has only one element.
● We continue to recursively split clusters until each cluster contains a single data
point. Let’s split { 20, 30 } and { 50, 60 }.
● The median of { 20, 30 } is 25:
○ Cluster 1B1: { 20 }
○ Cluster 1B2: { 30 }
● The median of { 50, 60 } is 55:
○ Cluster 2B1: { 50 }
○ Cluster 2B2: { 60 }
Now all the clusters have just one data point.
Step 5: The final result.
● The final result of divisive clustering would look like this:
○ Cluster 1A: { 10 }
○ Cluster 1B1: { 20 }
○ Cluster 1B2: { 30 }
○ Cluster 2A: { 40 }
○ Cluster 2B1: { 50 }
○ Cluster 2B2: { 60 }
# TRY: Using divisive clustering, cluster the following 2D points: {(2,3), (3,3), (6,6), (8,8), (10,8), (11,9)}
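A minimal recursive sketch of the median-split rule used in this example is shown below (the printed cluster names differ from the labels above). For the 2D TRY points, one additional choice is needed that the notes do not specify, for example splitting on the coordinate with the larger spread.

# Recursive divisive clustering by median split, as in the 1-D example above.
# Assumes distinct values; each cluster splits into {x < median} and {x >= median}.
import statistics

def divisive_split(cluster, name="C"):
    if len(cluster) <= 1:                 # stop once a cluster has a single element
        print(f"Cluster {name}: {cluster}")
        return
    median = statistics.median(cluster)
    left = [x for x in cluster if x < median]
    right = [x for x in cluster if x >= median]
    divisive_split(left, name + "A")
    divisive_split(right, name + "B")

divisive_split([10, 20, 30, 40, 50, 60])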