What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Clustering is used:
– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Some Applications of Clustering
• Pattern Recognition
• Image Processing
– cluster images based on their visual content
• Bio-informatics
• WWW and IR
– document classification
– cluster Weblog data to discover groups of similar access patterns
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Discovered Clusters → Industry Group:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
• Summarization
– Reduce the size of large data sets
[Figure: clustering of precipitation in Australia]
What is not Cluster Analysis?
• Supervised classification
– Have class label information
• Simple segmentation
– Dividing students into different registration groups alphabetically,
by last name
• Results of a query
– Groupings are a result of an external specification
Notion of a Cluster can be Ambiguous
How many clusters? [Figure: the same set of points grouped into two, four, or six clusters]
What Is Good Clustering?
• A good clustering method will produce high quality clusters
with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its implementation
(i.e. algorithms used).
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Major Clustering Approaches
• Partitioning algorithms: Construct random partitions and then iteratively
refine them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of
data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the hypothesized models
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters
– k-means (MacQueen’67): Each cluster is represented by the center (mean) of the cluster (the center need not be one of the objects in the cluster)
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
K-means Clustering
• Partition clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
K-Means Algorithm
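The algorithm listing on the original slide is not reproduced here. As a substitute, the following is a minimal sketch of the basic K-means (Lloyd's) procedure just described, assuming NumPy, Euclidean distance, and initial centroids drawn at random from the data; it is an illustration, not the textbook's exact pseudocode.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids (here, k distinct data points at random)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (for simplicity this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```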
K-means Clustering – Details
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
• K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
• K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
• K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• Stop, as the clusters obtained with these means are unchanged.
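The trace above can be reproduced with a short script; this is a sketch assuming 1-D data, the two initial means 3 and 4 from the example, and the usual assign/recompute loop.

```python
import numpy as np

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
means = np.array([3.0, 4.0])          # initial means m1 = 3, m2 = 4

while True:
    # Assign each point to the closer of the two current means
    labels = np.abs(X[:, None] - means[None, :]).argmin(axis=1)
    # Recompute the mean of each cluster
    new_means = np.array([X[labels == 0].mean(), X[labels == 1].mean()])
    print(sorted(X[labels == 0]), sorted(X[labels == 1]), new_means)
    if np.array_equal(new_means, means):  # clusters (and means) unchanged: stop
        break
    means = new_means
```

Each printed line corresponds to one step of the example above, ending with K1 = {2,3,4,10,11,12} and K2 = {20,30,25}.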
Evaluating K-means Clusters
– For each point, the error is its distance to the centroid of its assigned cluster
– To get the SSE (sum of squared error), we square these errors and sum them:
$$ SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x) $$
– x is a data point in cluster Ci and mi is the representative point for
cluster Ci
• one can show that the best mi corresponds to the center (mean) of the cluster
– Given two clusterings, we can choose the one with the smallest error
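As an illustration of the SSE criterion, a minimal sketch (assuming X is an (n, d) NumPy array, labels holds the cluster index of each point, and centroids is the (K, d) array of cluster means, e.g. the output of the kmeans() sketch above):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    diffs = X - centroids[labels]      # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))
```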
Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-spherical shapes
• The resulting clusters depend on the initial centroids chosen.
• K-means has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
[Figure: original points and the K-means result with 3 clusters]
Limitations of K-means: Differing Density
[Figure: original points and the K-means result with 3 clusters]
Limitations of K-means: Non-globular Shapes
[Figure: original points and the K-means result with 2 clusters]
Importance of Choosing Initial Centroids
[Figure: K-means cluster assignments over iterations 1–6 for one choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figure: K-means cluster assignments over iterations 1–5 for a different choice of initial centroids]
Solutions to Initial Centroids Problem
• Multiple runs
– For each run, compute the SSE and keep the run with the minimum SSE (see the sketch after this list).
• Sample and use hierarchical clustering to determine initial
centroids.
• Take the centroid of all the points as the first centroid; then repeatedly select as the next centroid the point farthest from the centroids chosen so far.
• Select more than k initial centroids and then select among these initial centroids
– Select the k most widely separated
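A minimal sketch of the multiple-runs strategy from the first bullet, assuming scikit-learn is available (its KMeans exposes the SSE as the inertia_ attribute); the data set X here is only a stand-in for a real one.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))              # toy stand-in data set

best = None
for seed in range(10):                     # 10 runs with different random centroids
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:   # inertia_ = SSE of this run
        best = km

print("Lowest SSE over 10 runs:", best.inertia_)
```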
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Kaufmann & Rousseeuw 1987) [Partitioning Around
Medoids]
– starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
– PAM works effectively for small data sets, but does not scale well for
large data sets
• CLARA (Kaufmann & Rousseeuw, 1990) [Clustering LARge Applications]
• CLARANS (Ng & Han, 1994): Randomized sampling [Clustering
Large Applications based upon RANdomized Search]
The K-Medoids Clustering Method
PAM (Kaufman and Rousseeuw, 1987)
• Arbitrarily select k objects as the initial medoids
• Assign each data object in the data set to the most similar medoid
• Randomly select a non-medoid object O’
• Compute the total cost, S, of swapping one of the medoids with O’ (cost measured as the change in the total sum of absolute error)
• If S < 0, then swap that medoid with O’
• Repeat until there is no change in the medoids
– Each iteration involves pair-wise comparisons between the k medoids and the (n − k) non-medoid instances
PAM Clustering: Total swapping cost $TC_{ih} = \sum_{j} C_{jih}$
• i is a current medoid, h is a non-selected object
• Assume that i is replaced by h in the set of
medoids
• TCih = 0;
• For each non-selected object j ≠ h:
– TCih += d(j, new_medj) − d(j, prev_medj), where
• new_medj = the closest medoid to j after i is replaced by h
• prev_medj = the closest medoid to j before i is replaced by h
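A minimal sketch of this computation, assuming a precomputed (n, n) NumPy distance matrix D (with D[a, b] = d(a, b)), a list medoids of current medoid indices containing i, and a non-medoid index h; these names are illustrative, not from the original slides.

```python
import numpy as np

def total_swap_cost(D, medoids, i, h):
    """TCih: total change in cost if medoid i is replaced by non-medoid h."""
    new_medoids = [m for m in medoids if m != i] + [h]
    tc = 0.0
    for j in range(len(D)):
        if j in medoids or j == h:                           # only non-selected objects j != h
            continue
        prev_med = min(medoids, key=lambda m: D[j, m])       # closest medoid before the swap
        new_med = min(new_medoids, key=lambda m: D[j, m])    # closest medoid after the swap
        tc += D[j, new_med] - D[j, prev_med]                 # Cjih
    return tc
```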
PAM Clustering: Total swapping cost $TC_{ih} = \sum_{j} C_{jih}$
[Figure: the four cases for a non-selected object j when medoid i is replaced by h; t denotes another current medoid]
• j is reassigned from i to h: $C_{jih} = d(j, h) - d(j, i)$
• j belongs to another medoid t and is unaffected by the swap: $C_{jih} = 0$
• j is reassigned from i to another medoid t: $C_{jih} = d(j, t) - d(j, i)$
• j is reassigned from another medoid t to h: $C_{jih} = d(j, h) - d(j, t)$
PAM
• Partitioning Around Medoids (PAM) (K-
Medoids)
• Handles outliers well.
• Ordering of input does not impact results.
• Does not scale well.
• Each cluster represented by one item, called
the medoid.
• Initial set of k medoids randomly chosen.
PAM Cost Calculation
• At each step in algorithm, medoids are changed if
the overall cost is improved.
• Cjih – cost change for an item tj associated with
swapping medoid ti with non-medoid th.
PAM Algorithm
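The algorithm figure from the original slide is not reproduced here. Below is a minimal sketch of the PAM procedure summarized earlier (arbitrary initial medoids, then repeated cost-improving swaps), again assuming a precomputed (n, n) distance matrix D; it is illustrative rather than the textbook's exact pseudocode.

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM sketch: D is an (n, n) distance matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))      # arbitrary initial medoids

    def total_cost(meds):
        # Total absolute error: each object's distance to its closest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:                                            # repeat until no change
        improved = False
        for i in list(medoids):                                # try replacing each medoid i
            for h in range(n):                                 # ... by each non-medoid h
                if h in medoids:
                    continue
                candidate = [m for m in medoids if m != i] + [h]
                if total_cost(candidate) < total_cost(medoids):    # swap cost S < 0
                    medoids = candidate
                    improved = True
                    break                                      # i is no longer a medoid
    labels = D[:, medoids].argmin(axis=1)                      # final cluster assignment
    return medoids, labels
```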