What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Clustering is used:
– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Some Applications of Clustering
• Pattern Recognition
• Image Processing
– cluster images based on their visual content
• Bio-informatics
• WWW and IR
– document classification
– cluster Weblog data to discover groups of similar access patterns
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Discovered Clusters → Industry Group:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
• Summarization
– Reduce the size of large data sets
[Figure: clustering of precipitation in Australia]
What is not Cluster Analysis?
• Supervised classification
– Have class label information
• Simple segmentation
– Dividing students into different registration groups alphabetically,
by last name
• Results of a query
– Groupings are a result of an external specification
Notion of a Cluster can be Ambiguous
How many clusters? [Figure: the same set of points grouped into two, four, or six clusters]
What Is Good Clustering?
• A good clustering method will produce high quality clusters
with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its implementation
(i.e. algorithms used).
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Major Clustering Approaches
• Partitioning algorithms: Construct random partitions and then iteratively
refine them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of
data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the hypothesized models
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters
– k-means (MacQueen’67): Each cluster is represented by the center (mean) of the cluster (the center need not be one of the objects in the cluster)
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
K-means Clustering
• Partition clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
K-Means Algorithm
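The algorithm listing on the original slide is not reproduced here. As a substitute, the following is a minimal sketch of the basic K-means (Lloyd's) procedure just described, assuming NumPy, Euclidean distance, and initial centroids drawn at random from the data; it is an illustration, not the textbook's exact pseudocode.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids (here, k distinct data points at random)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (for simplicity this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```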
K-means Clustering – Details
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
• K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
• K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
• K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• Stop, as the clusters obtained with these means are unchanged.
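The trace above can be reproduced with a short script; this is a sketch assuming 1-D data, the two initial means 3 and 4 from the example, and the usual assign/recompute loop.

```python
import numpy as np

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
means = np.array([3.0, 4.0])          # initial means m1 = 3, m2 = 4

while True:
    # Assign each point to the closer of the two current means
    labels = np.abs(X[:, None] - means[None, :]).argmin(axis=1)
    # Recompute the mean of each cluster
    new_means = np.array([X[labels == 0].mean(), X[labels == 1].mean()])
    print(sorted(X[labels == 0]), sorted(X[labels == 1]), new_means)
    if np.array_equal(new_means, means):  # clusters (and means) unchanged: stop
        break
    means = new_means
```

Each printed line corresponds to one step of the example above, ending with K1 = {2,3,4,10,11,12} and K2 = {20,30,25}.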
Evaluating K-means Clusters
– For each point, the error is its distance to the centroid of its assigned cluster
– To get the SSE (sum of squared error), we square these errors and sum them:
$$ SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x) $$
– x is a data point in cluster Ci and mi is the representative point for
cluster Ci
• one can show that the best mi corresponds to the center (mean) of the cluster
– Given two clusterings, we can choose the one with the smallest error
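As an illustration of the SSE criterion, a minimal sketch (assuming X is an (n, d) NumPy array, labels holds the cluster index of each point, and centroids is the (K, d) array of cluster means, e.g. the output of the kmeans() sketch above):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    diffs = X - centroids[labels]      # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))
```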
Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-spherical shapes
• The resulting clusters depend on the initial centroids chosen.
• K-means has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
[Figure: original points and the K-means result with 3 clusters]
Limitations of K-means: Differing Density
[Figure: original points and the K-means result with 3 clusters]
Limitations of K-means: Non-globular Shapes
[Figure: original points and the K-means result with 2 clusters]
Importance of Choosing Initial Centroids
[Figure: K-means cluster assignments over iterations 1–6 for one choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figure: K-means cluster assignments over iterations 1–5 for a different choice of initial centroids]
Solutions to Initial Centroids Problem
• Multiple runs
– For each run, compute the SSE and keep the run with the minimum SSE (see the sketch after this list).
• Sample and use hierarchical clustering to determine initial
centroids.
• Take the centroid of all the points as the first centroid; then repeatedly select as the next centroid the point farthest from the centroids chosen so far.
• Select more than k initial centroids and then select among these initial centroids
– Select the k most widely separated
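A minimal sketch of the multiple-runs strategy from the first bullet, assuming scikit-learn is available (its KMeans exposes the SSE as the inertia_ attribute); the data set X here is only a stand-in for a real one.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))              # toy stand-in data set

best = None
for seed in range(10):                     # 10 runs with different random centroids
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:   # inertia_ = SSE of this run
        best = km

print("Lowest SSE over 10 runs:", best.inertia_)
```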
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Kaufmann & Rousseeuw 1987) [Partitioning Around
Medoids]
– starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
– PAM works effectively for small data sets, but does not scale well for
large data sets
• CLARA (Kaufmann & Rousseeuw, 1990) [Clustering LARge Applications]
• CLARANS (Ng & Han, 1994): Randomized sampling [Clustering
Large Applications based upon RANdomized Search]
The K-Medoids Clustering Method
PAM (Kaufman and Rousseeuw, 1987)
• Arbitrarily select k objects as the initial medoids
• Assign each data object in the data set to the most similar medoid
• Randomly select a non-medoid object O’
• Compute the total cost, S, of swapping one of the medoids with O’ (cost measured as the change in the total sum of absolute error)
• If S < 0, then swap that medoid with O’
• Repeat until there is no change in the medoids
– Each iteration involves pair-wise comparisons between the k medoids and the (n − k) non-medoid instances
PAM Clustering: Total swapping cost $TC_{ih} = \sum_{j} C_{jih}$
• i is a current medoid, h is a non-selected object
• Assume that i is replaced by h in the set of
medoids
• TCih = 0;
• For each non-selected object j ≠ h:
– TCih += d(j, new_medj) − d(j, prev_medj), where
• new_medj = the closest medoid to j after i is replaced by h
• prev_medj = the closest medoid to j before i is replaced by h
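A minimal sketch of this computation, assuming a precomputed (n, n) NumPy distance matrix D (with D[a, b] = d(a, b)), a list medoids of current medoid indices containing i, and a non-medoid index h; these names are illustrative, not from the original slides.

```python
import numpy as np

def total_swap_cost(D, medoids, i, h):
    """TCih: total change in cost if medoid i is replaced by non-medoid h."""
    new_medoids = [m for m in medoids if m != i] + [h]
    tc = 0.0
    for j in range(len(D)):
        if j in medoids or j == h:                           # only non-selected objects j != h
            continue
        prev_med = min(medoids, key=lambda m: D[j, m])       # closest medoid before the swap
        new_med = min(new_medoids, key=lambda m: D[j, m])    # closest medoid after the swap
        tc += D[j, new_med] - D[j, prev_med]                 # Cjih
    return tc
```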
PAM Clustering: Total swapping cost $TC_{ih} = \sum_{j} C_{jih}$
[Figure: the four cases for a non-selected object j when medoid i is replaced by h; t denotes another current medoid]
• j is reassigned from i to h: $C_{jih} = d(j, h) - d(j, i)$
• j belongs to another medoid t and is unaffected by the swap: $C_{jih} = 0$
• j is reassigned from i to another medoid t: $C_{jih} = d(j, t) - d(j, i)$
• j is reassigned from another medoid t to h: $C_{jih} = d(j, h) - d(j, t)$
PAM
• Partitioning Around Medoids (PAM) (K-
Medoids)
• Handles outliers well.
• Ordering of input does not impact results.
• Does not scale well.
• Each cluster represented by one item, called
the medoid.
• Initial set of k medoids randomly chosen.
PAM Cost Calculation
• At each step in algorithm, medoids are changed if
the overall cost is improved.
• Cjih – cost change for an item tj associated with
swapping medoid ti with non-medoid th.
PAM Algorithm
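The algorithm figure from the original slide is not reproduced here. Below is a minimal sketch of the PAM procedure summarized earlier (arbitrary initial medoids, then repeated cost-improving swaps), again assuming a precomputed (n, n) distance matrix D; it is illustrative rather than the textbook's exact pseudocode.

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM sketch: D is an (n, n) distance matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))      # arbitrary initial medoids

    def total_cost(meds):
        # Total absolute error: each object's distance to its closest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:                                            # repeat until no change
        improved = False
        for i in list(medoids):                                # try replacing each medoid i
            for h in range(n):                                 # ... by each non-medoid h
                if h in medoids:
                    continue
                candidate = [m for m in medoids if m != i] + [h]
                if total_cost(candidate) < total_cost(medoids):    # swap cost S < 0
                    medoids = candidate
                    improved = True
                    break                                      # i is no longer a medoid
    labels = D[:, medoids].argmin(axis=1)                      # final cluster assignment
    return medoids, labels
```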