Cluster Analysis Concepts & Algorithms
Cluster Analysis Concepts & Algorithms
and Algorithms
▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
What is Cluster Analysis?
▪ Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups.
Inter-cluster
Intra-cluster distances are
distances are maximized
minimized
▪ Understanding 2
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
—Group related documents Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Technology2-DOWN
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
for browsing, group genes Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
and proteins that have
similar functionality, or
3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
group stocks with similar
price fluctuations
4 Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP
Oil-UP
▪ Summarization
—Reduce the size of large
data sets
Clustering precipitation
in Australia
What is not Cluster Analysis?
▪ Supervised classification
—Uses class label information
▪ Simple segmentation
—Dividing students into different registration groups alphabetically, by last
name
▪ Results of a query
—Groupings are a result of an external specification
▪ How do we measure
similarity/proximity/dissimilarity/distance?
▪ Examples
—Minkovsky distance: Manhattan distance, Euclidean Distance, etc.
—Jaccard index for binary data
—Gower's distance for mixed data (ratio/interval and nominal)
—Correlation coefficient as similarity between variables
Notion of a Cluster can be Ambiguous
▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Types of Clusterings
Exclusive versus non- Fuzzy versus non-fuzzy Partial versus complete Heterogeneous versus
exclusive homogeneous
In non-exclusive clusterings, In fuzzy clustering, a point In some cases, we only want Cluster of widely different
points may belong to multiple belongs to every cluster with to cluster some of the data sizes, shapes, and densities
clusters. some membership weight
between 0 and 1
Membership weights must
sum to 1
Probabilistic clustering has
similar characteristics
Topics
▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Types of Clusters
Center-
Contiguous
based
clusters
clusters
Density-
Conceptual
based
clusters
clusters
Center-based Clusters
Not well separated
Well separated (overlapping)
Cluster centers
A cluster is a set of objects such that an object in a cluster is closer (more
similar) to the “center” of a cluster, than to the center of any other cluster
The center of a cluster is often a centroid, the average of all the points in the
cluster, or a medoid, the most “representative” point of a cluster
Contiguous and Density-based Clusters
High density
Conceptual Clusters
Conceptual clusters are hard to detect since they are often not:
▪ Center-based
▪ Contiguity-based
▪ Density-based
Objective Functions
𝑆𝑆𝐸 = 𝒙 − 𝒎𝑖 2
𝑖=1 𝒙∈𝐶𝑖
𝒙 is a data point in cluster 𝐶𝑖 , 𝒎𝑖 is the center for cluster 𝐶𝑖 as the mean of all
points in the cluster and ⋅ is the L2 norm (= Euclidean distance).
We will talk about the objective functions when we talk about individual
clustering algorithms.
Topics
▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
K-means Clustering
▪ Complexity is 𝑂(𝑛 𝐾 𝐼 𝑑 )
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
K-Means Example
Iteration 1 Iteration 2 Iteration 3
3 3 3
2 2 2
y
1 1 1
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
2 2 2
y
1 1 1
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
2.5 2.5
2 2
1.5 1.5
y
y
1 1
0.5 0.5
0 0
2 2 2
y
1 1 1
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
Solutions to Initial Centroids Problem
▪ Select more than k initial centroids and then select among these
initial centroids the ones that are far away from each other.
Evaluating K-means Clusters
𝑆𝑆𝐸 = 𝒙 − 𝒎𝑖 2
𝑖=1 𝒙∈𝐶𝑖
▪ Pre-processing
—Normalize the data (e.g., scale to unit standard deviation)
—Eliminate outliers
▪ Post-processing
—Eliminate small clusters that may represent outliers
—Split ‘loose’ clusters, i.e., clusters with relatively high SSE
—Merge clusters that are ‘close’ and that have relatively low SSE
Limitations of K-means
Use a larger
number of clusters
Several clusters
represent a true
cluster
Overcoming K-means Limitations
Use a larger
number of clusters
Several clusters
represent a true
cluster
Topics
▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Hierarchical Clustering
Dendrogram
6 5
4 distance
3 4 0.2
2
5
2 0.15
1 0.1
3 1
0.05
0
1 3 2 5 4 6
Data points
Strengths of Hierarchical Clustering
—Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)
p4
p5
.
.
. Proximity Matrix
...
x
Dendrogram
Intermediate Situation
C1 C2 C3 C4 C5
▪ After some merging steps, we
C1
have some clusters
y C2
C3
C3
C4
C4
C5
Proximity Matrix
C1
C2 C5
...
x C2 C5 C3
Dendrogram
Intermediate Situation
▪ We want to merge the two closest C1 C2 C3 C4 C5
clusters (C2 and C5) and update the C1
y proximity matrix.
C2
C3
C3
C4
C4
C5
Proximity Matrix
C1
C2 C5
...
x C2 C5 C3
Dendrogram
After Merging
C2
▪ The question is “How do we update C1
U
C5 C3 C4
the proximity matrix?”
y C1 ?
C2 U C5 ? ? ? ?
C3
C3 ?
C4
C4 ?
Proximity Matrix
C1
C2 U C5
...
x C2 C5 C3
Dendrogram
How to Define Inter-Cluster Similarity
p1 p2 p3 p4 p5 ...
p1
Similarity? p2
p3
p4
p5
.
▪ MIN .
▪ MAX
.
▪ Group Average Proximity Matrix
▪ Distance Between Centroids
▪ Other methods driven by an objective function
Ward’s Method uses squared error
How to Define Inter-Cluster Similarity
p1 p2 p3 p4 p5 ...
p1
p2
p3
p4
p5
▪ MIN (Single Link)
.
▪ MAX (Complete Link)
.
▪ Group Average .
▪ Distance Between Centroids
Proximity Matrix
p2
p3
p4
p5
▪ MIN (Single Link)
.
▪ MAX (Complete Link)
.
▪ Group Average .
▪ Distance Between Centroids
Proximity Matrix
p2
p3
p4
p5
▪ MIN (Single Link)
.
▪ MAX (Complete Link)
.
▪ Group Average (Average Link)
.
▪ Distance Between Centroids Proximity Matrix
▪ Other methods driven by an objective function
Ward’s Method uses squared error
How to Define Inter-Cluster Similarity
p1 p2 p3 p4 p5 ...
p1
p2
p3
p4
p5
.
▪ MIN (Single Link) .
▪ MAX (Complete Link) .
Proximity Matrix
▪ Group Average
▪ Distance Between Centroids
▪ Other methods driven by an objective function
Ward’s Method uses squared error
Single Link
▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
DBSCAN
Density = 7 points
MinPts = 5
▪ Resistant to Noise
▪ Can handle clusters of different shapes and sizes
▪ Eps and MinPts depend on each other and can be hard to specify
When DBSCAN Does
NOT Work Well
(MinPts=4, Eps=9.75).
Original Points
▪ Varying densities
▪ High-dimensional data
(MinPts=4, Eps=9.92)
DBSCAN: Determining EPS and MinPts
▪ Idea is that for points in a cluster, their kth nearest neighbors are at
roughly the same distance
▪ Noise points have the kth nearest neighbor at farther distance
▪ So, plot sorted distance of every point to its kth nearest neighbor
MinPts = k
Eps
Some Other Clustering Algorithms
▪ Introduction
▪ Types of Clustering
▪ Types of Clusters
▪ Clustering Algorithms
—K-Means Clustering
—Hierarchical Clustering
—Density-based Clustering
▪ Cluster Validation
Cluster Validity
0.9 0.9
0.8 0.8
0.7 0.7
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
1 1
0.9 0.9
0.5 0.5
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
▪ Order the similarity matrix with respect to cluster labels and inspect
visually.
1
1
10 0.9
0.9
20 0.8
0.8
30 0.7
0.7
40 0.6
0.6
Points
50 0.5
0.5
y
60 0.4
0.4
70 0.3
0.3
80 0.2
0.2
90 0.1
0.1
100 0
0 20 40 60 80 100 Similarity
0 0.2 0.4 0.6 0.8 1
Points
x
Similarity Matrix Visualization for Cluster Validation
▪ Clusters in random data are not as crisp
1
1
0.9
10
DBSCAN 0.9
20 0.8
0.8
30 0.7
0.7
40 0.6
0.6
Points
50 0.5
0.5
y
60 0.4
0.4
70 0.3
0.3
80 0.2
0.2
90 0.1
0.1
100 0
0 20 40 60 80 100 Similarity
0 0.2 0.4 0.6 0.8 1
Points
x
1 1
10 k-means 0.9 10
Complete L. HC 0.9
20 0.8 20 0.8
30 0.7 30 0.7
40 0.6 40 0.6
Points
Points
50 0.5 50 0.5
60 0.4 60 0.4
70 0.3 70 0.3
80 0.2 80 0.2
90 0.1 90 0.1
100 0 100 0
20 40 60 80 100 Similarity 20 40 60 80 100 Similarity
Points Points
Measuring Cluster Validity Via Correlation
▪ Two matrices
—Proximity Matrix representing the data
—Incidence Matrix representing the clustering
• One row and one column for each data point
• An entry is 1 if the associated pair of points belong to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
▪ Compute the correlation between the two matrices
▪ High correlation indicates that points that belong to the same cluster
are close to each other.
▪ Not a good measure for some density or contiguity-based clusters
(e.g., single link HC).
Measuring Cluster Validity Via Correlation
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
y
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
Note: Correlation is always negative between distance matrix and incidence matrix
Internal Measures: Cohesion and Separation
cohesion separation
Internal Measures: Sum of Squares
▪ Cluster Cohesion: Within cluster sum of squares (WSS=SSE) 𝐶𝑖
𝐾
2
𝑊𝑆𝑆 = 𝒙 − 𝒎𝑖 𝒎𝑖
𝑖=1 𝒙∈𝐶𝑖
2
▪ Total sum of squares: 𝑇𝑆𝑆 = σ𝒙 𝒙 − 𝒎
Look for
Data points forming 10 clusters the knee
y
x
Internal Measures: Silhouette Coefficient
cohesion separation
Internal Measures: Silhouette Coefficient
▪ Silhouette Coefficient combine ideas of both cohesion and separation, but
for individual points. For an individual point i:
—Calculate a(i) = average dissimilarity of i to all other points in its cluster
—Calculate b(i) = lowest average dissimilarity of i to any other)
a(i) i
b(i)
Cohersion Separation