Chapter 10. Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
What is not Cluster Analysis?
Supervised classification: uses class label information
Simple segmentation: Dividing students into different
registration groups alphabetically, by last name.
Results of a query: Groupings are a result of an
external specification.
Graph partitioning: Some mutual relevance and
synergy, but areas are not identical
Applications of Clustering
Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth observation
database
Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered
along continent faults
Climate: understanding earth climate, finding patterns in atmospheric and ocean data
Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, PCA, classification, and
association analysis
Compression:
Image processing
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
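To make the last point concrete, here is a minimal sketch (not from the slides) of flagging outliers by their distance to the nearest cluster centroid. It assumes scikit-learn and NumPy are available; the 95th-percentile cutoff is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # one dense cluster
               rng.normal(8, 1, (100, 2)),   # another dense cluster
               [[20.0, 20.0]]])              # a point far from both

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance from each point to its nearest centroid
d = np.min(km.transform(X), axis=1)
# Flag the points farthest from any centroid (arbitrary 95% threshold)
outliers = X[d > np.quantile(d, 0.95)]
print(len(outliers), "candidate outliers")
```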
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters with
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, Boolean, categorical, ordinal,
ratio, and vector variables
Weights should be associated with different variables based
on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures
the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
Model-based:
A model is hypothesized for each of the clusters; the aim is to find the best
fit of the data to the given model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus
Chapter 10. Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Partitioning Algorithms: Basic Concept
Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is the
centroid or medoid of cluster Ci):

$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$   (see the sketch below for computing E)
Given k, find a partition of k clusters that optimizes the chosen partitioning
criterion
Global optimum: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the
center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in the
cluster
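As referenced above, a minimal NumPy sketch of evaluating the criterion E; the helper name sse is ours.

```python
import numpy as np

def sse(X, labels, centroids):
    """E = sum over clusters i, sum over p in C_i, of ||p - c_i||^2."""
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))
```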
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four
steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the clusters of
the current partitioning (the centroid is the center, i.e.,
mean point, of the cluster)
Assign each object to the cluster with the nearest seed
point
Go back to Step 2, stop when the assignment does not
change
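A from-scratch sketch of these four steps, assuming NumPy; empty clusters and other edge cases are deliberately not handled here.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))   # Step 1: random initial partition
    centroids = None
    for _ in range(max_iter):
        # Step 2: seed points = centroid (mean point) of each current cluster
        centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # Step 4: stop when nothing changes
            break
        labels = new_labels
    return labels, centroids

# Usage on two synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = k_means(X, k=2)
```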
An Example of K-Means Clustering
[Figure: with K = 2, arbitrarily partition the objects of the initial data set into k groups, update the cluster centroids, reassign the objects, and loop if needed]

Partition objects into k nonempty subsets
Repeat
Compute the centroid (i.e., mean point) of each partition
Assign each object to the cluster of its nearest centroid
Until no change
Comments on the K-Means Method
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations; normally, k, t << n
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
Comment: Often terminates at a local optimum
Weakness
Applicable only to objects in a continuous n-dimensional space
Use the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of data
Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009)
Sensitive to noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
Most variants of k-means differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
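A hypothetical sketch of the k-modes centroid update described above: each cluster is represented by the per-attribute mode rather than the mean (NumPy assumed; the function name is ours).

```python
import numpy as np

def update_mode(cluster):
    """cluster: 2-D array of categorical codes, one row per object."""
    modes = []
    for col in cluster.T:                          # one attribute at a time
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[counts.argmax()])      # most frequent category
    return np.array(modes)
```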
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers!
Since an object with an extremely large value may substantially distort the
distribution of the data
Chapter 10. Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does
not require the number of clusters k as an input, but needs a
termination condition
[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a and b into ab, d and e into de, then cde, and finally abcde; divisive clustering (DIANA) runs the same steps in reverse order]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three scatter plots (both axes 0–10) showing clusters progressively merged by the single-link method]
Dendrogram: Shows How Clusters Are Merged
Decomposes the data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting
the dendrogram at the desired level: each connected
component then forms a cluster
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
[Figure: three scatter plots (both axes 0–10) showing one cluster progressively split into smaller clusters]
Distance between Clusters
Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
Average: average distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) =
dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) =
dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
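A small sketch of the first four measures for two clusters held as NumPy arrays; SciPy's cdist is assumed for the pairwise distances, and the dictionary keys are our own labels.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distances(Ki, Kj):
    D = cdist(Ki, Kj)                    # all pairwise distances between clusters
    return {
        "single":   D.min(),             # smallest pairwise distance
        "complete": D.max(),             # largest pairwise distance
        "average":  D.mean(),            # mean pairwise distance
        "centroid": np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)),
    }
```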
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the “middle” of a cluster: $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
Radius: square root of the average distance from any point of the
cluster to its centroid: $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
Diameter: square root of the average mean squared distance
between all pairs of points in the cluster: $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
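A direct NumPy transcription of the radius and diameter formulas above; the function names are ours.

```python
import numpy as np

def radius(cluster):
    c = cluster.mean(axis=0)                              # centroid
    return np.sqrt(np.mean(np.sum((cluster - c) ** 2, axis=1)))

def diameter(cluster):
    n = len(cluster)
    diff = cluster[:, None, :] - cluster[None, :, :]      # all pairs of points
    sq = np.sum(diff ** 2, axis=2)                        # squared distances
    return np.sqrt(sq.sum() / (n * (n - 1)))              # diagonal is zero
```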
Exercise: Perform agglomerative clustering on the following distance matrix using the single-linkage method and show the dendrogram.

      P1   P2   P3   P4   P5
P1     0
P2     9    0
P3     3    7    0
P4     6    5    9    0
P5    11   10    2    8    0
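One way to check the exercise (not part of the original slide): feed the matrix to SciPy's single-linkage routine and plot the dendrogram, assuming SciPy and matplotlib are installed.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

Z = linkage(squareform(D), method="single")  # condensed matrix -> merge steps
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5"])
plt.show()
```

Single link should merge P3 and P5 at distance 2, absorb P1 at 3, merge P2 and P4 at 5, and join the two groups at 6.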
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²), where n is
the number of total objects
Integration of hierarchical & distance-based clustering
BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
CHAMELEON (1999): hierarchical clustering using dynamic
modeling
Chapter 10. Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Evaluation of Clustering
Summary
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester et al. (KDD’96)
OPTICS: Ankerst et al. (SIGMOD’99)
DENCLUE: Hinneburg & Keim (KDD’98)
CLIQUE: Agrawal et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly density-reachable
from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q)
core point condition: |NEps(q)| ≥ MinPts
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts
if there is a chain of points p1, …, pn, with p1 = q and pn = p,
such that pi+1 is directly density-reachable from pi
[Figure: q reaches p via an intermediate point p1]
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts
if there is a point o such that both p and q are
density-reachable from o w.r.t. Eps and MinPts
[Figure: p and q are both density-reachable from o]
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border, and outlier (noise) points of a cluster, with Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
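A compact from-scratch sketch of this loop, assuming NumPy. It computes the full distance matrix up front, so it uses O(n²) memory and is meant only to illustrate how density-reachable points are expanded; label 0 marks unassigned points, which remain noise at the end.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.zeros(n, dtype=int)            # 0 = unassigned / noise
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = 0
    for p in range(n):
        if labels[p] != 0 or len(neighbors[p]) < min_pts:
            continue                           # skip assigned points and non-cores
        cluster += 1                           # p is an unassigned core point
        labels[p] = cluster
        frontier = list(neighbors[p])          # points density-reachable from p
        while frontier:
            q = frontier.pop()
            if labels[q] == 0:
                labels[q] = cluster
                if len(neighbors[q]) >= min_pts:
                    frontier.extend(neighbors[q])  # q is a core: keep expanding
                # border points get the label but are not expanded
    return labels
```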
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Evaluation of Clustering
Summary
Assessing Clustering Tendency
Assess whether non-random structure exists in the data by measuring the probability
that the data is generated by a uniform data distribution
Test spatial randomness with a statistical test: the Hopkins statistic
Given a dataset D regarded as a sample of a random variable o, determine
how far away o is from being uniformly distributed in the data space
Sample n points, p1, …, pn, uniformly from the data space containing D. For
each pi, find its nearest neighbor in D: xi = min{dist(pi, v)} where v in D
Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest
neighbor in D − {qi}: yi = min{dist(qi, v)} where v in D and v ≠ qi
Calculate the Hopkins statistic:

$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$

If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other, and
H is close to 0.5. If D is highly skewed (i.e., clustered), H is close to 0
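A sketch of the statistic as defined above, assuming NumPy; drawing the uniform probe points from the bounding box of D is one common choice for the data space.

```python
import numpy as np

def hopkins(D, n, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))   # uniform probe points
    Q = D[rng.choice(len(D), n, replace=False)]     # sample of real points

    def nn_dist(points, data, exclude_self=False):
        d = np.linalg.norm(points[:, None] - data[None, :], axis=2)
        if exclude_self:
            d[d == 0] = np.inf        # drop each point itself (and exact duplicates)
        return d.min(axis=1)

    x = nn_dist(P, D)                       # x_i: probe -> nearest real point
    y = nn_dist(Q, D, exclude_self=True)    # y_i: real point -> nearest other point
    return y.sum() / (x.sum() + y.sum())    # ~0.5 if uniform, -> 0 if clustered
```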
Determine the Number of Clusters
Empirical method
# of clusters ≈ √(n/2) for a dataset of n points
Elbow method
Use the turning point in the curve of the sum of within-cluster variance w.r.t.
the # of clusters (see the sketch below)
Cross validation method
Divide a given data set into m parts
Use m – 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
E.g., For each point in the test set, find the closest centroid, and use the
sum of squared distance between all points in the test set and the closest
centroids to measure how well the model fits the test set
For any k > 0, repeat it m times, compare the overall quality measure w.r.t.
different k’s, and find # of clusters that fits the data the best
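A sketch of the elbow method referenced above, assuming scikit-learn and matplotlib; the three-blob dataset is synthetic, so the turning point should appear at k = 3.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])  # 3 blobs

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]                      # inertia_ = within-cluster SSE
plt.plot(ks, sse, marker="o")            # look for the "elbow" in this curve
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster SSE")
plt.show()
```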
Measuring Clustering Quality
Two methods: extrinsic vs. intrinsic
Extrinsic: supervised, i.e., the ground truth is available
Compare a clustering against the ground truth using certain
clustering quality measure
Ex. BCubed precision and recall metrics
Intrinsic: unsupervised, i.e., the ground truth is unavailable
Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the clusters
are
Ex. Silhouette coefficient
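A sketch of intrinsic evaluation with the silhouette coefficient, assuming scikit-learn; the two synthetic blobs are well separated, so the score should be close to 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Mean silhouette over all points: near 1 = compact, well-separated clusters
print(silhouette_score(X, labels))
```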
Measuring Clustering Quality: Extrinsic Methods
Clustering quality measure: Q(C, Cg), for a clustering C given
the ground truth Cg.
Q is good if it satisfies the following 4 essential criteria
Cluster homogeneity: the purer, the better
Cluster completeness: should assign objects belonging to the
same category in the ground truth to the same cluster
Rag bag: putting a heterogeneous object into a pure cluster
should be penalized more than putting it into a rag bag (i.e.,
“miscellaneous” or “other” category)
Small cluster preservation: splitting a small category into
pieces is more harmful than splitting a large category into
pieces
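A pure-Python sketch of the BCubed precision and recall metrics mentioned earlier, averaging per-object scores; the function name and toy labels are ours.

```python
def bcubed(clusters, categories):
    """clusters, categories: parallel label lists over the same objects."""
    n = len(clusters)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_category = [j for j in range(n) if categories[j] == categories[i]]
        correct = sum(1 for j in same_cluster
                      if categories[j] == categories[i])
        precision += correct / len(same_cluster)   # purity of i's cluster
        recall += correct / len(same_category)     # coverage of i's category
    return precision / n, recall / n

print(bcubed([1, 1, 1, 2, 2], ["a", "a", "b", "b", "b"]))
```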