What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering for Data Understanding and
Applications
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
Climate: understanding Earth's climate, finding patterns in the
atmosphere and ocean
Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, PCA, classification, and
association analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
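The "far away from any cluster" idea can be sketched as a distance test against cluster centers. This is a minimal illustration, assuming Euclidean distance and clusters summarized by centroids; the function name, the toy data, and the threshold value are all hypothetical.

```python
from math import dist


def far_from_all_clusters(points, centroids, threshold):
    """Return the points whose nearest centroid is farther than `threshold`."""
    return [p for p in points if min(dist(p, c) for c in centroids) > threshold]


# Hypothetical cluster centers and observations in 2-D.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0), (0.1, -0.3)]

# The threshold is an assumed, application-specific cutoff.
outliers = far_from_all_clusters(points, centroids, threshold=3.0)
print(outliers)
```

In practice the cutoff would depend on the data scale and on how tight the clusters are.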
Quality: What Is Good Clustering?
A good clustering method will produce high quality
clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
its ability to discover some or all of the hidden patterns
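One rough way to put numbers on cohesion and separation is to compare the average distance between objects in the same cluster with the average distance between objects in different clusters. The sketch below assumes Euclidean distance and hypothetical toy data; the function names are illustrative and not taken from any library.

```python
from itertools import combinations
from math import dist


def mean_intra_cluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance over all pairs of points from different clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(pairs) / len(pairs)


# Hypothetical toy data: two tight, well-separated groups in 2-D.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

A small intra-cluster distance corresponds to high intra-class similarity, and a large inter-cluster distance to low inter-class similarity.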
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio-scaled, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
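To make the distance function d(i, j) and the role of variable weights concrete, here is a small sketch. It assumes a weighted Euclidean distance for interval-scaled variables and a simple matching distance for categorical ones; the function names, weights, and sample records are hypothetical.

```python
from math import sqrt


def weighted_euclidean(x, y, weights):
    """d(i, j) = sqrt(sum_f w_f * (x_f - y_f)^2) for interval-scaled variables."""
    return sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def simple_matching_distance(x, y):
    """Fraction of categorical (or boolean) attributes on which x and y differ."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


# Hypothetical customers described by (age in years, income in thousands).
a, b = (30.0, 50.0), (34.0, 53.0)
d_equal = weighted_euclidean(a, b, weights=(1.0, 1.0))   # both variables count
d_income = weighted_euclidean(a, b, weights=(0.0, 1.0))  # age ignored

# Hypothetical categorical records: (color, body type, leased?).
d_cat = simple_matching_distance(("red", "suv", "yes"), ("red", "sedan", "no"))
print(d_equal, d_income, d_cat)
```

Changing the weights changes which objects count as similar, which is one reason the choice is tied to the application and the data semantics.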
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one
class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
Model-based:
A model is hypothesized for each of the clusters, and the method tries
to find the best fit of the data to that model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Partitioning Algorithms: Basic Concept
Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
c_i is the centroid or medoid of cluster C_i)
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimum: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
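The criterion E and the "exhaustively enumerate all partitions" option can be illustrated on a tiny data set. The sketch below assumes centroids as cluster representatives and squared Euclidean distance; the function names and the 1-D sample points are hypothetical. It tries every labeling, so it is only feasible for very small n, which is why heuristics such as k-means and k-medoids are used in practice.

```python
from itertools import product


def centroid(cluster):
    """Mean point of a non-empty list of equal-length numeric tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(clusters):
    """E = sum over clusters C_i and points p in C_i of ||p - c_i||^2."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)
    return total


def best_partition_exhaustive(points, k):
    """Try every assignment of points to k labels and keep the lowest E.

    There are k**n labelings, so this is a demonstration, not a method.
    """
    best_e, best_clusters = float("inf"), None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # skip labelings that leave a cluster empty
            continue
        clusters = [
            [p for p, label in zip(points, labels) if label == i]
            for i in range(k)
        ]
        e = sse(clusters)
        if e < best_e:
            best_e, best_clusters = e, clusters
    return best_e, best_clusters


# Hypothetical 1-D data with two obvious groups.
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
best_e, best_clusters = best_partition_exhaustive(points, k=2)
print(best_e, best_clusters)
```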
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four
steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest
seed point
4. Go back to Step 2; stop when the assignment does
not change
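The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, and the `seed` and `max_iter` parameters, the round-robin initial partition, and the re-seeding of an emptied cluster are choices made here for the example.

```python
import random
from math import dist


def centroid(cluster):
    """Mean point of a non-empty list of equal-length numeric tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating the four steps."""
    if not 1 <= k <= len(points) or max_iter < 1:
        raise ValueError("need 1 <= k <= len(points) and max_iter >= 1")
    rng = random.Random(seed)

    # Step 1: arbitrary partition into k nonempty subsets
    # (shuffle the indices, then deal them out round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k

    for _ in range(max_iter):
        # Step 2: centroids of the current partition.
        centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Assumption: a cluster that lost all members is re-seeded
            # with a randomly chosen data point.
            centroids.append(centroid(members) if members else rng.choice(points))

        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: dist(p, centroids[i])) for p in points
        ]

        # Step 4: stop when the assignment does not change, else repeat.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Which group receives label 0 depends on the initial partition, so only the grouping itself is meaningful.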
An Example of K-Means Clustering
K = 2
[Figure: the initial data set → arbitrarily partition objects into k groups → update the cluster centroids → reassign objects → loop if needed]
Partition objects into k nonempty subsets
Repeat
Compute centroid (i.e., mean point) for each partition
Assign each object to the cluster of its nearest centroid
Until no change
Comments on the K-Means Method
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)^2) per iteration, CLARA: O(ks^2 + k(n-k)), where s is the sample size
Comment: Often terminates at a local optimum.
Weakness
Applicable only to objects in a continuous n-dimensional space
Use the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of
data
Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al., 2009)
Sensitive to noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
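The sensitivity to outliers can be seen by comparing a mean (the k-means representative) with a medoid (the k-medoids representative) on the same cluster. The sketch below uses hypothetical 1-D values and illustrative function names; the medoid is taken to be the data object with the smallest total absolute distance to the others.

```python
def mean_point(values):
    """Cluster representative used by k-means: the arithmetic mean."""
    return sum(values) / len(values)


def medoid(values):
    """Cluster representative used by k-medoids: the actual data object
    with the smallest total distance to all objects in the cluster."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


# Hypothetical cluster, then the same cluster with one value replaced
# by an extreme outlier.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = [1.0, 2.0, 3.0, 4.0, 100.0]

print(mean_point(clean), medoid(clean))  # 3.0 3.0
print(mean_point(noisy), medoid(noisy))  # 22.0 3.0
```

A single extreme value drags the mean far from the bulk of the cluster, while the medoid stays on a typical object.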