Unsupervised Clustering
Clustering is a very general problem that appears in many different settings, not necessarily in a data mining context. For example:
Grouping similar products together to improve
the efficiency of a production line
Packing similar items into containers
Grouping similar customers together
Grouping similar stocks together
The Similarity Concept
Obviously, the concept of similarity is key
to clustering.
Using similarity definitions that are specific to a
domain may generate more acceptable clusters.
E.g., products that require the same or similar tools/processes in the production line are similar.
Articles that are in the course pack of the same
course are similar.
General similarity measures are required for
general purpose algorithms.
Clustering: The K-Means Algorithm
(Lloyd, 1982)
1. Choose a value for K, the total number of
clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their
closest cluster center.
4. Calculate a new cluster center for each
cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
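A minimal sketch of these five steps in Python/NumPy (the function name, the fixed random seed, and the lack of empty-cluster handling are illustrative assumptions, not part of the algorithm as stated above):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-means on an (n_instances, n_attributes) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K instances as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest cluster center
        # (Euclidean distance; see the next slide).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned instances
        # (an empty cluster would need special handling; omitted here).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the cluster centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```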
Distance Measure
The similarity is captured by a
distance measure in this algorithm.
The originally proposed distance measure is the Euclidean distance:
$$X = (x_1, x_2, \ldots, x_n), \quad Y = (y_1, y_2, \ldots, y_n)$$
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
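A short worked example with two-attribute instances (these happen to be instances 1 and 4 of the example that follows):

$$d\big((1.0,\,1.5),\,(2.0,\,3.5)\big) = \sqrt{(1.0-2.0)^2 + (1.5-3.5)^2} = \sqrt{1+4} = \sqrt{5} \approx 2.24$$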
An Example Using K-Means
Table 3.6  K-Means Input Values

Instance    x      y
1           1.0    1.5
2           1.0    4.5
3           2.0    1.5
4           2.0    3.5
5           3.0    2.5
6           5.0    6.0
[Figure: scatter plot of the six input instances.]
Table 3.7  Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points    Squared Error
1         (2.67, 4.67)      2, 4, 6           14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3              15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5      9.60
          (5, 6)            6
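A small sanity check on the third (best) outcome, in Python; the variable names are illustrative and the data come straight from Tables 3.6 and 3.7:

```python
import numpy as np

# The six instances of Table 3.6, as (x, y) pairs.
X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# Third outcome of Table 3.7: instances 1-5 around (1.8, 2.7), instance 6 around (5, 6).
centers = np.array([[1.8, 2.7], [5.0, 6.0]])
labels = np.array([0, 0, 0, 0, 0, 1])

# Squared error = sum of squared Euclidean distances to the assigned centers.
sse = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(2))
print(round(sse, 2))   # 9.6, the smallest squared error in Table 3.7
```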
General Considerations
Works best when the clusters in the data are of
approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Requires real-valued data. Categorical data can be converted to numeric values, but the distance function then needs to be worked out carefully.
The number of clusters present in the data must be selected in advance.
Data normalization may be required if attribute ranges vary significantly (a minimal scaling sketch follows this list).
Alternative distance measures may generate different
clusters.
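As a minimal sketch of the normalization point above (the helper name is hypothetical, and min-max scaling is only one of several options):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each attribute (column) of X to [0, 1] so that attributes with
    large ranges do not dominate the Euclidean distance."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)   # assumes no attribute is constant
```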
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
K-means Clustering Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is relaxed to "until relatively few points change clusters"
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Two different K-means Clusterings
[Figure: the same set of original points clustered in two different ways: an optimal clustering and a sub-optimal clustering.]
Importance of Choosing Initial Centroids
[Figure: the cluster assignments after iteration 6.]
Importance of Choosing Initial Centroids
[Figure: the cluster assignments at iterations 1 through 6.]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
For each point, the error is the distance to the nearest cluster center
To get the SSE, we square these errors and sum them:
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
x is a data point in cluster C_i and m_i is the representative point for cluster C_i
One can show that m_i corresponds to the center (mean) of the cluster (see the sketch after this slide)
Given two clusterings, we can choose the one with the smaller error
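A one-line sketch of the claim above, assuming squared Euclidean distance: setting the gradient of cluster $C_i$'s contribution to the SSE with respect to $m_i$ to zero gives

$$\frac{\partial}{\partial m_i} \sum_{x \in C_i} \lVert x - m_i \rVert^2 = -2 \sum_{x \in C_i} (x - m_i) = 0 \quad\Longrightarrow\quad m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x,$$

i.e., the mean of the points in the cluster.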
Importance of Choosing Initial Centroids
[Figure: the cluster assignments after iteration 5 for a different choice of initial centroids.]
Importance of Choosing Initial Centroids
[Figure: the cluster assignments at iterations 1 through 5.]
Problems with Selecting Initial Points
If there are K "real" clusters, the chance of selecting one initial centroid from each cluster is small, especially when K is large.
If the clusters are all the same size, n, then
P = (ways to select one centroid from each cluster) / (ways to select K centroids) = K! n^K / (Kn)^K = K!/K^K
For example, if K = 10, the probability is 10!/10^10 = 0.00036 (a quick numerical check follows this slide).
Sometimes the initial centroids will readjust themselves in the right way, and sometimes they don't.
Consider an example of five pairs of clusters
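A quick numerical check of the probability quoted above (K = 10, equal-sized clusters):

```python
import math

K = 10
# K!/K**K: chance that K randomly chosen initial centroids cover the
# K equal-sized "real" clusters exactly once each.
print(math.factorial(K) / K**K)   # 0.00036288
```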
Hierarchical Clustering
Produces a set of nested clusters
organized as a hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the
sequences of merges or splits
[Figure: a clustering of five points and the corresponding dendrogram.]
Strengths of Hierarchical
Clustering
Do not have to assume any particular
number of clusters
Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
They may correspond to meaningful
taxonomies
Examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...)
Hierarchical Clustering
Two main types of hierarchical clustering
Agglomerative:
Start with the points as individual clusters
At each step, merge the closest pair of clusters, until only one cluster (or k clusters) remains
Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a point (or
there are k clusters)
Traditional hierarchical algorithms use a similarity or
distance matrix
Merge or split one cluster at a time
Agglomerative Clustering
Algorithm
More popular hierarchical clustering technique
Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the
proximity of two clusters
Different approaches to defining the distance
between clusters distinguish the different algorithms
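A naive Python sketch of the six steps above, assuming Euclidean proximity on an (n_points, n_attributes) NumPy array; the function name and the brute-force search over cluster pairs are illustrative only:

```python
import numpy as np

def agglomerative(X, linkage=min):
    """Naive agglomerative clustering. `linkage` combines the pairwise
    distances between two clusters: min -> single link, max -> complete link."""
    # Step 1: compute the proximity (Euclidean distance) matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: let each data point be a cluster.
    clusters = [[i] for i in range(len(X))]
    merges = []
    # Steps 3-6: repeatedly merge the two closest clusters until one remains.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))   # record the merge
        clusters[a] = clusters[a] + clusters[b]        # update cluster membership
        del clusters[b]
    return merges
```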
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: individual points p1, p2, p3, p4, p5, ... and the initial proximity matrix.]
Intermediate Situation
After some merging steps, we have some clusters
[Figure: clusters C1, C2, C3, C4, C5 and the updated proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5)
and update the proximity matrix.
[Figure: clusters C1 through C5, with C2 and C5 about to be merged, and the proximity matrix.]
After Merging
The question is: "How do we update the proximity matrix?"
[Figure: the proximity matrix after merging C2 and C5; the row and column for the new cluster C2 U C5 are marked "?".]
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1, ..., p5 and their proximity matrix. Which entries define the similarity?]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function
Ward's Method uses squared error
Cluster Similarity: MIN or
Single Link
Similarity of two clusters is based on
the two most similar (closest) points
in the different clusters
Determined by one pair of points, i.e., by
one link in the proximity graph.
        I1      I2      I3      I4      I5
I1      1.00    0.90    0.10    0.65    0.20
I2      0.90    1.00    0.70    0.60    0.50
I3      0.10    0.70    1.00    0.40    0.30
I4      0.65    0.60    0.40    1.00    0.80
I5      0.20    0.50    0.30    0.80    1.00
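As an illustration (the grouping is hypothetical): if {I1, I2} and {I4, I5} were two clusters, their single-link similarity would be the largest pairwise similarity across the clusters, max(0.65, 0.20, 0.60, 0.50) = 0.65.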
Hierarchical Clustering: MIN
[Figure: nested clusters produced by MIN (single link) and the corresponding dendrogram.]
Strength of MIN
[Figure: original points and the two clusters found by MIN.]
Can handle non-elliptical shapes
Limitations of MIN
[Figure: original points and the two clusters found by MIN.]
Sensitive to noise and outliers
Cluster Similarity: MAX or Complete
Linkage
Similarity of two clusters is based on
the two least similar (most distant)
points in the different clusters
Determined by all pairs of points in the
two clusters
        I1      I2      I3      I4      I5
I1      1.00    0.90    0.10    0.65    0.20
I2      0.90    1.00    0.70    0.60    0.50
I3      0.10    0.70    1.00    0.40    0.30
I4      0.65    0.60    0.40    1.00    0.80
I5      0.20    0.50    0.30    0.80    1.00
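For the same hypothetical grouping {I1, I2} and {I4, I5}, the complete-link similarity is the smallest pairwise similarity across the clusters, min(0.65, 0.20, 0.60, 0.50) = 0.20.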
Hierarchical Clustering: MAX
[Figure: nested clusters produced by MAX (complete link) and the corresponding dendrogram.]
Strength of MAX
[Figure: original points and the two clusters found by MAX.]
Less susceptible to noise and outliers
Limitations of MAX
[Figure: original points and the two clusters found by MAX.]
Tends to break large clusters
Biased towards globular clusters
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise
proximity between points in the two clusters.
$$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \frac{\displaystyle\sum_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \cdot |\mathrm{Cluster}_j|}$$
Need to use average connectivity for scalability since
total proximity favors large clusters
        I1      I2      I3      I4      I5
I1      1.00    0.90    0.10    0.65    0.20
I2      0.90    1.00    0.70    0.60    0.50
I3      0.10    0.70    1.00    0.40    0.30
I4      0.65    0.60    0.40    1.00    0.80
I5      0.20    0.50    0.30    0.80    1.00
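For the same hypothetical grouping {I1, I2} and {I4, I5}, the group-average proximity is (0.65 + 0.20 + 0.60 + 0.50) / (2 x 2) = 0.4875.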
Hierarchical Clustering: Group Average
[Figure: nested clusters produced by group average and the corresponding dendrogram.]
Hierarchical Clustering: Group Average
Compromise between Single and
Complete Link
Strengths
Less susceptible to noise and outliers
Limitations
Biased towards globular clusters
Cluster Similarity: Ward's Method
Similarity of two clusters is based on the increase
in squared error when two clusters are merged
Similar to group average if distance between points is
distance squared
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical analogue of K-means
Can be used to initialize K-means
Hierarchical Clustering: Comparison
[Figure: the same set of points clustered with MIN, MAX, Group Average, and Ward's Method.]
Hierarchical Clustering: Time and Space requirements
O(N^2) space, since it uses the proximity matrix
N is the number of points
O(N^3) time in many cases
There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched
Complexity can be reduced to O(N^2 log N) time for some approaches
Hierarchical Clustering: Problems and Limitations
Once a decision is made to combine two
clusters, it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or
more of the following:
Sensitivity to noise and outliers
Difficulty handling different sized clusters and
convex shapes
Breaking large clusters