Chapter 5
Clustering
5.1 Introduction
Chapters 3 and 4 describe how samples may be classified if a training set is available to
use in the design of a classifier. However, there are many situations where the classes
themselves are initially undefined. Given a set of feature vectors sampled from some
population, we would like to know if the data set consists of a number of relatively
distinct subsets. If it does and we can determine these subsets, we can define them to
be classes. This is sometimes called class discovery. The techniques from Chapters
3 and 4 can then be used to further analyze or model the data or to classify new data
if desired. Clustering refers to the process of grouping samples so that the samples are
similar within each group. The groups are called clusters.
In applications, the main goal may be to discover the subgroups rather than to model
them statistically. For example, the marketing director of a firm that supplies business
services may want to know if the businesses in a particular community fall into any
natural groupings or subgroups of similar companies, so that specific service packages
and marketing plans can be designed for each of these subgroups. Reading the public
information available about the businesses could give some idea of what some of these
subgroups might be. As another example, in image processing, clustering may be used
to group pixels with similar gray levels.
5.2 Hierarchical Clustering

A hierarchical clustering can be displayed as a tree called a dendrogram; Figure 5.1 shows an example.

Figure 5.1: A hierarchical clustering shown as a dendrogram. Animals are divided into Cats and Dogs, and Dogs are further divided into St. Bernard and Labrador; a second dendrogram shows five numbered samples grouped at successive levels.
In the dendrogram of the numbered samples, the lowest level contains five clusters, {1}, {2}, {3}, {4}, {5}, each consisting of an individual sample. At level 1, the clusters are

{1,2}, {3}, {4,5}.

At level 3, the clusters are

{1,2,3}, {4,5}.

At level 4, the single cluster

{1,2,3,4,5}

consists of all the samples.
In a hierarchical clustering, if at some level two samples belong to the same cluster, they belong to the same cluster at all higher levels as well. Clustering algorithms are called agglomerative if they build the dendrogram from the bottom up, and they are called divisive if they build the dendrogram from the top down.
The general agglomerative clustering algorithm is straightforward to describe. The total number of samples will be denoted by n.

1. Begin with n clusters, each consisting of one sample.

2. Repeat step 3 a total of n - 1 times.

3. Find the most similar clusters Ci and Cj and merge Ci and Cj into one cluster. If there is a tie, merge the first pair found.
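This loop can be written out directly. The following is a minimal sketch, not the book's code: it keeps each cluster as a list of sample indices, repeatedly merges the most similar pair according to a cluster-distance function supplied as a parameter, and records the merges. The function names and the use of Euclidean distance in the sample-level helper are our own choices.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(samples, cluster_dist):
    """General agglomerative clustering.

    samples      -- list of feature vectors
    cluster_dist -- function giving the distance between two clusters,
                    each cluster being a list of sample indices
    Returns the list of merges as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(samples))]   # one cluster per sample
    merges = []
    while len(clusters) > 1:                        # n - 1 merges in all
        best = None
        for i in range(len(clusters)):              # find the most similar pair;
            for j in range(i + 1, len(clusters)):   # ties go to the first pair found
                d = cluster_dist(clusters[i], clusters[j], samples)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]     # merge Cj into Ci
        del clusters[j]
    return merges

# Single-linkage cluster distance, one of the choices discussed below.
def single_link(ci, cj, samples):
    return min(euclidean(samples[a], samples[b]) for a in ci for b in cj)
```

With the five samples of Figure 5.2, agglomerate(samples, single_link) should reproduce the merge order of the single-linkage example that follows.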
Different hierarchical clustering algorithms are obtained by using different methods to determine the similarity between clusters. One way to measure the similarity between clusters is to define a function that measures the distance between pairs of clusters. This distance function typically is induced by an underlying function that measures the distance between pairs of samples. In cluster analysis, as in nearest neighbor techniques (Section 4.4), the most popular distance measures are Euclidean distance and city block distance.
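For concreteness, here is a small sketch of the two measures (the function names are ours); the coordinates in the demonstration are the five samples of Figure 5.2 as they are listed later in Example 5.5.

```python
import math

def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]   # samples 1-5 of Figure 5.2
print(euclidean(samples[0], samples[2]))   # d(1,3) is about 11.7 (Euclidean)
print(city_block(samples[0], samples[2]))  # 11 + 4 = 15 (city block)
```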
Figure 5.2: The five samples plotted in the feature plane, with Feature 1 on the horizontal axis.
The single-linkage algorithm is obtained by defining the distance between two clusters to be the smallest distance between two points such that one point is in each cluster. Formally, if Ci and Cj are clusters, the distance between them is defined as

D_{SL}(C_i, C_j) = \min_{a \in C_i,\ b \in C_j} d(a, b).
After the nearest clusters {1} and {2} are merged, the matrix of distances between the clusters is:

          {1,2}     3      4      5
{1,2}              8.1   16.0   17.9
3           8.1           9.8    9.8
4          16.0    9.8           8.0
5          17.9    9.8    8.0
The value 8.1 in row {1,2} and column 3 gives the distance between the clusters {1,2}
and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7
and d(2,3) = 8.1. In the single-linkage algorithm, the distance between clusters is the
minimum of these values, 8.1. The other values in the first row are computed in a
similar way. The values in other than the first row or first column are simply copied
from the previous table (5.1). Since the minimum value in this matrix is 8, the clusters
{4) and {5} are merged. At this point there are three clusters:
{1,2}, {3}, {4,5}.

Next obtain the matrix that gives the distance between these clusters:

          {1,2}     3    {4,5}
{1,2}              8.1   16.0
3           8.1           9.8
{4,5}      16.0    9.8
Since the minimum value in this matrix is 8.1, the clusters {1,2} and {3} are merged. At this point the clusters are

{1,2,3}, {4,5}.

At the next step, the two remaining clusters are merged and the hierarchical clustering is complete. The dendrogram is shown in Figure 5.3.
Figure 5.3: Hierarchical clustering using the single-linkage algorithm. The distance D_SL between clusters that merge is shown on the vertical axis.
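If SciPy is available, the single-linkage clustering of this example can be checked in a few lines. This is only a cross-check, assuming the five samples of Figure 5.2 are those listed later in Example 5.5; it uses scipy.cluster.hierarchy.linkage rather than the hand computation above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The five samples of Figure 5.2 (coordinates as listed in Example 5.5).
X = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)

# Single-linkage clustering with Euclidean distance.
Z = linkage(X, method="single", metric="euclidean")

# Each row of Z is (cluster index, cluster index, merge distance, cluster size).
# The merge distances should match the dendrogram of Figure 5.3:
# 4.0 for {1,2}, 8.0 for {4,5}, then about 8.1 and about 9.8.
print(Z)
```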
In the complete-linkage algorithm, the distance between two clusters is defined to be the largest distance between two points such that one point is in each cluster. Applying the complete-linkage algorithm to the same data, after the nearest clusters {1} and {2} are merged the matrix of distances between the clusters is:

          {1,2}     3      4      5
{1,2}             11.7   20.0   21.5
3          11.7           9.8    9.8
4          20.0    9.8           8.0
5          21.5    9.8    8.0
The value 11.7 in row {1,2} and column 3 gives the distance between the clusters {1,2}
and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7
and d(2,3) = 8.1. In the complete-linkage algorithm, the distance between clusters is
the maximum of these values, 11.7. The other values in the first row are computed in
a similar way. The values in other than the first row or first column are simply copied
from (5.1). Since the minimum value in this matrix is 8, the clusters {4} and {5} are
merged. At this point the clusters are
          {1,2}     3    {4,5}
{1,2}             11.7   21.5
3          11.7           9.8
{4,5}      21.5    9.8
Since the minimum value in this matrix is 9.8, the clusters {3} and {4,5} are merged.
At this point the clusters are
{1,2}, {3,4,5}.
Notice that these clusters are different from those obtained at the corresponding point
of the single-linkage algorithm.
At the next step, the two remaining clusters will be merged. The hierarchical
clustering is complete. The dendrogram is shown in Figure 5.4.
The single-linkage algorithm allows clusters to grow long and thin, whereas the complete-linkage algorithm produces more compact clusters. Both clusterings are susceptible to
Figure 5.4: Hierarchical clustering using the complete-linkage algorithm.
In the average-linkage algorithm, the distance between two clusters Ci and Cj is the average of the distances between the pairs of samples, one from each cluster:

D_{AL}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{a \in C_i,\ b \in C_j} d(a, b),

where n_i and n_j are the numbers of samples in C_i and C_j.
Example 5.3  Hierarchical clustering using the average-linkage algorithm.

Perform a hierarchical clustering using the average-linkage algorithm on the data shown in Figure 5.2. Use Euclidean distance (4.1) for the distance between samples.

The algorithm begins with five clusters, each consisting of one sample. The nearest clusters {1} and {2} are then merged to form the clusters

{1,2}, {3}, {4}, {5}.
Next obtain the matrix that gives the distance between these clusters:

          {1,2}     3      4      5
{1,2}              9.9   18.0   19.7
3           9.9           9.8    9.8
4          18.0    9.8           8.0
5          19.7    9.8    8.0
The value 9.9 in row {1,2} and column 3 gives the distance between the clusters {1,2} and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7 and d(2,3) = 8.1. In the average-linkage algorithm, the distance between clusters is the average of these values, 9.9. The other values in the first row are computed in a similar way. The values in other than the first row or first column are simply copied from (5.1). Since the minimum value in this matrix is 8.0, the clusters {4} and {5} are merged. At this point the clusters are

{1,2}, {3}, {4,5}.
The distances between these clusters are:

          {1,2}     3    {4,5}
{1,2}              9.9   18.9
3           9.9           9.8
{4,5}      18.9    9.8

Since the minimum value in this matrix is 9.8, the clusters {3} and {4,5} are merged. At the next step, the two remaining clusters are merged and the hierarchical clustering is complete.
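The three linkage rules introduced so far differ only in how they combine the sample-to-sample distances. The short sketch below (the helper names are ours) computes the distance between the clusters {1,2} and {3} under each rule; the results should be approximately the 8.1, 11.7, and 9.9 used in the single-, complete-, and average-linkage examples.

```python
import math

samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def d(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_distances(ci, cj):
    """All sample-to-sample distances between two clusters of sample numbers."""
    return [d(samples[a], samples[b]) for a in ci for b in cj]

ci, cj = [1, 2], [3]
print(min(pair_distances(ci, cj)))                        # single linkage, about 8.1
print(max(pair_distances(ci, cj)))                        # complete linkage, about 11.7
print(sum(pair_distances(ci, cj)) / (len(ci) * len(cj)))  # average linkage, about 9.9
```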
Ward's Method
Ward's method is also called the minimum-variance method. Like the other agglomerative algorithms, Ward's method begins with one cluster for each individual sample. At each iteration, among all pairs of clusters, it merges the pair that produces the smallest squared error for the resulting set of clusters. The squared error for each cluster is defined as follows. If a cluster contains m samples x_1, ..., x_m, where x_i is the feature vector (x_{i1}, ..., x_{id}), the squared error for sample x_i, which is the squared distance from x_i to the cluster mean, is

e_i^2 = \sum_{j=1}^{d} (x_{ij} - \mu_j)^2.

The squared error E for the entire cluster is the sum of the squared errors of the samples:

E = \sum_{i=1}^{m} e_i^2.

The vector \mu = (\mu_1, ..., \mu_d), composed of the means of each feature, is called the mean vector or centroid of the cluster. The squared error for a cluster is thus equal to the sum of the squared distances in each feature from the cluster members to the cluster mean. The squared error is also equal to m\sigma^2, the number of samples in the cluster m times the total variance \sigma^2, where the total variance is defined to be the sum of the variances for each feature: \sigma^2 = \sigma_1^2 + ... + \sigma_d^2. The squared error for a set of clusters is defined to be the sum of the squared errors for the individual clusters.
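This definition translates directly into code. The sketch below (the helper names are ours) computes the squared error of one cluster and of a set of clusters; merging the pair of clusters that minimizes the total squared error is exactly the step that Ward's method repeats.

```python
def centroid(cluster):
    """Mean vector (centroid) of a cluster of feature vectors."""
    n = len(cluster)
    return [sum(x[j] for x in cluster) / n for j in range(len(cluster[0]))]

def squared_error(cluster):
    """Sum over samples of the squared distances, feature by feature, to the centroid."""
    mu = centroid(cluster)
    return sum(sum((x[j] - mu[j]) ** 2 for j in range(len(mu))) for x in cluster)

def total_squared_error(clusters):
    """Squared error of a set of clusters: the sum of the individual cluster errors."""
    return sum(squared_error(c) for c in clusters)

# Samples 1-5 of Figure 5.2 (coordinates as listed in Example 5.5).
s = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(squared_error([s[0], s[1]]))                              # cluster {1,2}: 8.0
print(total_squared_error([[s[0], s[1]], [s[2], s[3], s[4]]]))  # {1,2},{3,4,5}: 94.0
```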
Example 5.4  Hierarchical clustering using Ward's method.

Perform a hierarchical clustering using Ward's method on the data shown in Figure 5.2.

The algorithm begins with five clusters, each consisting of one sample. At this point, the squared error is zero. There are 10 possible ways to merge a pair of clusters: merge {1} and {2}, merge {1} and {3}, and so on. Figure 5.5 shows the squared error for each possibility. For example, consider merging {1} and {2}. Sample 1 has the feature vector (4,4) and sample 2 has the feature vector (8,4), so the means of the two features are 6 and 4. The squared error for the cluster {1,2} is

(4 - 6)^2 + (4 - 4)^2 + (8 - 6)^2 + (4 - 4)^2 = 8.0.

Since the squared error of each single-sample cluster is zero, the squared error for this set of clusters is 8.0.
Clusters                 Squared Error, E
{1,2},{3},{4},{5}                8.0
{1,3},{2},{4},{5}               68.5
{1,4},{2},{3},{5}              200.0
{1,5},{2},{3},{4}              232.0
{2,3},{1},{4},{5}               32.5
{2,4},{1},{3},{5}              128.0
{2,5},{1},{3},{4}              160.0
{3,4},{1},{2},{5}               48.5
{3,5},{1},{2},{4}               48.5
{4,5},{1},{2},{3}               32.0

Figure 5.5: Squared errors for four clusters.
Since merging {1} and {2} produces the smallest squared error, these two clusters are merged. Figure 5.6 shows the squared errors for the possible merges at the next step.

Clusters                 Squared Error, E
{1,2,3},{4},{5}                 72.7
{1,2,4},{3},{5}                224.0
{1,2,5},{3},{4}                266.7
{1,2},{3,4},{5}                 56.5
{1,2},{3,5},{4}                 56.5
{1,2},{4,5},{3}                 40.0

Figure 5.6: Squared errors for three clusters.
Since the clustering {1,2}, {4,5}, {3} has the smallest squared error (40.0), the clusters {4} and {5} are merged. Figure 5.7 shows the squared errors for the possible merges at the next step.

Clusters                 Squared Error, E
{1,2,3},{4,5}                  104.7
{1,2,4,5},{3}                  380.0
{1,2},{3,4,5}                   94.0

Figure 5.7: Squared errors for two clusters.
Since the clustering {1,2}, {3,4,5} has the smallest squared error (94.0), the clusters {3} and {4,5} are merged. At the next step, the two remaining clusters are merged and the hierarchical clustering is complete. The resulting dendrogram is shown in Figure 5.8.

Figure 5.8: Dendrogram for Ward's method.
5.3 Partitional Clustering

A partitional clustering algorithm produces a single partition that divides the data into similar groups. Samples close to one another are assumed to be similar, and the goal of the partitional clustering algorithms is to group samples that are close together. In many of the partitional algorithms, the number of clusters to be constructed is specified in advance.

If a partitional algorithm is used to divide the data set into two groups, each of these groups is divided into two parts, and so on, a hierarchical clustering could be produced from the top down. The hierarchy produced by this divisive procedure is built by splitting clusters rather than by merging them.
One of the simplest partitional clustering algorithms is Forgy's algorithm [Forgy]. Besides the data, input to the algorithm consists of k, the number of clusters to be constructed, and k samples called seed points. The seed points could be chosen randomly, or some knowledge of the desired cluster structure could be used to guide their selection.
Forgy's Algorithm

1. Initialize the cluster centroids to the seed points.

2. For each sample, find the cluster centroid nearest it. Put the sample in the cluster identified with this nearest centroid.

3. If no samples changed clusters in step 2, stop.

4. Compute the centroids of the resulting clusters and go to step 2.
Sample      Nearest Cluster Centroid
(4,4)       (4,4)
(8,4)       (8,4)
(15,8)      (8,4)
(24,4)      (8,4)
(24,12)     (8,4)

Figure 5.9: First iteration of Forgy's algorithm.
Sample      Nearest Cluster Centroid
(4,4)       (4,4)
(8,4)       (4,4)
(15,8)      (17.75,7)
(24,4)      (17.75,7)
(24,12)     (17.75,7)

Figure 5.10: Second iteration of Forgy's algorithm.
The centroid of the second cluster is (17.75, 7) since (8 + 15 + 24 + 24)/4 = 17.75 and (4 + 8 + 4 + 12)/4 = 7. Since some samples changed clusters (there were initially no clusters), return to step 2.
Find the cluster centroid nearest each sample. Figure 5.10 shows the results. The
clusters {(4,4), (8, 4)} and {(15,8), (24, 4), (24, 12)} are produced.
For step 4, compute the centroids (6, 4) and (21, 8) of the clusters. Since the sample (8, 4) changed clusters, return to step 2. On this pass (Figure 5.11) no samples change clusters, so the algorithm terminates with the clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)}.
In this version of Forgy's algorithm, the seed points are chosen arbitrarily as the
first two samples; however, other possibilities have been suggested. One alternative is
to begin with k clusters generated by one of the hierarchical clustering algorithms and
use their centroids as initial seed points.
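A minimal sketch of Forgy's algorithm as it is used above: assign every sample to the nearest centroid, recompute the centroids, and repeat until no sample changes clusters. The seed points are passed in explicitly, and the function names are ours rather than the book's.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(x[j] for x in cluster) / n for j in range(len(cluster[0])))

def forgy(samples, seeds):
    """Forgy's algorithm: returns a list of clusters (lists of samples)."""
    centroids = list(seeds)
    assignment = None
    while True:
        # Assign each sample to the nearest centroid.
        new_assignment = [min(range(len(centroids)),
                              key=lambda c: dist(s, centroids[c]))
                          for s in samples]
        if new_assignment == assignment:      # no sample changed clusters: stop
            break
        assignment = new_assignment
        # Recompute the centroid of each nonempty cluster.
        for c in range(len(centroids)):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return [[s for s, a in zip(samples, assignment) if a == c]
            for c in range(len(centroids))]

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(forgy(samples, seeds=[(4, 4), (8, 4)]))
# Expected, as in Example 5.5: [[(4,4), (8,4)], [(15,8), (24,4), (24,12)]]
```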
Sample      Nearest Cluster Centroid
(4,4)       (6,4)
(8,4)       (6,4)
(15,8)      (21,8)
(24,4)      (21,8)
(24,12)     (21,8)

Figure 5.11: Third iteration of Forgy's algorithm.
k-means Algorithm

1. Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n - k samples, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.

2. Go through the data a second time. For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. (During this step, do not recompute any centroid.)
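A sketch of the two steps just described, assuming the samples are processed in their given order (the function names are ours): step 1 assigns each remaining sample to the nearest current centroid and immediately recomputes that centroid; step 2 makes one more pass that reassigns every sample without recomputing any centroid.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def kmeans_two_pass(samples, k):
    # Step 1: the first k samples start the k clusters.
    clusters = [[s] for s in samples[:k]]
    centroids = [s for s in samples[:k]]
    for s in samples[k:]:
        i = min(range(k), key=lambda c: dist(s, centroids[c]))
        clusters[i].append(s)
        centroids[i] = mean(clusters[i])      # recompute only the altered centroid
    # Step 2: one more pass; assign to the nearest centroid, no recomputation.
    final = [[] for _ in range(k)]
    for s in samples:
        i = min(range(k), key=lambda c: dist(s, centroids[c]))
        final[i].append(s)
    return final

# The ordering assumed in the worked example: (8,4) and (24,4) come first.
samples = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]
print(kmeans_two_pass(samples, k=2))
# Step 1 should end with centroids (9, 5.3) and (24, 8), as in Figure 5.12.
```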
Sample      Distance to Centroid (9, 5.3)      Distance to Centroid (24, 8)
(8,4)             1.6                               16.5
(24,4)           15.1                                4.0
(15,8)            6.6                                9.0
(4,4)             5.2                               20.4
(24,12)          16.4                                4.0

Figure 5.12: Distances for use by step 2 of the k-means algorithm.
Example 5.6  Partitional clustering using the k-means algorithm.

Perform a partitional clustering using the k-means algorithm on the data in Figure 5.2. Set k = 2 and assume that the data are ordered so that the first two samples are (8,4) and (24,4).

For step 1, begin with the two clusters {(8,4)} and {(24,4)}, which have centroids at (8,4) and (24,4). For each of the remaining three samples, find the centroid nearest it, put the sample in this cluster, and recompute the centroid of this cluster.

The next sample (15,8) is nearest the centroid (8,4) so it joins cluster {(8,4)}. At this point, the clusters are {(8,4), (15,8)} and {(24,4)}. The centroid of the first cluster is updated to (11.5, 6) since

(8 + 15)/2 = 11.5,   (4 + 8)/2 = 6.

The next sample (4,4) is nearest the centroid (11.5, 6) so it joins cluster {(8,4), (15,8)}. At this point, the clusters are {(8,4), (15,8), (4,4)} and {(24,4)}. The centroid of the first cluster is updated to (9, 5.3).

The next sample (24,12) is nearest the centroid (24,4) so it joins cluster {(24,4)}. At this point, the clusters are {(8,4), (15,8), (4,4)} and {(24,12), (24,4)}. The centroid of the second cluster is updated to (24, 8). At this point, step 1 of the algorithm is complete.
For step 2, examine the samples one by one and put each sample in the cluster identified with the nearest of the centroids (9, 5.3) and (24, 8); the distances are shown in Figure 5.12. Each sample is nearest the centroid of the cluster it already belongs to, so the final clusters are {(8,4), (15,8), (4,4)} and {(24,4), (24,12)}.
The goal of Forgy's algorithm and the k-means algorithm is to minimize the squared
error (as defined in Section 5.2) for a fixed number of clusters. These algorithms assign
samples to clusters so as to reduce the squared error and, in the iterative versions, they
stop when no further reduction occurs. However, to achieve reasonable computation
time, they do not consider all possible clusterings. For this reason, they sometimes
terminate with a clustering that achieves a local minimum squared error. Furthermore,
in general, the clusterings that these algorithms generate depend on the choice of the
seed points. For example, if Forgy's algorithm is applied to the data in Figure 5.2
using (8,4) and (24, 12) as seed points, the algorithm terminates with the clusters
{(4,4), (8,4), (15,8)}, {(24,4), (24,12)}.     (5.2)

This clustering is different from the clustering produced in Example 5.5. The clustering (5.2) has a squared error of 104.7 whereas the clustering of Example 5.5 has a squared error of 94.0.
The clustering (5.2) produces a local minimum; the clustering of Example 5.5 can
be shown to produce a global minimum. For a given set of seed points, the resulting
clusters may also depend on the order in which the points are checked.
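The two squared errors quoted here are easy to verify with the squared-error definition of Section 5.2; a short sketch, assuming the sample coordinates listed in Example 5.5:

```python
def sq_error(cluster):
    """Squared error of one cluster (sum of squared distances to its centroid)."""
    n = len(cluster)
    mu = [sum(x[j] for x in cluster) / n for j in range(len(cluster[0]))]
    return sum((x[j] - mu[j]) ** 2 for x in cluster for j in range(len(mu)))

a, b, c, d, e = (4, 4), (8, 4), (15, 8), (24, 4), (24, 12)

# Clustering (5.2), obtained from the seed points (8,4) and (24,12): about 104.7.
print(sq_error([a, b, c]) + sq_error([d, e]))
# Clustering of Example 5.5, obtained from the seed points (4,4) and (8,4): 94.0.
print(sq_error([a, b]) + sq_error([c, d, e]))
```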
Besides the data and the seed points, input to the isodata algorithm includes the following parameters:

no_clusters: the desired number of clusters
min_elements: the minimum number of samples a cluster may contain
min_dist: the minimum distance between cluster centroids, used in the merging and splitting tests
split_size: the factor used in the splitting test (step 9)
iter_start: the maximum number of iterations of the Forgy part of the algorithm (steps 1 through 4)
max_merge: the maximum number of merges within one iteration
iter_body: the maximum number of iterations of the main body of the algorithm
Isodata Algorithm

1. Initialize the cluster centroids to the seed points. (Steps 1 through 4 are like Forgy's algorithm.)

2. For each sample, find the cluster centroid nearest it. Put the sample into the cluster identified with this nearest cluster centroid.

3. Compute the centroids of the resulting clusters.

4. If at least one sample changed clusters and the number of iterations is less than iter_start, go to step 2.

5. Discard clusters with fewer than min_elements samples. Also discard the samples they contain.

6. If the number of clusters is greater than or equal to 2 * no_clusters or the number of this iteration is even, execute step 7 (merging operation); otherwise, go to step 8.

7. If the distance between two cluster centroids is less than min_dist, merge these clusters and update the centroid; otherwise, go to step 8. Repeat this step up to max_merge times and then go to step 8.

8. If the number of clusters is less than or equal to no_clusters/2 or the number of this iteration is odd, execute step 9 (splitting operation); otherwise, go to step 10.

9. Find a cluster that has a standard deviation for some variable x which exceeds split_size * σ_x, where σ_x is the standard deviation of x within the original set of samples. If there is none, go to step 10. Compute the mean x̄ of x within the cluster. Separate the samples in this cluster into two sets, one consisting of the samples in which x is less than x̄ and the other consisting of the samples in which x is greater than or equal to x̄. If the distance between the centroids of these two sets is greater than or equal to 1.1 * min_dist, replace the original cluster by these two clusters; otherwise, do not split the cluster.

10. If the number of iterations is less than iter_body and the clusters changed during this iteration, go to step 2; otherwise, stop.
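The control flow of steps 1 through 10 can be compressed into a short sketch. The version below is illustrative only: the parameter names follow the text, clusters are kept as lists of sample indices, the 1.1 * min_dist test in the split step follows the worked example, and details such as tie-breaking and early stopping are our own simplifications.

```python
import math
import statistics

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(idx, X):
    return tuple(sum(X[i][j] for i in idx) / len(idx) for j in range(len(X[0])))

def forgy_pass(X, clusters, iter_start):
    """Steps 1-4: reassign samples to the nearest centroid, up to iter_start times."""
    for _ in range(iter_start):
        cents = [centroid(c, X) for c in clusters]
        new = [[] for _ in clusters]
        for i, x in enumerate(X):
            new[min(range(len(cents)), key=lambda k: dist(x, cents[k]))].append(i)
        new = [c for c in new if c]                      # drop any empty cluster
        if sorted(map(sorted, new)) == sorted(map(sorted, clusters)):
            break                                        # no sample changed clusters
        clusters = new
    return clusters

def isodata(X, seeds, no_clusters, min_elements, min_dist, split_size,
            iter_start, max_merge, iter_body):
    sigma = [statistics.pstdev(col) for col in zip(*X)]  # std of each feature overall
    clusters = [[i] for i in seeds]                      # seeds are sample indices
    for it in range(iter_body):
        clusters = forgy_pass(X, clusters, iter_start)              # steps 1-4
        clusters = [c for c in clusters if len(c) >= min_elements]  # step 5
        if len(clusters) >= 2 * no_clusters or it % 2 == 0:         # step 6
            for _ in range(max_merge):                              # step 7
                if len(clusters) < 2:
                    break
                cents = [centroid(c, X) for c in clusters]
                d, i, j = min((dist(cents[a], cents[b]), a, b)
                              for a in range(len(cents))
                              for b in range(a + 1, len(cents)))
                if d >= min_dist:
                    break
                clusters[i] += clusters.pop(j)
        if len(clusters) <= no_clusters / 2 or it % 2 == 1:         # step 8
            for c in list(clusters):                                # step 9
                for j in range(len(X[0])):
                    vals = [X[i][j] for i in c]
                    if statistics.pstdev(vals) > split_size * sigma[j]:
                        m = statistics.mean(vals)
                        lo = [i for i in c if X[i][j] < m]
                        hi = [i for i in c if X[i][j] >= m]
                        if lo and hi and dist(centroid(lo, X),
                                              centroid(hi, X)) >= 1.1 * min_dist:
                            clusters.remove(c)
                            clusters.extend([lo, hi])
                        break
        # Step 10 is simplified here: the main body just runs iter_body times.
    return [[X[i] for i in c] for c in clusters]
```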
Figure 5.13: The data for Example 5.7.
For step 6, the number of clusters is not greater than or equal to 2 * no_clusters, but the number (0) of this iteration is even, so the merging operation (step 7) is executed.
Since the distance between the centroids of the clusters {1,2,4,6} and {3,5,7} is less than min_dist, these clusters are merged. At this point, there are two clusters:

{1,2,3,4,5,6,7}, {8,9,10,11,12,13,14}.
The merge step is not repeated (since max_merge = 1), so proceed to step 8. (In this case, the remaining clusters could not be merged anyway since the distance between their centroids is greater than min_dist.)
For step 8, the number of clusters (2) is greater than no_clusters/2 = 1.5 and the number of this iteration is not odd, so go to step 10.
For step 10, since the number of iterations is less than the requested number (5) and the clusters changed, proceed to step 2.
This time the Forgy part of the algorithm (steps 1 through 4) does not change the clusters.
For step 5, since no cluster has fewer than min_elements members, none is discarded.
For step 6, the number of clusters is not greater than or equal to 2 * no_clusters and the number (1) of this iteration is odd, so proceed to step 8.
For step 8, the number of clusters (2) is greater than no_clusters/2 and the number of this iteration is odd, so the splitting operation (step 9) is executed.
For step 9, there is a cluster, {8, 9, 10, 11, 12, 13, 14}, that has a standard deviation for the variable y exceeding split_size * σ_y. The samples are then divided into two sets that have y values less than, or greater than or equal to, the mean value of y in the cluster:
{8, 9, 10, 11, 12, 13}, {14}.
Since the distance between their centroids is greater than or equal to 1.1 * min_dist, the cluster remains split.
At this point there are three clusters:
{1,2,3,4,5,6,7}, {8,9,10,11,12,13}, {14}.
For step 10, since the number of iterations is less than the requested number and the clusters changed, proceed to step 2.
Again, the Forgy part of the algorithm (steps 1 through 4) does not change the clusters.
For step 5, cluster {14} is discarded since it has fewer than min_elements members. At this point there are two clusters:
{1,2,3,4,5,6,7}, {8,9, 10, 11, 12, 13}.
For step 6, the number of clusters is less than 2 * no_clusters and the number (2) of this iteration is even, so the merging operation (step 7) is executed.
Since the distance between the centroids of the clusters {1,2,3,4,5,6,7} and {8,9,10,11,12,13} is not less than min_dist, these clusters are not merged. Proceed to step 8.
For step 8, the number of clusters (2) is greater than no_clusters/2 and the number of this iteration is even, so go to step 10.
For step 10, since the number of iterations is less than the requested number and the clusters changed, proceed to step 2.
Again the Forgy part of the algorithm (steps 1 through 4) does not change the clusters.
For step 5, since no cluster has fewer than min_elements members, none is discarded.
The isodata algorithm has been used in many engineering and scientific applica-
tions. Example 5.7 shows that the algorithm requires extensive computation, even for
a small data set. For large data sets, it may be too expensive to run on conventional
single-processor computers. [Tilton] reports that the isodata algorithm required seven hours of computing time on a VAX-11/780 to cluster the gray levels in a 512 × 512 image.
Figure 5.14: A digitized binary image of a handwritten X that has the feature vector (4, 10, 10, 4, 10, 9, 11, 9).
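Based on the description in the next paragraph (counting consecutive noncharacter pixels on a line running from a boundary position toward the center of the image), the following is a plausible sketch of the feature extraction; the particular choice of the eight boundary positions, the image representation, and the function name are assumptions, not taken from the text.

```python
def radial_features(image):
    """Count consecutive background (0) pixels along a line from each of eight
    boundary positions toward the center of a binary character image.

    image -- list of rows, each a list of 0 (background) and 1 (character) pixels.
    Returns a tuple of eight counts.
    """
    rows, cols = len(image), len(image[0])
    cy, cx = rows // 2, cols // 2
    # Eight boundary starting points: edge midpoints and corners (an assumption).
    starts = [(0, cx), (0, cols - 1), (cy, cols - 1), (rows - 1, cols - 1),
              (rows - 1, cx), (rows - 1, 0), (cy, 0), (0, 0)]
    features = []
    for y, x in starts:
        # Step from the boundary toward the center, counting background pixels
        # until the first character pixel (or the center itself) is reached.
        count = 0
        while (y, x) != (cy, cx) and image[y][x] == 0:
            count += 1
            y += (cy > y) - (cy < y)    # move one step toward the center row
            x += (cx > x) - (cx < x)    # move one step toward the center column
        features.append(count)
    return tuple(features)
```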
The isodata algorithm was applied to a data set consisting of 15 images of each of the handwritten characters 8, O, and X, written by a number of persons. Eight features were obtained from each sample by counting the number of consecutive noncharacter pixels on a line to the center, beginning at each of eight boundary positions (see Figure 5.14). Figure 5.15 gives the feature values and classes for this data set. (The class information is not used by the clustering algorithm.) The parameters chosen were
no_clusters = 4
min_elements = 8
min_dist = 2.5
split_size = 0.5
iter_start = 5
max_merge = 1
iter_body = 3

and the seed points were the first four samples.
In the first iteration of the isodata algorithm, after the maximum number (iter_start = 5) of iterations of the Forgy part of the algorithm (steps 1 through 4), four clusters were obtained, which contained the following numbers of samples from the three classes:
[Table: the number of samples from each class (8, O, X) in each of Clusters 1 through 4.]
For step 5, the cluster with fewer than min_elements members is discarded. At this point, the three remaining clusters contain the following numbers of samples from each class:
[Table: the number of samples from each class (8, O, X) in each of the three remaining clusters.]
For step 6, the number of clusters is not greater than or equal to 2 * no_clusters and the number (0) of this iteration is even, so the merging operation (step 7) is executed. However, no clusters are merged.

For step 8, the number of clusters is greater than no_clusters/2 and the number of this iteration is not odd, so go to step 10.

For step 10, the number of iterations is less than the requested number (iter_body = 3) and the clusters changed, so proceed to step 2.
In the second iteration of the isodata algorithm, the Forgy part of the algorithm
does not change the clusters.
For step 5, since no cluster has fewer than min_elements members, none is discarded.
For step 6, the number of clusters is not greater than or equal to 2 * no.clusters
and the number (1) of this iteration is odd, so proceed to step 8.
For step 8, the number of clusters is greater than no_clusters/2 and the number of this iteration is odd, so execute the splitting operation (step 9).
For step 9, a split occurs producing the clusters
[Table: the clusters produced by the split, with the number of samples from each class in each cluster.]
For step 10, since the number of iterations is less than the requested number and the clusters changed, proceed to step 2.

In the third and final iteration of the isodata algorithm, the Forgy part of the algorithm converges after two iterations to produce the clusters
[Table: the number of samples from each class (8, O, X) in each of the final clusters.]
Figure 5.15: Feature values and classes for the character data set.