UNIT: 7
CLUSTERING
Unit 7: Clustering

Sr. No.  Topics                                                      Hours (06)
1        Introduction, categorization of major clustering methods    1
2        Partitioning methods: K-Means                               1
3        Hierarchical method                                         1
4        Hierarchical: Agglomerative                                 1
5        Hierarchical: Divisive methods                              1
6        Model-based: Expectation                                    1
7        Model-based: Maximization                                   1
8                                                                    1
2
Clustering: Basic Idea
Groups similar data together into clusters
Clustering is accomplished by determining the
similarity among the data on predefined
attributes.
Most similar data are grouped into clusters
Partitioning or segmentation
Clustering is unsupervised, because
No previous categorization known
Totally data driven
Classes are formed after examining the data
3
Clustering : Application
▷ Marketing Management: Discover distinct groups
in customer bases, and then use this knowledge to
develop targeted marketing programs
▷ WWW
○ Cluster Weblog data to discover groups of
similar access patterns
▷ Text Mining
○ Grouping documents with similar characteristics
4
Clustering : Example
▷ A good clustering method will produce high quality
clusters
[Figure: scatter plot of points (*) falling into a few well-separated groups]
5
Similarity?
Groups of similar customers
Similar demographics
Similar buying behavior
Similar health
Similar products
Similar cost
Similar function
Similar store
…
Similarity usually is domain/problem specific
6
Clustering (Contd.)
▷ The output of a clustering algorithm consists
of a summarized representation of each
cluster.
▷ The type of summarized representation
depends strongly on the type & shape of
clusters the algorithm computes.
7
Types of Clustering
▷ Hard Clustering:
○ Each record is in one and only one cluster
▷ Soft Clustering:
○ Each record has a probability of being in each
cluster
8
Type of Clustering Algorithms
Hierarchical clustering
Divisive clustering (top down)
Agglomerative clustering (bottom up)
Partitional clustering
K-means clustering
K-medoids clustering
EM (expectation maximization) clustering
Density-Based Methods
Regions of dense points separated by sparser
regions of relatively low density
9
Type of Clustering Algorithms
▷ A Hierarchical clustering algorithm generates a tree
of clusters.
▷ It is categorized as Agglomerative and Divisive.
▷ The agglomerative approach starts with each object in its own cluster; nearby clusters are repeatedly merged into larger and larger clusters until some stopping criterion is met, or until all objects are merged into a single large cluster at the highest level of the hierarchy.
▷ The divisive approach starts by putting all data objects into one cluster and repeatedly splits clusters into smaller and smaller ones until a stopping criterion is met, or until each cluster contains only one object.
10
Agglomerative algorithm
Bottom-up approach:
1. Initially each item x1, . . . , xn is in its own cluster C1, . . . , Cn.
2. Repeat until there is only one cluster left: merge the nearest clusters, say Ci and Cj.
▷ The result is a cluster tree. One can cut the tree at any level to produce a different clustering.
11
12
Agglomerative algorithm
1. Allocate each point to a cluster of its own. So start
with n clusters for n objects.
2. Using a distance measure, at each step identify the two clusters to be merged.
Single link approach: the distance between two clusters is the smallest distance between any two of their members (nearest neighbor).
Complete link approach: the distance between two clusters is the largest distance between any two of their members (farthest neighbor).
In either case, the two closest clusters under the chosen measure are merged. There are other approaches too, e.g., average link.
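A minimal from-scratch sketch of this merge loop (not code from the slides; the toy distance matrix below is made up) may help make the single-link and complete-link variants concrete:

import numpy as np

def agglomerate(dist, linkage=min):
    # linkage=min gives single link, linkage=max gives complete link
    n = dist.shape[0]
    clusters = [[i] for i in range(n)]            # start: each point in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # inter-cluster distance under the chosen linkage
                d = linkage(dist[p, q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return merges

# toy symmetric distance matrix for four points A, B, C, D
dist = np.array([[0, 2, 5, 9],
                 [2, 0, 3, 8],
                 [5, 3, 0, 4],
                 [9, 8, 4, 0]], dtype=float)
for left, right, d in agglomerate(dist, linkage=min):    # try linkage=max too
    print(left, "+", right, "at distance", d)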
13
Agglomerative algorithm
Single link approach
3. Two clusters are merged if
min dist between any two points <= threshold dist
d(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} d(x, y)
4. Identify the pair of objects and merge them.
5. Compute all distances from the new cluster and
update the distance matrix after merger.
6. Continue merging process until only one cluster is left
or stop merging once the distance between clusters is
above the threshold.
▷ Single-linkage tends to produce stringy or elongated clusters; that is, clusters are chained together by single objects that happen to be close together.
14
Single Link Example
15
Agglomerative algorithm
Complete link approach
3. Two clusters are merged if
max dist between any two points <= threshold dist
d(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} d(x, y)
4. Identify the pair of objects and merge them.
5. Compute all distances from the new cluster and
update the distance matrix after merger.
6. Continue merging process until only one cluster is
left or stop merging once the distance between
clusters is above the threshold.
▷ Clusters tend to be compact and roughly equal in
diameter.
16
Complete Link Example
17
Agglomerative algorithm
Average link approach
d(c_i, c_j) = \frac{1}{|c_i| \cdot |c_j|} \sum_{x \in c_i} \sum_{x' \in c_j} d(x, x')
▷ Stop merging once the average distance between
clusters is above the threshold.
▷ Somewhere between single-linkage and complete-
linkage.
▷ and a million other ways you can think of …
18
Dendrogram: Hierarchical Clustering
• Clustering obtained by
cutting the dendrogram
at a desired level: each
connected component
forms a cluster.
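As a hedged illustration (not from the slides; the points, labels, and cut height below are made up), SciPy can build the cluster tree and cut the dendrogram at a chosen height:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])   # made-up points
Z = linkage(X, method="single")                           # cluster tree (merge table)

# cutting at height 3: every merge above distance 3 is undone,
# and each remaining connected component becomes one cluster
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)

dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.axhline(y=3.0, linestyle="--")                        # the cut level
plt.show()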
19
20
▷ Use single and complete link
agglomerative clustering to group the
data described by the following distance
matrix. Show the dendrograms.
21
Single link
Dendrogram
      AB   C   D
AB     0   2   5
C          0   3
D              0

      ABC  D
ABC    0   3
D          0
22
23
Complete link
Dendrogram
      AB   C   D
AB     0   4   6
C          0   3
D              0

      ABC  D
ABC    0   6
D          0
24
complete link: distance between two clusters is the longest distance
between a pair of elements from the two clusters
25
▷ Use single and complete link
agglomerative clustering to group the
data described by the following distance
matrix. Show the dendrograms.
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
26
Single link

Initial distance matrix as above. Merge {A,B} at d = 1 and {C,D} at d = 1.

      AB  CD   E
AB     0   2   3
CD         0   3
E              0

Merge {AB} and {CD} at d = 2, then merge {E} at d = 3.

       ABCD   E
ABCD      0   3
E             0

[Dendrogram: leaves A, B, C, D, E; merge heights 1, 2, 3]
27
Single link (merge trace)

d   K   Clusters               Comment
0   5   {A},{B},{C},{D},{E}    Start: each point is its own cluster
1   3   {AB},{CD},{E}          Merge {A},{B} and {C},{D}: d(A,B) = 1 <= 1 and d(C,D) = 1 <= 1
2   2   {ABCD},{E}             Merge {AB} and {CD}: the closest pair is d(A,C) = 2 <= 2
3   1   {ABCDE}                Merge {E}
28
Complete link

Initial distance matrix as above. Merge {A,B} at d = 1 and {C,D} at d = 1.

      AB  CD   E
AB     0   4   3
CD         0   5
E              0

Merge {AB} and {E} at d = 3, then merge with {CD} at d = 5.

       ABE  CD
ABE      0   5
CD           0

[Dendrogram: leaves A, B, E, C, D; merge heights 1, 3, 5]
29
d   K   Clusters               Comment
0   5   {A},{B},{C},{D},{E}    Start: each point is its own cluster
1   3   {AB},{CD},{E}          Merge {A},{B} and {C},{D}: d(A,B) = 1 <= 1 and d(C,D) = 1 <= 1
2   3   {AB},{CD},{E}          No merge: d(AB,E) = 3 > 2 and d(CD,E) = 5 > 2
3   2   {ABE},{CD}             Merge {AB} and {E}: d(AB,E) = 3 <= 3
4   2   {ABE},{CD}             No merge: d(ABE,CD) = 5 > 4
5   1   {ABCDE}                Merge the remaining two clusters
30
Average link
▷ In average-linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" in the figure is the average length of the arrows connecting the points of one cluster to the points of the other.
31
Average link cluster
32
33
34
35
36
37
38
Merge the clusters
39
Agglomerative algorithm
NOTE:
Start with:
▷ Initial no of clusters=n.
▷ Initial value of threshold dist, d = 0.
▷ Increment d in each iteration.
▷ Check condition (min dist<=d) for merging.
▷ Continue until you get one cluster.
40
Weakness of Agglomerative
▷ Major weakness of agglomerative clustering
methods
○ Do not scale well: time complexity of at least O(n²), where n is the total number of objects
○ Can never undo what was done previously
41
Hierarchical Clustering algorithms
▷ Divisive (top-down):
○ Generate a spanning tree (no loops in the graph)
○ Start with all objects belonging to the same cluster.
○ Eventually each node forms a cluster on its own.
▷ Does not require the number of clusters k in advance
▷ Needs a termination/readout condition
○ The final stage in both agglomerative and divisive clustering (a single cluster, or all singleton clusters) is of no use.
42
Overview
▷ Divisive Clustering starts by placing all
objects into a single group. Before we
start the procedure, we need to decide on
a threshold distance.
The procedure is as follows:
▷ The distance between all pairs of objects
within the same group is determined and
the pair with the largest distance is
selected.
43
Hierarchical Clustering

▷ Divisive approach
Initialization: all objects stay in one cluster
Iteration: select a cluster and split it into two sub-clusters
Until each leaf cluster contains only one object

[Figure: top-down splitting — {abcde} splits into {ab} and {cde}, {ab} into {a} and {b}, {cde} into {c} and {de}, {de} into {d} and {e}; steps 0 (top) to 4 (bottom)]
44
Overview-contd
▷ This maximum distance is compared to the
threshold distance.
○ If it is larger than the threshold, this group is
divided in two. This is done by placing the
selected pair into different groups and using
them as seed points. All other objects in this
group are examined, and are placed into the new
group with the closest seed point. The
procedure then returns to Step 1.
○ If the distance between the selected objects is
less than the threshold, the divisive clustering
stops.
▷ To run a divisive clustering, you simply need
to decide upon a method of measuring the
distance between two objects.
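A hedged sketch of the splitting step just described (the function name, data, and threshold are my own illustrations, not the slides'):

import numpy as np

def split_group(points, threshold):
    """Split one group using its farthest pair as seed points, per the steps above."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)        # pair with the largest distance
    if d[i, j] <= threshold:
        return None                                       # below threshold: stop splitting
    # assign every object to the closer of the two seed points
    to_i = np.linalg.norm(points - points[i], axis=1) <= np.linalg.norm(points - points[j], axis=1)
    return points[to_i], points[~to_i]

pts = np.array([[0.0, 0.0], [0.5, 0.2], [6.0, 6.0], [6.3, 5.8]])
result = split_group(pts, threshold=2.0)
if result is not None:
    group_1, group_2 = result
    print(group_1, group_2, sep="\n")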
45
[Figure: five points A, B, C, D, E with labelled distances d(A,B) = 1, d(B,C) = 2, d(C,D) = 1, d(D,E) = 3]

• Max d(E,D) = 3, so split into {E} and {ABCD}
• Max d(B,C) = 2, so split into {E}, {AB}, {CD}
• Finally {A}, {B}, {C}, {D}, {E}
46
Type of Clustering Algorithms
▷ A Partitional clustering algorithm partitions the
data into k groups such that some criterion that
evaluates the clustering quality is optimized.
▷ The number of clusters k is a parameter whose
value is specified by the user.
47
Partitioning Algorithms
▷ Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
▷ Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
○ Global optimal: exhaustively enumerate all
partitions
○ Heuristic methods: k-means and k-medoids
algorithms
○ k-means (MacQueen, 1967): Each cluster is
represented by the center of the cluster
○ k-medoids or PAM (Partition around medoids)
(Kaufman & Rousseeuw, 1987): Each cluster is
represented by one of the objects in the cluster
48
K-Means Clustering
▷ Given k, the k-means algorithm consists of
four steps:
○ Select initial centroids at random.
○ Assign each object to the cluster with the nearest
centroid.
○ Compute each centroid as the mean of the objects
assigned to it.
○ Repeat previous 2 steps until no change.
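A minimal NumPy sketch of these four steps (illustrative only; the sample data, k, and iteration cap are made up):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # 1. random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)                            # 2. assign to nearest centroid
        # 3. recompute each centroid as the mean of its assigned objects
        # (empty clusters are not handled, to keep the sketch short)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # 4. stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9], [8.5, 9.5]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)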
49
K-Means Clustering (contd.)
▷ Example
[Figure: scatter plots on 0–10 axes illustrating successive k-means iterations]
50
51
Example:
▷ Suppose we want to group the visitors to a website using just their age (a one-dimensional space) as follows:
▷ n = 19
▷ 15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65
52
▷ Initial clusters (random
centroid or average): k = 2
▷ c1 = 16
c2 = 22
53
Iteration 1: new centroids
c1 = 15.33
c2 = 36.25
54
Iteration 2: c1 = 18.56
c2 = 45.90
55
Iteration 3: c1 = 19.50
c2 = 47.89
56
Iteration 4: c1 = 19.50
c2 = 47.89 (no change, so the algorithm stops)
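A quick check of this one-dimensional example (a sketch, not the slides' code; ties are sent to the second cluster to match the grouping above):

ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
c1, c2 = 16.0, 22.0                      # initial centroids from the slide
while True:
    g1 = [a for a in ages if abs(a - c1) < abs(a - c2)]    # strictly closer to c1
    g2 = [a for a in ages if abs(a - c1) >= abs(a - c2)]   # ties go to c2
    new_c1, new_c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    if (new_c1, new_c2) == (c1, c2):     # no change: converged
        break
    c1, c2 = new_c1, new_c2
print(round(c1, 2), round(c2, 2))        # 19.5 and 47.89, as above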
57
Do it yourself
▷ Use k-means algorithm to cluster the
following points into 2 clusters:
▷ {2, 4, 10, 12, 3, 20, 30, 11, 25}
▷ Suppose that initial cluster centers (seed)
are 3 & 4. Run K-means algorithm for
three iterations.
58
Iteration 1 (m1 = 3, m2 = 4):

x    |x - m1|   |x - m2|   cluster
2        1          2         1
3        0          1         1
4        1          0         2
10       7          6         2
11       8          7         2
12       9          8         2
20      17         16         2
25      22         21         2
30      27         26         2

New centroids: m1 = (2 + 3) / 2 = 2.5, m2 = (4 + 10 + 11 + 12 + 20 + 25 + 30) / 7 = 16

Iteration 2 (m1 = 2.5, m2 = 16):

x    |x - m1|   |x - m2|   cluster
2       0.5        14         1
3       0.5        13         1
4       1.5        12         1
10      7.5         6         2
11      8.5         5         2
12      9.5         4         2
20     17.5         4         2
25     22.5         9         2
30     27.5        14         2

New centroids: m1 = (2 + 3 + 4) / 3 = 3, m2 = (10 + 11 + 12 + 20 + 25 + 30) / 6 = 18
59
Iteration 3 (m1 = 3, m2 = 18):

x    |x - m1|   |x - m2|   cluster
2        1         16         1
3        0         15         1
4        1         14         1
10       7          8         1
11       8          7         2
12       9          6         2
20      17          2         2
25      22          7         2
30      27         12         2

New centroids: m1 = (2 + 3 + 4 + 10) / 4 = 4.75, m2 = (11 + 12 + 20 + 25 + 30) / 5 = 19.6

Iteration 4 (m1 = 4.75, m2 = 19.6):

x    |x - m1|   |x - m2|   cluster
2       2.75      17.6        1
3       1.75      16.6        1
4       0.75      15.6        1
10      5.25       9.6        1
11      6.25       8.6        1
12      7.25       7.6        1
20     15.25       0.4        2
25     20.25       5.4        2
30     25.25      10.4        2

New centroids: m1 = (2 + 3 + 4 + 10 + 11 + 12) / 6 = 7, m2 = (20 + 25 + 30) / 3 = 25
60
Iteration 5 (m1 = 7, m2 = 25):

x    |x - m1|   |x - m2|   cluster
2        5         23         1
3        4         22         1
4        3         21         1
10       3         15         1
11       4         14         1
12       5         13         1
20      13          5         2
25      18          0         2
30      23          5         2

New centroids: m1 = 7 and m2 = 25 are unchanged, so the algorithm has converged.
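A short sketch (not the slides' code) that reproduces the centroid sequence in the tables above:

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 3.0, 4.0                        # initial seeds from the exercise
for step in range(5):
    c1 = [x for x in data if abs(x - m1) <= abs(x - m2)]   # no ties occur in this data
    c2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
    print("iteration", step + 1, sorted(c1), sorted(c2), m1, m2)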
61
62
Measurement metric
63
Euclidean

d(X, Y) = \sqrt{\sum_i (y_i - x_i)^2}

X = (2, 0), Y = (-2, -2)

[Figure: X and Y plotted on axes V1, V2]

d(X, Y) = \sqrt{(-2 - 2)^2 + (-2 - 0)^2} = \sqrt{4^2 + 2^2} = \sqrt{20} = 4.47
4.47 represents the “multivariate dissimilarity” of X & Y
64
Squared Euclidean

d(X, Y) = \sum_i (y_i - x_i)^2

X = (2, 0), Y = (-2, -2)

[Figure: X and Y plotted on axes V1, V2]

d(X, Y) = (-2 - 2)^2 + (-2 - 0)^2 = 4^2 + 2^2 = 20
Squared Euclidean is a little better at “noticing” strays
• remember that we use a square root transform to "pull in" outliers
• leaving the value squared makes the strays stand out a bit more
65
Manhattan distance
▷ The function computes the distance that would be travelled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the absolute differences of their corresponding components.
▷ The formula for this distance between a point X = (X1, X2, ...) and a point Y = (Y1, Y2, ...) is:
d(X, Y) = \sum_i |X_i - Y_i|
66
Manhattan distance or city block

d(X, Y) = \sum_i |y_i - x_i|

X = (2, 0), Y = (-2, -2)

[Figure: X and Y plotted on axes V1, V2]

d(X, Y) = |-2 - 2| + |-2 - 0| = 4 + 2 = 6

So named because in a city you have to go "around the block"; you can't "cut the diagonal".
67
Chebyshev metric
Chebyshev distance is also known as maximum value distance. It is the maximum absolute difference between the coordinates of two objects:
68
Chebyshev

d(X, Y) = \max_i |y_i - x_i|

X = (2, 0), Y = (-2, -2)

[Figure: X and Y plotted on axes V1, V2]

|-2 - 2| = 4
|-2 - 0| = 2
max(4, 2) = 4

Uses the "greatest univariate difference" to represent the multivariate dissimilarity.
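A small sketch of the four measures above, checked on X = (2, 0) and Y = (-2, -2) from the slides:

import math

def euclidean(x, y):
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))

def squared_euclidean(x, y):
    return sum((yi - xi) ** 2 for xi, yi in zip(x, y))

def manhattan(x, y):
    return sum(abs(yi - xi) for xi, yi in zip(x, y))

def chebyshev(x, y):
    return max(abs(yi - xi) for xi, yi in zip(x, y))

X, Y = (2, 0), (-2, -2)
print(euclidean(X, Y))          # 4.47...
print(squared_euclidean(X, Y))  # 20
print(manhattan(X, Y))          # 6
print(chebyshev(X, Y))          # 4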
69
A simple example showing the implementation of the k-means algorithm (K = 2), using the Euclidean distance metric.
70
71
72
73
74
75
76
77
▷ A simple example showing the implementation of the k-means algorithm in two dimensions (K = 3), using the Manhattan distance metric.

Point   X   Y
A1      2  10
A2      2   5
A3      8   4
A4      5   8
A5      7   5
A6      6   4
A7      1   2
A8      4   9

Initial centroids: c1(2, 10), c2(5, 8), c3(1, 2)
78
K-means iterations (Manhattan distance)

Iteration 1, centroids c1(2, 10), c2(5, 8), c3(1, 2):

Point  X   Y   d(c1)  d(c2)  d(c3)  Cluster
A1     2  10     0      5      9      C1
A2     2   5     5      6      4      C3
A3     8   4    12      7      9      C2
A4     5   8     5      0     10      C2
A5     7   5    10      5      9      C2
A6     6   4    10      5      7      C2
A7     1   2     9     10      0      C3
A8     4   9     3      2     10      C2

New centroids for the second iteration:
c1 = {A1} => (2, 10)
c2 = {A3, A4, A5, A6, A8} => ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (30/5, 30/5) = (6, 6)
c3 = {A2, A7} => ((2+1)/2, (5+2)/2) = (1.5, 3.5)

Iteration 2, centroids c1(2, 10), c2(6, 6), c3(1.5, 3.5):

Point  X   Y   d(c1)  d(c2)  d(c3)  Cluster
A1     2  10     0      8      7      C1
A2     2   5     5      5      2      C3
A3     8   4    12      4      7      C2
A4     5   8     5      3      8      C2
A5     7   5    10      2      7      C2
A6     6   4    10      2      5      C2
A7     1   2     9      9      2      C3
A8     4   9     3      5      8      C1

New centroids for the third iteration: c1(3, 9.5), c2(6.5, 5.25), c3(1.5, 3.5)
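A short sketch (not the slides' code) that reproduces the two assignment rounds and the centroids above:

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centroids = [(2, 10), (5, 8), (1, 2)]                    # c1, c2, c3

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

for it in range(2):
    # assign each point to the centroid with the smallest Manhattan distance
    clusters = [[] for _ in centroids]
    for name, p in points.items():
        nearest = min(range(len(centroids)), key=lambda c: manhattan(p, centroids[c]))
        clusters[nearest].append(name)
    # recompute each centroid as the mean of its assigned points
    centroids = [(sum(points[n][0] for n in cl) / len(cl),
                  sum(points[n][1] for n in cl) / len(cl)) for cl in clusters]
    print("after iteration", it + 1, clusters, centroids)
# the centroids printed after iteration 2 are the third-iteration centroids above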
79
80
Comments on the K-Means Method
81
EM Model
82
Soft Clustering
▷ Clustering typically assumes that each
instance is given a “hard” assignment to
exactly one cluster.
▷ Does not allow uncertainty in class
membership or for an instance to belong to
more than one cluster.
▷ Soft clustering gives probabilities that an
instance belongs to each of a set of clusters.
▷ Each instance is assigned a probability
distribution across a set of discovered
categories (probabilities of all categories
must sum to 1).
83
Model based clustering
▷ Algorithm optimizes a probabilistic model criterion
▷ Soft clustering is usually done by the Expectation
Maximization (EM) algorithm
○ Gives a soft variant of the K-means algorithm
○ Assume k clusters: {c1, c2,… ck}
○ Assume a probabilistic model of categories that
allows computing P(ci | E) for each category, ci, for
a given example, E.
○ For text, typically assume a naïve Bayes category
model.
○ Parameters θ = {P(ci), P(wj | ci) : i ∈ {1, …, k}, j ∈ {1, …, |V|}}
84
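The slides assume a naive Bayes model for text; as a generic illustration of soft, model-based clustering (my substitution, using scikit-learn rather than a text model), a Gaussian mixture fitted by EM gives each point a probability of belonging to each cluster:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),       # made-up data: two blobs
               rng.normal(5, 1, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM under the hood
probs = gmm.predict_proba(X)     # soft assignments: each row is P(c_i | E), summing to 1
print(probs[:3].round(3))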
Expectation Maximization (EM)
Algorithm
▷ EM algorithm provides a general approach to learning in
presence of unobserved variables
▷ Iterative method for learning probabilistic
categorization model from unsupervised data.
▷ Initially assume random assignment of examples to
categories.
▷ Learn an initial probabilistic model by estimating model
parameters from this randomly labeled data.
▷ Iterate following two steps until convergence:
○ Expectation (E-step): Compute P(ci | E) for each example
given the current model, and probabilistically re-label
the examples based on these posterior probability
estimates.
○ Maximization (M-step): Re-estimate the model parameters, θ, from the probabilistically re-labeled data.
85
Applications
▷ Filling in missing data in samples
▷ Discovering the value of hidden variables
▷ Estimating the parameters of Hidden
Markov Model (HMM)
▷ Unsupervised learning of clusters
▷ Semi-supervised classification and
clustering.
86
Example
87
88
89
90
EM Example
91
92
Step 1
▷ Randomly select two probabilities:
▷ θa: probability of heads for coin C1
▷ θb: probability of heads for coin C2
▷ Suppose: θa = 0.6
θb = 0.5
93
Step 2:
▷ Calculate the individual likelihood of C1 and C2 for the first experiment, which is:
▷ H: 5
▷ T: 5
94
Step 3: compute the likelihood
95
Step 4
▷ Normalize the data, i.e., compute the probabilities from the calculated likelihoods.
▷ The resulting values for coins A and B are the estimation (E-step) parameters.
96
Step 4
▷ Repeat the process for all the experiments; in our case there are 5 experiments.
97
Step 5
98
Step 6:
▷ Compute the total H and T attributed to C1 and C2; in our case:
C1 => 21.3 H and 8.6 T
C2 => 11.7 H and 8.4 T
99
Step 7
Calculate the new probability of heads for C1 and C2.
This is the maximization (M-step) parameter θ for each coin.
100
Step 8
▷ Repeat the process until you get the same probabilities again.
▷ In our case it is:
101
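A hedged sketch of this two-coin EM walk-through. Only the first experiment (5 heads, 5 tails) and the totals are stated on the slides; the remaining four flip counts below are assumptions taken from the standard version of this example, chosen so the totals come out to roughly 21.3 H / 8.6 T for coin A and 11.7 H / 8.4 T for coin B.

experiments = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]   # (heads, tails); all but the first are assumed
theta_a, theta_b = 0.6, 0.5                              # step 1: initial guesses

for _ in range(20):
    heads_a = tails_a = heads_b = tails_b = 0.0
    for h, t in experiments:
        # E-step: likelihood of each coin for this experiment, then normalize
        la = theta_a ** h * (1 - theta_a) ** t
        lb = theta_b ** h * (1 - theta_b) ** t
        wa, wb = la / (la + lb), lb / (la + lb)
        heads_a += wa * h; tails_a += wa * t             # expected counts for coin A
        heads_b += wb * h; tails_b += wb * t             # expected counts for coin B
    # M-step: re-estimate the head probability of each coin
    new_a = heads_a / (heads_a + tails_a)
    new_b = heads_b / (heads_b + tails_b)
    if abs(new_a - theta_a) < 1e-6 and abs(new_b - theta_b) < 1e-6:
        break                                            # same probabilities again: stop
    theta_a, theta_b = new_a, new_b

print(round(theta_a, 2), round(theta_b, 2))              # roughly 0.80 and 0.52 for the assumed data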
EM Example (Do it yourself)
102
Step 1
▷ Randomly select two probabilities:
▷ θa: probability of heads for coin C1
▷ θb: probability of heads for coin C2
▷ Suppose: θa = 0.6
θb = 0.5
103
Step 2:
▷ Calculate the individual likelihood of C1 and C2 for the first experiment, which is:
▷ H: 3
▷ T: 2
104
Step 3: compute the likelihood
First round = 3 Heads and 2 Tails
Likelihood of A = (0.6)^3 * (0.4)^2 = 0.03456
Likelihood of B = (0.5)^3 * (0.5)^2 = 0.03125
After normalization:
Expected value of A = 0.03456 / (0.03456 + 0.03125) = 0.5251
Expected value of B = 0.03125 / (0.03456 + 0.03125) = 0.4748
105
Step 4
▷ Repeat the process for all the experiments; in our case there are 5 experiments.
106
Round        Coin A (Heads)       Coin A (Tails)       Coin B (Heads)       Coin B (Tails)
3 H, 2 T     0.5251*3 = 1.57      0.5251*2 = 1.05      0.4748*3 = 1.424     0.4748*2 = 0.9496
5 H, 0 T     0.5251*5 = 2.6255    0                    0.4748*5 = 2.374     0
4 H, 1 T     0.5251*4 = 2.1       0.5251*1 = 0.5251    0.4748*4 = 1.899     0.4748*1 = 0.4748
2 H, 3 T     0.5251*2 = 1.05      0.5251*3 = 1.57      0.4748*2 = 0.9496    0.4748*3 = 1.424
4 H, 1 T     0.5251*4 = 2.1       0.5251*1 = 0.5251    0.4748*4 = 1.899     0.4748*1 = 0.4748
Total        Heads A = 9.4455     Tails A = 3.67       Heads B = 8.545      Tails B = 3.319
107
▷ Calculate the new probability of heads for C1 and C2.
▷ This is the maximization parameter θ for each coin:
▷ θa = 9.44 / (9.44 + 3.67) = 0.72
▷ θb = 8.5456 / (8.5456 + 3.32) = 0.72
▷ Repeat the process until you get the same probabilities again.
108
Limitations
▷ The EM algorithm can be very slow, even on a fast computer. It works best when only a small percentage of the data is missing and the dimensionality of the data is not too big. The higher the dimensionality, the slower the E-step; for data with large dimensionality, the E-step may run extremely slowly as the procedure approaches a local maximum.
109
110