2002 Spring CS525 Lecture 2
2002 Spring CS525 Lecture 2
                          0                               
                          d(2,1)                          
• Dissimilarity matrix                  0                 
                          d(3,1) d ( 3,2) 0               
  – (one mode)                                            
                          :             :     :           
                          d ( n,1) d ( n,2) ...    ... 0
             Measure the Quality of
                  Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in
  terms of a distance function, which is typically metric:
          d(i, j)
• There is a separate “quality” function that measures the
  “goodness” of a cluster.
• The definitions of distance functions are usually very
  different for interval-scaled, boolean, categorical, ordinal
  and ratio variables.
• Weights should be associated with different variables
  based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
   – the answer is typically highly subjective.
                 Major Clustering
                  Approaches
• Partitioning algorithms: Construct various partitions and
  then evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition
  of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the
  clusters and the idea is to find the best fit of that model to
  each other
        Partitioning Algorithms: Basic
                   Concept
• Partitioning method: Construct a partition of a
  database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that
  optimizes the chosen partitioning criterion
   – Global optimal: exhaustively enumerate all partitions
   – Heuristic methods: k-means and k-medoids algorithms
   – k-means (MacQueen’67): Each cluster is represented by
     the center of the cluster
   – k-medoids or PAM (Partition around medoids) (Kaufman
     & Rousseeuw’87): Each cluster is represented by one of
     the objects in the cluster
The K-Means Clustering Method
2                                                 each
                                                            2                                                                                          the       2
1
                                                  objects
                                                            1
                                                            0
                                                                                                                                                       cluster   1
                                                                                                                                                                 0
0
     0   1   2   3   4   5   6   7   8   9   10   to
                                                                 0       1       2       3       4       5       6       7       8       9       10    means          0       1       2       3       4       5       6       7       8       9       10
                                                  most
                                                  similar                                                        reassign                                                                                             reassign
                                                  center     10                                                                                                   10
K=2 9 9
8 8
Arbitrarily choose 7 7
                                                                 6                                                                                                    6
     K object as initial                                         5                                                                                                    5
                                                                 2
                                                                                                                                                       the            3
1 cluster 1
                                                                 0
                                                                     0       1       2       3       4       5       6       7       8       9    10
                                                                                                                                                       means          0
                                                                                                                                                                          0       1       2       3       4       5       6       7       8       9    10
     Comments on the K-Means
            Method
• Strength: Relatively efficient: O(tkn), where n is #
  objects, k is # clusters, and t is # iterations. Normally, k,
  t << n.
       • Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
• Weakness
   – Applicable only when mean is defined, then what about
     categorical data?
   – Need to specify k, the number of clusters, in advance
   – Unable to handle noisy data and outliers
   – Not suitable to discover clusters with non-convex shapes
  Variations of the K-Means Method
– Dissimilarity calculations
                   10                                                10
                   9                                                 9
                   8                                                 8
                   7                                                 7
                   6                                                 6
                   5                                                 5
                   4                                                 4
                   3                                                 3
                   2                                                 2
                   1                                                 1
                   0                                                 0
                        0   1   2   3   4   5   6   7   8   9   10        0   1   2   3   4   5   6   7   8   9   10
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
    – starts from an initial set of medoids and iteratively replaces one of the
      medoids by one of the non-medoids if it improves the total distance of
      the resulting clustering
    – PAM works effectively for small data sets, but does not scale well for
      large data sets
9 9 9
8 8 8
7 7 7
 6
                                                     Arbitrar     6
                                                                                                                                              Assign      6
 5
                                                     y            5                                                                           each        5
4 choose 4 remaini 4
 3
                                                     k object     3
                                                                                                                                              ng          3
 2
                                                     as           2
                                                                                                                                              object      2
                                                                  1                                                                                       1
                                                     initial                                                                                  to
 1
 0                                                                0                                                                                       0
     0   1   2   3   4   5   6   7   8   9   10
                                                     medoid            0       1   2   3   4       5       6       7       8       9    10
                                                                                                                                              nearest          0       1   2   3   4       5       6       7       8       9    10
                                                     s                                                                                        medoid
K=2                                                                                                                                           s                    Randomly select a
                                                                   Total Cost = 26                                                                                 nonmedoid
                                                                                                                                                                   object,Oramdom
                                                                  10                                                                                      10
Do loop 9
                                                                       8
                                                                                                                                             Compute
                                                                                                                                                               9
                                                                                                                                                               8
                                                  Swapping             7                                                                     total cost        7
     Until no                                     O and                6
                                                                                                                                             of                6
                                                  Oramdom
     change
                                                                       5                                                                                       5
                                                                       4
                                                                                                                                             swapping          4
If quality is 3
                                                                       2
                                                                                                                                                               3
improved. 1 1
                                                                       0                                                                                       0
                                                                           0   1   2   3   4   5       6       7       8       9   10                              0   1   2   3   4   5       6       7       8       9   10
 PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987), built in Splus
• Use real object to represent the cluster
   – Select k representative objects arbitrarily
   – For each pair of non-selected object h and selected object i,
     calculate the total swapping cost TCih
   – For each pair of i and h,
       • If TCih < 0, i is replaced by h
       • Then assign each non-selected object to the most similar
         representative object
   – repeat steps 2-3 until there is no change
PAM Clustering: Total swapping cost
                                                                                                     TCih=jCjih
  10                                                                                                                              10
  9                                                                                                                               9
                                                                                                                                                                                   j
  8
                           t                                                                                                      8
                                                                                                                                                           t
  7                                                                                                                               7
  5
                                                                           j                                                      6
  4
                                                       i                   h                                                      4
                                                                                                                                                                                                   h
  3
  2
                                                                                                                                  3
                                                                                                                                  2
                                                                                                                                                                                                   i
  1                                                                                                                               1
  0                                                                                                                               0
       0           1       2       3       4       5           6       7       8       9   10                                          0       1       2           3       4           5       6        7   8   9   10
                                                                                                             10
       10
                                                                                                             9
           9
           8
                               h                                                                             8
                                                               j
                                                                                                             7
           7
                                                                                                             6
           6
           5
                                                                                                             5                    i
                                                   i                                                                                                       h                           j
                                                                   t
                                                                                                             4
           4
                                                                                                             3
           3
           2
                                                                                                             2
                                                                                                             1
                                                                                                                                                               t
           1
                                                                                                             0
           0
                                                                                                                  0   1   2   3            4       5       6           7       8           9       10
               0       1       2       3       4           5       6       7       8       9    10
Threshold of
1 2 34 5
                    A B C D E
            MST Example
                            A       B
    A B C D E
A   0   1   2   2   3
B   1   0   2   4   3   E               C
C   2   2   0   1   5
D   2   4   1   0   3
E   3   3   5   3   0           D
Agglomerative Algorithm
                 Single Link
• View all items with links (distances) between them.
• Finds maximal connected components in this graph.
• Two clusters are merged if there is at least one edge
  which connects them.
• Uses threshold distances at each level.
• Could be agglomerative or divisive.
MST Single Link Algorithm
Single Link Clustering
                                       AGNES (Agglomerative
                                            Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Use the Single-Link method and the dissimilarity matrix.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
10 10 10
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
  0                                                 0                                                 0
       0   1   2   3   4   5   6   7   8   9   10        0   1   2   3   4   5   6   7   8   9   10        0   1   2   3   4   5   6   7   8   9   10
                                  DIANA (Divisive Analysis)
 10                                                                                                  10
                                                   10
  9                                                                                                  9
                                                   9
  8                                                                                                  8
                                                   8
  7                                                                                                  7
                                                   7
  6                                                                                                  6
                                                   6
  5                                                                                                  5
                                                   5
  4                                                                                                  4
                                                   4
  3                                                                                                  3
                                                   3
  2                                                                                                  2
                                                   2
  1                                                                                                  1
                                                   1
  0                                                                                                  0
                                                   0
      0   1   2   3   4   5   6   7   8   9   10                                                          0   1   2   3   4   5   6   7   8   9   10
                                                        0   1   2   3   4   5   6   7   8   9   10
                           Readings
•   CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic
    Modeling. George Karypis, Eui-Hong Han, Vipin Kumar, IEEE Computer
    32(8): 68-75, 1999 (http://glaros.dtc.umn.edu/gkhome/node/152)
•   BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining
    and Knowledge Discovery Volume 1 , Issue 2 (1997)