Clustering
• Example data: four medicines, each described by two attributes (Weight, pH-Index).

  Medicine   Weight   pH-Index
  A          1        1
  B          2        1
  C          4        3
  D          5        4
1. Partitioning Methods: k-Means
K-means - Example
• Step 1: Use initial seed points for partitioning: c1 = A = (1, 1), c2 = B = (2, 1).
• Assign each object to the nearest seed; for example, for D = (5, 4):
  d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
  d(D, c2) = √((5 − 2)² + (4 − 1)²) ≈ 4.24
• This yields the clusters {A} and {B, C, D}; recompute the centroids:
  c1 = (1, 1)
  c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
1. Partitioning Methods: k-Means
K-means - Example
• Step 2: Renew membership based on the new centroids; the clusters become {A, B} and {C, D}. Recompute the centroids:
  c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
  c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
1. Partitioning Methods: k-Means
K-means - Example
• Step 3: Repeat the first two steps until convergence
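The walkthrough above can be reproduced with a minimal k-means sketch in Python (pure NumPy; the variable names and the fixed seeds c1 = A, c2 = B are chosen here only to mirror the example):

```python
import numpy as np

# Four medicines with attributes (Weight, pH-Index): A, B, C, D
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
centroids = X[[0, 1]].copy()          # Step 1: initial seeds c1 = A, c2 = B

for step in range(10):                # Step 3: repeat until convergence
    # assign each object to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recompute each centroid as the mean of its cluster members
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)     # expected final clusters: {A, B} and {C, D}
print(centroids)  # expected final centroids: (1.5, 1.0) and (4.5, 3.5)
```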
2. Density-Based Methods: DBSCAN
• Major features of density-based clustering:
  ✓ Discover clusters of arbitrary shape
  ✓ Handle noise
  ✓ One scan
  ✓ Need density parameters as termination condition
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• The density of an object o can be measured by the number of
objects close to o.
• DBSCAN finds core objects, that is, objects that have dense
neighborhoods. It connects core objects and their neighborhoods to
form dense regions as clusters.
2. Density-Based Methods: DBSCAN
• Two parameters:
  • Eps (ϵ): Maximum radius of the neighbourhood
  • MinPts: Minimum number of points in an Eps-neighbourhood of that point
• NEps(p): {q belongs to D | dist(p, q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  • p belongs to NEps(q)
  • core point condition: |NEps(q)| ≥ MinPts
• (Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm.)
2. Density-Based Methods: DBSCAN
• Density-Reachable and Density-Connected
• Density-reachable:
  • A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• Density-connected:
  • A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
• (Figures: a chain q = p1, …, pn = p; and p, q both density-reachable from a point o.)
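The definitions above translate directly into code. A small sketch in Python (pure NumPy; the helper names, toy coordinates, and parameter values are made up for illustration):

```python
import numpy as np

def eps_neighborhood(D, p, eps):
    """Indices of all points q in D with dist(p, q) <= Eps (includes p itself)."""
    dists = np.linalg.norm(D - D[p], axis=1)
    return np.flatnonzero(dists <= eps)

def is_core(D, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(D, p, eps)) >= min_pts

def directly_density_reachable(D, p, q, eps, min_pts):
    """p is directly density-reachable from q iff q is a core point and p is in N_Eps(q)."""
    return is_core(D, q, eps, min_pts) and p in eps_neighborhood(D, q, eps)

# toy data: a small dense blob plus one distant point
D = np.array([[0, 0], [0.2, 0], [0, 0.2], [0.1, 0.1], [3, 3]])
print(is_core(D, 0, eps=0.5, min_pts=3))                        # True
print(directly_density_reachable(D, 4, 0, eps=0.5, min_pts=3))  # False: point 4 is too far away
```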
2. Density-Based Methods: DBSCAN
• Given that ϵ represents the radius of the circles, and MinPts = 3:
• Of the labeled points, m, p, o, and r are core objects because each is in an ϵ-neighborhood containing at least three points. Object q is directly density-reachable from m. Object m is directly density-reachable from p and vice versa.
• Object q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q because q is not a core object.
2. Density-Based Methods: DBSCAN
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
• (Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.)
2. Density-Based Methods: DBSCAN
• Due to the fixed neighborhood size parameterized by ϵ, the density
of a neighborhood can be measured simply by the number of objects
in the neighborhood.
• To determine whether a neighborhood is dense or not, DBSCAN uses
another user-specified parameter, MinPts, which specifies the
density threshold of dense regions.
• An object is a core object if the ϵ-neighborhood of the object
contains at least MinPts objects.
• Core objects are the pillars of dense regions.
2. Density-Based Methods: DBSCAN
• Advantages
➢Clusters can have arbitrary shape and size
➢Number of clusters is determined automatically
➢Can separate clusters from surrounding noise
• Disadvantages
➢Input parameters may be difficult to determine
➢In some situations very sensitive to input parameter setting
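As a quick illustration of these properties, here is a hedged sketch using scikit-learn's DBSCAN (assuming scikit-learn is installed; the two-moons data and the parameter values are chosen only for demonstration):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaving half-circles: an arbitrary-shaped pattern that k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = Eps, min_samples = MinPts
labels = db.labels_                          # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)         # number of clusters is determined automatically
print("noise points:", (labels == -1).sum())
```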
2. Density-Based Methods: OPTICS
• DBSCAN requires the input parameters ϵ and MinPts; however, it is not easy for the user to set these parameters.
• Such parameter settings are usually set empirically and are difficult to determine, especially for real-world, high-dimensional datasets.
• Most algorithms are sensitive to these parameter values: Slightly
different settings may lead to very different clusterings of the data.
• To overcome the difficulty in using one set of global parameters in
clustering analysis, a cluster analysis method called OPTICS was
proposed.
• OPTICS does not explicitly produce a dataset clustering. Instead, it
outputs a cluster ordering.
2. Density-Based Methods: OPTICS
• Objects in a denser cluster are listed closer to each other in the
cluster ordering.
• This ordering is equivalent to density-based clustering obtained from
a wide range of parameter settings.
• Thus, OPTICS does not require the user to provide a specific density
threshold.
• To construct the different clusterings simultaneously, the objects are
processed in a specific order.
• This order selects an object that is density-reachable with respect to
the lowest ϵ value so that clusters with higher density (lower ϵ) will
be finished first.
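A brief sketch of how such a cluster ordering can be obtained in practice with scikit-learn's OPTICS implementation (assumed available; the synthetic data and min_samples value are illustrative only):

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# two blobs of different density plus scattered background noise
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 1.0, (100, 2)),
               rng.uniform(-2, 8, (20, 2))])

opt = OPTICS(min_samples=5).fit(X)   # no global eps / density threshold required
order = opt.ordering_                # the cluster ordering of the objects
reach = opt.reachability_[order]     # reachability-distances in that order
# valleys in `reach` correspond to clusters; denser clusters give deeper valleys
print(reach[:10])
```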
2. Density-Based Methods: OPTICS
OPTICS needs two important pieces of information per object:
1. The core-distance of an object p is the smallest value ϵ′ such that the ϵ′-neighborhood of p has at least MinPts objects.
  • That is, ϵ′ is the minimum distance threshold that makes p a core object. If p is not a core object with respect to ϵ and MinPts, the core-distance of p is undefined.
2. The reachability-distance to object p from q is the minimum radius
value that makes p density-reachable from q.
• According to the definition of density-reachability, q has to be a core
object and p must be in the neighborhood of q.
• Therefore, the reachability-distance from q to p is max{core-
distance(q), dist(p, q)}. If q is not a core object with respect to ϵ and
MinPts, the reachability-distance to p from q is undefined.
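These two quantities can be computed directly from the definitions. A minimal sketch (pure NumPy; the function names and the toy data are my own):

```python
import numpy as np

def core_distance(D, p, eps, min_pts):
    """Smallest radius eps' such that the eps'-neighborhood of p has at least MinPts objects;
    None (undefined) if p is not a core object w.r.t. eps and MinPts."""
    dists = np.sort(np.linalg.norm(D - D[p], axis=1))   # includes dist(p, p) = 0
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(D, p, q, eps, min_pts):
    """Minimum radius that makes p density-reachable from q:
    max{core-distance(q), dist(p, q)}; None (undefined) if q is not a core object."""
    cd = core_distance(D, q, eps, min_pts)
    if cd is None:
        return None
    return max(cd, np.linalg.norm(D[p] - D[q]))

D = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [0.3, 0], [2.0, 0]])
print(core_distance(D, 1, eps=0.5, min_pts=3))             # ~0.1
print(reachability_distance(D, 4, 1, eps=0.5, min_pts=3))  # max(0.1, 1.9) = 1.9
```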
3. Grid-Based Methods
• Grid-based clustering method takes a space-driven approach by
partitioning the embedding space into cells independent of the
distribution of the input objects.
• The grid-based clustering approach uses a multiresolution grid data
structure.
• It quantizes the object space into a finite number of cells that form a
grid structure on which all of the operations for clustering are
performed.
• The main advantage of the approach is its fast processing time,
which is typically independent of the number of data objects.
3. Grid-Based Methods: CLIQUE
• CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces.
• CLIQUE partitions each dimension into nonoverlapping intervals,
thereby partitioning the entire embedding space of the data objects
into cells.
• It uses a density threshold to identify dense cells and sparse ones. A
cell is dense if the number of objects mapped to it exceeds the
density threshold.
3. Grid-Based Methods: CLIQUE
• CLIQUE is a density-based and grid-based subspace clustering
algorithm
• Grid-based: It discretizes the data space through a grid and estimates
the density by counting the number of points in a grid cell
• Density-based: A cluster is a maximal set of connected dense units in
a subspace
• A unit is dense if the fraction of total data points contained in the
unit exceeds the input model parameter.
• Subspace clustering: A subspace cluster is a set of neighboring dense
cells in an arbitrary subspace. It also discovers some minimal
descriptions of the clusters
• It automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space, using the Apriori principle (the grid-density idea is illustrated in the sketch below)
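To make the grid-based density estimation concrete, here is a minimal sketch (not CLIQUE itself: the Apriori-style subspace search and the minimal cluster descriptions are omitted) that discretizes a 2-D space into cells and keeps the dense ones; the data, the number of intervals, and the density threshold are all made up for illustration:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.2, (200, 2)),   # dense region
               rng.uniform(0, 10, (50, 2))])        # background noise

xi = 10      # number of non-overlapping intervals per dimension
tau = 0.05   # density threshold: fraction of all points a dense cell must exceed

# map every point to its grid cell (a tuple of interval indices, one per dimension)
mins, maxs = X.min(axis=0), X.max(axis=0)
cells = np.floor((X - mins) / (maxs - mins + 1e-12) * xi).astype(int)
counts = Counter(map(tuple, cells))

dense_cells = {c for c, n in counts.items() if n / len(X) > tau}
print("dense cells:", dense_cells)   # cells covering the blob around (2, 2)
```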
3. Grid-Based Methods: CLIQUE
• Bottom-up approach
4. Hierarchical
• Two strategies for hierarchical clustering:
  • Agglomerative (bottom-up):
    • Initialization: each object is a cluster
    • Iteration: merge the two clusters which are most similar to each other
    • Until all objects are merged into a single cluster
  • Divisive (top-down):
    • Initialization: all objects stay in one cluster
    • Iteration: select a cluster and split it into two sub-clusters
    • Until each leaf cluster contains only one object
4. Hierarchical
• Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}
• How to define the inter-cluster distance:
  • Single-link: dmin(Ci, Cj) = min over p ∈ Ci, q ∈ Cj of d(p, q)
  • Complete-link: dmax(Ci, Cj) = max over p ∈ Ci, q ∈ Cj of d(p, q)
  • Average-link: davg(Ci, Cj) = average over p ∈ Ci, q ∈ Cj of d(p, q); the distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters
  • Centroid distance: dmean(Ci, Cj) = d(mi, mj)
4. Hierarchical
How to Define Inter-Cluster Distance
• mi, mj are the means of Ci and Cj, respectively.
• Example: objects a, b, c, d, e lie at positions 1, 2, 4, 5, 6 on a line, with C1 = {a, b} and C2 = {c, d, e}.
1. Calculate the distance matrix:

       a   b   c   d   e
   a   0   1   3   4   5
   b   1   0   2   3   4
   c   3   2   0   1   2
   d   4   3   1   0   1
   e   5   4   2   1   0

2. Calculate the three cluster distances between C1 and C2:
   Single link:
   dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2
   Complete link:
   dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5
   Average:
   dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
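The three cluster distances above can be checked with a few lines of Python (the object positions 1, 2, 4, 5, 6 are taken directly from the example):

```python
from itertools import product

pos = {'a': 1, 'b': 2, 'c': 4, 'd': 5, 'e': 6}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']

# all pairwise distances between objects belonging to different clusters
pair_dists = [abs(pos[p] - pos[q]) for p, q in product(C1, C2)]

print(min(pair_dists))                    # single link: 2
print(max(pair_dists))                    # complete link: 5
print(sum(pair_dists) / len(pair_dists))  # average link: 3.5
```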
4. Hierarchical
Agglomerative Algorithm
• The agglomerative algorithm is carried out in three steps:
  1. Convert the data matrix into a distance matrix (e.g., using Euclidean distance) and treat each object as its own cluster.
  2. Merge the two closest clusters.
  3. Update the distance matrix; repeat steps 2 and 3 until the termination condition is met (e.g., a single cluster remains).
4. Hierarchical
• Merge two closest clusters (iteration 1)
4. Hierarchical
• Update distance matrix (iteration 1)
4. Hierarchical
• Merge two closest clusters (iteration 2)
4. Hierarchical
• Update distance matrix (iteration 2)
4. Hierarchical
• Merge two closest clusters/update distance matrix (iteration 3)
4. Hierarchical
• Merge two closest clusters/update distance matrix (iteration 4)
4. Hierarchical
• Final result (meeting termination condition)
4. Hierarchical
• Dendrogram tree representation (vertical axis: lifetime/merge distance; horizontal axis: objects)
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
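The same agglomerative process and its dendrogram can be produced with SciPy (a sketch assuming SciPy and Matplotlib are available; it uses the single-link criterion on the five objects from the earlier worked example, not the A-F data of this slide):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# objects a..e located at 1, 2, 4, 5, 6 on a line (from the earlier example)
points = [[1], [2], [4], [5], [6]]

Z = linkage(points, method='single')             # agglomerative, single-link
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])  # merge heights = linkage distances
plt.ylabel('distance (lifetime)')
plt.show()
```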
4. Hierarchical
Which Distance Measure is Better?
• Each method has both advantages and disadvantages; the choice is application-dependent. Single-link and complete-link are the most common methods.
• Single-link
  • Can find irregular-shaped clusters
  • Sensitive to outliers; suffers from the so-called chaining effect
  • In order to merge two groups, only one pair of points needs to be close, irrespective of all others. Therefore clusters can be too spread out and not compact enough
• Average-link and centroid distance
  • Robust to outliers
  • Tend to break large clusters
4. Hierarchical: AGNES
• AGNES : AGglomerative NESting
• Use single-link method
• Merge nodes that have the least dissimilarity
• Eventually all objects belong to the same cluster
(Figure: three scatter plots on a 10 × 10 grid showing AGNES progressively merging the objects into larger clusters.)
4. Hierarchical: BIRCH
• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
• BIRCH uses the notions of clustering feature to summarize a cluster,
and clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed
and scalability in large or even streaming databases, and also make
it effective for incremental and dynamic clustering of incoming
objects.
4. Hierarchical: BIRCH
BIRCH: Key Components
• Clustering Feature (CF)
• Summary of the statistics for a given cluster: the 0-th, 1st and 2nd
moments of the cluster from the statistical point of view
• A CF entry has sufficient information to calculate the centroid,
radius, diameter and many other distance measures
• Additivity theorem allows us to merge sub-clusters incrementally
• CF-Tree
• height-balanced tree
• two parameters:
• number of entries in each node
• The diameter of all entries in a leaf node
• Leaf nodes are connected via prev and next pointers
4. Hierarchical: BIRCH
Clustering Feature
• Clustering Feature (CF): CF = (N, LS, SS)
  • N: number of data points
  • LS: linear sum of the N data points: LS = X1 + X2 + … + XN
  • SS: square sum of the N data points: SS = X1² + X2² + … + XN²
• Centroid:
  X0 = (X1 + X2 + … + XN) / N = LS / N
• Radius R: average distance from member points to the centroid
  R = ( Σ (Xi − X0)² / N )^(1/2)
    = ( (Σ Xi² − 2 X0 Σ Xi + N X0²) / N )^(1/2)
    = ( (SS − 2 (LS/N) LS + N (LS/N)²) / N )^(1/2)
• Diameter D: average pair-wise distance within a cluster
  D = ( Σi Σj (Xi − Xj)² / (N (N − 1)) )^(1/2)
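A short sketch computing a clustering feature and the statistics derived from it (pure NumPy; the two data values 0.5 and 0.25 are taken from the CF-tree example later in this section, and the radius uses the simplified form R = √(SS/N − (LS/N)²) obtained from the derivation above):

```python
import numpy as np

def clustering_feature(X):
    """CF = (N, LS, SS) for a set of (possibly multi-dimensional) points."""
    X = np.atleast_2d(X)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def centroid_and_radius(N, LS, SS):
    """Centroid X0 = LS/N and radius R = sqrt(SS/N - (LS/N)^2), computed from the CF alone."""
    X0 = LS / N
    R = np.sqrt((SS / N - X0 ** 2).sum())
    return X0, R

N, LS, SS = clustering_feature([[0.5], [0.25]])
print(N, LS, SS)                       # 2, [0.75], [0.3125]  (cf. CF1 in the example below)
print(centroid_and_radius(N, LS, SS))  # centroid 0.375, radius 0.125
```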
4. Hierarchical: BIRCH
Clustering Feature:
• Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2).
• If we merge C1 with C2, the CF for the merged cluster C is
  CF = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• Why CF?
  • Summarized info for a single cluster
  • Summarized info for two clusters
  • Additivity theorem
• Example: for C1 = {(2, 5), (3, 2), (4, 3)}
  CF1 = ⟨3, (2+3+4, 5+2+3), (2² + 3² + 4², 5² + 2² + 3²)⟩ = ⟨3, (9, 10), (29, 38)⟩
  Merging with CF2 = ⟨3, (35, 36), (417, 440)⟩:
  CF3 = CF1 + CF2 = ⟨3+3, (9+35, 10+36), (29+417, 38+440)⟩ = ⟨6, (44, 46), (446, 478)⟩
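The additivity theorem is easy to verify numerically with the CF values from this example (plain Python; each CF is held as a tuple (N, LS, SS) with per-dimension sums):

```python
import numpy as np

def merge_cf(cf1, cf2):
    """Additivity: CF = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, np.add(ls1, ls2), np.add(ss1, ss2)

CF1 = (3, (9, 10), (29, 38))      # cluster {(2,5), (3,2), (4,3)} from the slide
CF2 = (3, (35, 36), (417, 440))
print(merge_cf(CF1, CF2))         # (6, [44, 46], [446, 478])
```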
4. Hierarchical: BIRCH
CF-Tree:
• A CF-tree is a height-balanced tree with two parameters: branching factor B and threshold T.
• Branching factor:
  • B: Each non-leaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, …, B, and childi is a pointer to its i-th child node. CFi is the CF of the sub-cluster represented by childi.
  • L: A leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, …, L. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together.
  • A leaf node also represents a cluster made up of all the sub-clusters represented by its entries.
• Threshold T:
  • The diameter (alternatively, the radius) of all entries in a leaf node is at most T.
• Leaf nodes are connected via prev and next pointers.
4. Hierarchical: BIRCH
Example of a CF-tree with B = 6 (figure): the root holds entries CF1, CF2, CF3, …, CF6; each non-leaf node holds up to B entries such as CF9, CF10, CF11, …, CF13.
• Running numerical example after the first four data values:
  • Node 1: CF1 (n = 2, LS = 0.75, SS = 0.313) containing x1 = 0.5 and x2 = 0.25; CF2 (n = 1, LS = 0.0, SS = 0) containing x3 = 0
  • Node 2: CF3 (n = 1, LS = 0.65, SS = 0.423) containing x4 = 0.65
4. Hierarchical: BIRCH
Example:
• The fifth data value x5 = 1 is entered.
• BIRCH compares x5 = 1 with the location of CF12 and CF3.
• We have x̄(CF12) = 0.75 / 3 = 0.25 and x̄(CF3) = 0.65 / 1 = 0.65. The data point x5 = 1 is thus closer to CF3 than to CF12.
• BIRCH passes x5 to CF3. The radius of CF3 would now increase to R = 0.175 > T = 0.15, so x5 cannot be assigned to CF3.
• Instead, a new leaf in leaf node Leaf4 is initialized, with CF, CF4, containing x5 only. The summary statistics for CF34 are updated.
4. Hierarchical: BIRCH
Example (parameters: T = 0.15, B = 2, L = 2; data: x1 = 0.5, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1, x6 = 1.4, x7 = 1.1)
CF-tree after entering x5:
• Root-Node 0: CF12 (n = 3, LS = 0.75, SS = 0.313), CF34 (n = 2, LS = 1.65, SS = 1.423)
• Node 1: CF1 (n = 2, LS = 0.75, SS = 0.313) containing x1 = 0.5 and x2 = 0.25; CF2 (n = 1, LS = 0.0, SS = 0) containing x3 = 0
• Node 2: CF3 (n = 1, LS = 0.65, SS = 0.423) containing x4 = 0.65; CF4 (n = 1, LS = 1, SS = 1) containing x5 = 1
4. Hierarchical: BIRCH
Example:
• The sixth data value 𝑥6 = 1.4 is entered. At the root node, BIRCH compares 𝑥6 =
1.4 with the location of 𝐶𝐹12 and 𝐶𝐹34 .
• We have 𝑥𝐶𝐹12 = 0.75 ∕ 3 = 0.25 and 𝑥𝐶𝐹34 = 1.65 ∕ 2 = 0.825. The data point
𝑥6 = 1.4 is thus closer to 𝐶𝐹34 , and BIRCH passes 𝑥6 to 𝐶𝐹34 .
• The record descends to Node2, and BIRCH compares 𝑥6 = 1.4 with the location
of 𝐶𝐹3 and 𝐶𝐹4 .
• We have 𝑥𝐶𝐹3 = 0.65 and 𝑥𝐶𝐹4 = 1. The data point 𝑥6 = 1.4 is thus closer to
𝐶𝐹4 than to 𝐶𝐹3 . BIRCH tentatively passes 𝑥6 to 𝐶𝐹4 .
• The radius of 𝐶𝐹4 now increases to 𝑅 = 0.2 > 𝑇 = 0.15. The Threshold value
T=0.15 is exceeded, so 𝑥6 is not assigned to 𝐶𝐹4 . But the branching factor B=2
means that we may have at most two leaf nodes branching off of any non-leaf
node. Therefore, we will need a new set of non-leaf nodes, Node2.1 and
Node2.2, branching off from Node2.
• Node2.1 contains 𝐶𝐹𝟑 and 𝐶𝐹4 , while Node2.2 contains the desired new 𝐶𝐹𝟓 and
the new leaf node Leaf5 as its only child, containing only the information from 𝑥6 .
4. Hierarchical: BIRCH
Example: CF-tree after entering x6 (T = 0.15, B = 2, L = 2)
• Root-Node 0: CF12 (n = 3, LS = 0.75, SS = 0.313), CF345 (n = 3, LS = 3.05, SS = 3.393)
• Node 1: CF1 (n = 2, LS = 0.75, SS = 0.313), CF2 (n = 1, LS = 0.0, SS = 0)
• Node 2: CF34 (n = 2, LS = 1.65, SS = 1.423) containing x4 = 0.65 and x5 = 1; CF5 (n = 1, LS = 1.4, SS = 1.96) containing x6 = 1.4
4. Hierarchical: BIRCH
Example: CF-tree after entering x7 = 1.1 (T = 0.15, B = 2, L = 2)
• Root-Node 0: CF12 (n = 3, LS = 0.75, SS = 0.313), CF345 (n = 4, LS = 4.15, SS = 4.212)
• Node 1: CF1 (n = 2, LS = 0.75, SS = 0.313), CF2 (n = 1, LS = 0.0, SS = 0)
• Node 2: CF34 (n = 3, LS = 2.75, SS = 2.252) containing x4 = 0.65, x5 = 1, and x7 = 1.1; CF5 (n = 1, LS = 1.4, SS = 1.96) containing x6 = 1.4
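For comparison, the same seven values can be fed to scikit-learn's BIRCH implementation (assumed available). Its CF-tree bookkeeping differs in detail from the hand-worked example above, so this is only an illustrative sketch:

```python
import numpy as np
from sklearn.cluster import Birch

# the seven 1-D data values from the example
X = np.array([0.5, 0.25, 0.0, 0.65, 1.0, 1.4, 1.1]).reshape(-1, 1)

# threshold plays the role of T; n_clusters=None returns the raw CF-tree subclusters
birch = Birch(threshold=0.15, branching_factor=2, n_clusters=None).fit(X)

print(birch.subcluster_centers_.ravel())  # centers of the leaf subclusters
print(birch.labels_)                      # subcluster assignment of each data value
```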
Summary
• Data sets are made up of data objects. A data object represents an
entity. Data objects are described by attributes. Attributes can be
nominal, binary, ordinal, or numeric.
• The values of a nominal (or categorical) attribute are symbols or names
of things, where each value represents some kind of category, code, or
state.
• Binary attributes are nominal attributes with only two possible states
(such as 1 and 0 or true and false). If the two states are equally
important, the attribute is symmetric; otherwise it is asymmetric.
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Summary
• A numeric attribute is quantitative (i.e., it is a measurable quantity)
represented in integer or real values. Numeric attribute types can be
interval-scaled or ratio-scaled. The values of an interval-scaled attribute
are measured in fixed and equal units. Ratio-scaled attributes are
numeric attributes with an inherent zero-point. Measurements are
ratio-scaled in that we can speak of values as being an order of
magnitude larger than the unit of measurement.
• Measures of object similarity and dissimilarity are used in data mining
applications such as clustering, outlier analysis, and nearest-neighbor
classification. Such measures of proximity can be computed for each
attribute type, or for combinations of such attributes. Examples include the Jaccard coefficient for asymmetric binary attributes and the Euclidean, Manhattan, and Minkowski distances. For applications involving sparse numeric
data vectors, such as term-frequency vectors, the cosine measure is often used.
References
• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Pearson India.