Clustering

Partitioning, Hierarchical, Density-Based & Grid-Based Methods
Clustering
• Clustering is the process of partitioning a set of data objects (or observations) into subsets.
• Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.
• Clusters: objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.
• Dissimilarities and similarities are assessed based on the attribute values describing the objects.
• Clustering is known as unsupervised learning because the class label information is not present. For this reason, clustering is a form of learning by observation, rather than learning by examples.
Clustering
• Cluster analysis has been widely used in many applications, such as:
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an earth observation database
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Climate: understanding the Earth's climate; finding patterns in atmospheric and ocean data
• Economic science: market research
Major Clustering Approaches
• There are many clustering algorithms in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap, so that a method may have features from several categories.
• The major fundamental clustering methods can be classified into the following four categories:
1. Partitioning method:
• Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
• Typical methods: k-means, k-medoids
2. Hierarchical method:
• Create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Major Clustering Approaches
3. Density-based method:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DENCLUE
4. Grid-based method:
• Based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE
1. Partitioning Methods
• The simplest and most fundamental version of cluster analysis is
partitioning, which organizes the objects of a set into several exclusive
groups or clusters.
• Formally, given a dataset D, of n objects, and k, the number of clusters
to form, a partitioning algorithm organizes the objects into k partitions
(k ≤ n).
• The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects
within a cluster are “similar” to one another and “dissimilar” to objects
in other clusters in terms of the data set attributes.
1. Partitioning Methods: k-Means
• A centroid-based partitioning technique uses the centroid of a cluster
Ci, to represent that cluster.
• Conceptually, the centroid of a cluster is its center point.
• The centroid can be defined in various ways such as by the mean or
medoid of the objects (or points) assigned to the cluster.
• The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y.
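The centroid-based scheme just described can be sketched in a few lines of code. The following is a minimal illustration, not the algorithm listing from the slides; NumPy is assumed, and the random seeding and the four-point data set at the end are illustrative choices.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initial seeds: k points chosen at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the objects assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # convergence: the centroids (and hence the assignments) no longer change
        centroids = new_centroids
    return labels, centroids

# The medicine example from the next slides (A, B, C, D with weight and pH index)
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)  # e.g. A, B in one cluster and C, D in the other,
                          # with centroids (1.5, 1) and (4.5, 3.5)
```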
1. Partitioning Methods: k-Means
K-means - Example
• Problem:
• Suppose we have 4 types of medicines and each has two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.

  Medicine   Weight   pH-Index
  A          1        1
  B          2        1
  C          4        3
  D          5        4
1. Partitioning Methods: k-Means
K-means - Example
• Step 1: Use initial seed points for partitioning:
  c1 = A = (1, 1), c2 = B = (2, 1)
• Example distance computation for object D = (5, 4):
  d(D, c1) = sqrt((5 − 1)² + (4 − 1)²) = 5
  d(D, c2) = sqrt((5 − 2)² + (4 − 1)²) ≈ 4.24
• Assign each object to the cluster with the nearest seed point.
1. Partitioning Methods: k-Means
K-means - Example
• Step 2: Compute new centroids of the current partition.
  Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:
  c1 = (1, 1)
  c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
1. Partitioning Methods: k-Means
K-means - Example
• Step 2 (continued): Renew the membership based on the new centroids:
  compute the distance of all objects to the new centroids and assign each object to the nearest one.
1. Partitioning Methods: k-Means
K-means - Example
• Step 3: Repeat the first two steps until convergence.
  Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:
  c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
  c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
1. Partitioning Methods: k-Means
K-means - Example
• Step 3 (continued): Compute the distance of all objects to the new centroids.
• Stop: there are no new assignments, i.e., the membership in each cluster no longer changes.
1. Partitioning Methods: k-Means
Strengths of k-means
• Simple: easy to understand and to implement
• Efficient: Time complexity: O(tkn), where n is the number of data
points, k is the number of clusters, and t is the number of iterations.
1. Partitioning Methods: k-Means
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
• The user needs to specify k.
• The algorithm is sensitive to outliers
• Outliers are data points that are very far away from other data
points.
• Outliers could be errors in the data recording or some special data
points with very different values.
1. Partitioning Methods: k-Means
Weaknesses of k-means: Problems with outliers
1. Partitioning Methods: k-Means
Weaknesses of k-means: To deal with outliers
• One method is to remove some data points in the clustering process
that are much further away from the centroids than other data points.
• To be safe, we may want to monitor these possible outliers over a
few iterations and then decide to remove them.
• Another method is to perform random sampling. Since in sampling
we only choose a small subset of the data points, the chance of
selecting an outlier is very small.
• Assign the rest of the data points to the clusters by distance or
similarity comparison, or classification
1. Partitioning Methods: k-Means
Weaknesses of k-means
• The algorithm is sensitive to initial seeds.
1. Partitioning Methods: k-Means
Weaknesses of k-means
• If we use different seeds, we may get good results.
• There are some methods to help choose good seeds.
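One common family of seeding heuristics (not named on the slide) is k-means++-style seeding: pick the first seed at random, then prefer points that are far from the seeds already chosen. A minimal sketch, assuming NumPy:

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """k-means++-style seeding: spread the initial seeds out over the data."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]            # first seed: chosen uniformly at random
    for _ in range(k - 1):
        # squared distance of every point to its closest seed so far
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        probs = d2 / d2.sum()                    # far-away points are more likely to be picked
        seeds.append(X[rng.choice(len(X), p=probs)])
    return np.array(seeds)
```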
1. Partitioning Methods: k-Medoids
The Partitioning Around Medoids (PAM) algorithm is a popular realization of k-medoids clustering.
1. Partitioning Methods: k-Medoids
K-medoids Example
• Data set (assume k = 2; select X5 and X9 as the initial medoids):

  Point:  X1   X2   X3   X4   X5   X6   X7   X8   X9   X10
  x:      2    3    3    5    6    6    7    7    8    7
  y:      6    4    8    7    2    4    3    4    5    6

• Distances from each non-medoid point to the medoids:

  Point:   X1   X2   X3   X4   X6   X7   X8   X10
  to X5:   8    5    9    7    2    2    3    5
  to X9:   7    6    8    6    3    3    2    2

• Current clustering: {X2, X5, X6, X7}, {X1, X3, X4, X8, X9, X10}
1. Partitioning Methods: k-Medoids
K-medoids Example
• Now let us choose some other point to be a medoid instead of X5 = (6, 2). Let us randomly choose X1 = (2, 6).
• The new medoid set is (2, 6) and (8, 5). Repeating the same task as earlier (replace X5 by X1):

  Point:    X1   X2   X3   X4   X5   X6   X7   X8   X9   X10
  Before:   7    5    8    6    0    2    2    2    0    2
  to X1:    0    3    3    4    8    6    8    7    0    5
  to X9:    0    6    8    6    5    3    3    2    0    2
  Change:   -7   -2   -5   -2   5    1    1    0    0    0

• Total change = −9 < 0, so the swap improves the clustering and X1 replaces X5 as a medoid.
• Current clustering: {X1, X2, X3, X4}, {X5, X6, X7, X8, X9, X10}
1. Partitioning Methods: k-Medoids
K-medoids Properties
• The complexity of each iteration is O(k(n − k)²)
• For large values of n and k, such computation becomes very costly
• Advantages
• The k-medoids method is more robust than k-means in the presence of noise and outliers
• Disadvantages
• k-medoids is more costly than the k-means method
• Like k-means, k-medoids requires the user to specify k
• It does not scale well for large data sets
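The swap test used in the example above (for each object, compare its distance to the closest medoid before and after replacing one medoid by a candidate, and accept the swap when the total change is negative) can be sketched as follows. Manhattan distance is assumed here because it reproduces the distance tables in the example; the data array repeats the ten points X1-X10.

```python
import numpy as np

def manhattan(a, b):
    return np.abs(a - b).sum()

def swap_cost(points, medoids, out_medoid, candidate):
    """Total change in clustering cost if medoid `out_medoid` is replaced by `candidate` (indices)."""
    new_medoids = [m for m in medoids if m != out_medoid] + [candidate]
    change = 0.0
    for p in points:
        before = min(manhattan(p, points[m]) for m in medoids)
        after = min(manhattan(p, points[m]) for m in new_medoids)
        change += after - before
    return change

# The ten points X1..X10 from the example (indices 0..9)
pts = np.array([[2, 6], [3, 4], [3, 8], [5, 7], [6, 2],
                [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]], dtype=float)

# Try replacing medoid X5 (index 4) by X1 (index 0), keeping X9 (index 8)
delta = swap_cost(pts, medoids=[4, 8], out_medoid=4, candidate=0)
print(delta)  # a negative total change means the swap improves the clustering and is accepted
```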
2. Density-Based Methods
• Most partitioning methods cluster objects based on the distance
between objects. Such methods can find only spherical-shaped
clusters and encounter difficulty in discovering clusters of arbitrary
shapes.
• To find clusters of arbitrary shape, alternatively, we can model
clusters as dense regions in the data space, separated by sparse
regions.
2. Density-Based Methods
• The general idea is to continue growing a given cluster as long as the
density (number of objects or data points) in the “neighborhood”
exceeds some threshold.
• For example, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum
number of points. Such a method can be used to filter out noise or
outliers and discover clusters of arbitrary shape.
2. Density-Based Methods
• Basic density-based clustering methods:
• DBSCAN, OPTICS and DENCLUE.

• Major features:
✓Discover clusters of arbitrary shape
✓Handle noise
✓One scan
✓Need density parameters as termination condition
2. Density-Based Methods:DBSCAN
• DBSCAN (Density-Based Spatial Clustering of Applications with
Noise)
• The density of an object o can be measured by the number of
objects close to o.
• DBSCAN finds core objects, that is, objects that have dense
neighborhoods. It connects core objects and their neighborhoods to
form dense regions as clusters.
2. Density-Based Methods:DBSCAN
• Two parameters:
• Eps (ϵ): maximum radius of the neighbourhood
• MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
• p belongs to N_Eps(q)
• core point condition: |N_Eps(q)| ≥ MinPts
• (Figure: q is a core point and p lies in its Eps-neighbourhood; Eps = 1 cm, MinPts = 5.)
2. Density-Based Methods:DBSCAN
• Density-Reachable and Density-Connected
• Density-reachable:
• A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• Density-connected:
• A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
2. Density-Based Methods:DBSCAN
• Given ϵ represents the radius of the circles, and MinPts = 3.
• Of the labeled points, m,p,o,r are core objects because each is in an
ϵ -neighborhood containing at least three points. Object q is directly
density-reachable from m. Object m is directly density-reachable
from p and vice versa.
• Object q is (indirectly) density-reachable from p because q is directly
density-reachable from m and m is directly density-reachable from p.
However, p is not density-reachable from q because q is not a core
object.
2. Density-Based Methods:DBSCAN
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
• (Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.)
2. Density-Based Methods:DBSCAN
• Due to the fixed neighborhood size parameterized by ϵ, the density
of a neighborhood can be measured simply by the number of objects
in the neighborhood.
• To determine whether a neighborhood is dense or not, DBSCAN uses
another user-specified parameter, MinPts, which specifies the
density threshold of dense regions.
• An object is a core object if the ϵ-neighborhood of the object
contains at least MinPts objects.
• Core objects are the pillars of dense regions.
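The core-object idea above translates into a compact procedure: find the ϵ-neighborhood of every point, mark points whose neighborhood contains at least MinPts members as core objects, and grow clusters outward from unassigned core objects. A minimal sketch (Euclidean distance, NumPy assumed), not the original DBSCAN pseudocode:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise/outliers."""
    n = len(X)
    # eps-neighborhood of every point (the point itself is included)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]   # core point condition

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from this unassigned core object
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:                # unvisited or previously noise: join the cluster
                labels[j] = cluster_id
                if core[j]:                    # only core points expand the cluster further
                    frontier.extend(neighbors[j])
        cluster_id += 1
    return labels
```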
2. Density-Based Methods:DBSCAN
• Advantages
➢Clusters can have arbitrary shape and size
➢Number of clusters is determined automatically
➢Can separate clusters from surrounding noise
• Disadvantages
➢Input parameters may be difficult to determine
➢In some situations very sensitive to input parameter setting
2. Density-Based Methods:OPTICS
• DBSCAN required input parameters ϵ and MinPts, however, it is not
easy for the user to set these parameters.
• Such parameter settings are usually set empirically and are difficult to determine, especially for real-world, high-dimensional data sets.
• Most algorithms are sensitive to these parameter values: Slightly
different settings may lead to very different clusterings of the data.
• To overcome the difficulty in using one set of global parameters in
clustering analysis, a cluster analysis method called OPTICS was
proposed.
• OPTICS does not explicitly produce a dataset clustering. Instead, it
outputs a cluster ordering.
2. Density-Based Methods:OPTICS
• Objects in a denser cluster are listed closer to each other in the
cluster ordering.
• This ordering is equivalent to density-based clustering obtained from
a wide range of parameter settings.
• Thus, OPTICS does not require the user to provide a specific density
threshold.
• To construct the different clusterings simultaneously, the objects are
processed in a specific order.
• This order selects an object that is density-reachable with respect to
the lowest ϵ value so that clusters with higher density (lower ϵ) will
be finished first.
2. Density-Based Methods:OPTICS
OPTICS needs two important pieces of information per object:
1. The core-distance of an object p is the smallest value ϵ′ such that the ϵ′-neighborhood of p has at least MinPts objects.
• That is, ϵ′ is the minimum distance threshold that makes p a core object. If p is not a core object with respect to ϵ and MinPts, the core-distance of p is undefined.
2. The reachability-distance to object p from q is the minimum radius value that makes p density-reachable from q.
• According to the definition of density-reachability, q has to be a core object and p must be in the neighborhood of q.
• Therefore, the reachability-distance to p from q is max{core-distance(q), dist(p, q)}. If q is not a core object with respect to ϵ and MinPts, the reachability-distance to p from q is undefined.
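Both quantities follow directly from these definitions. A small sketch (Euclidean distance, NumPy assumed, with None standing for "undefined"):

```python
import numpy as np

def core_distance(X, p, eps, min_pts):
    """Smallest radius that makes point p a core object, or None if p is not core w.r.t. eps."""
    d = np.sort(np.linalg.norm(X - X[p], axis=1))   # distances from p, ascending (d[0] = 0)
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None                                  # p is not a core object
    return d[min_pts - 1]                            # distance to the MinPts-th nearest object

def reachability_distance(X, p, q, eps, min_pts):
    """Minimum radius that makes p density-reachable from q: max(core-dist(q), dist(p, q))."""
    cd = core_distance(X, q, eps, min_pts)
    if cd is None:
        return None                                  # undefined if q is not a core object
    return max(cd, np.linalg.norm(X[p] - X[q]))
```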
3. Grid-Based Methods
• Grid-based clustering method takes a space-driven approach by
partitioning the embedding space into cells independent of the
distribution of the input objects.
• The grid-based clustering approach uses a multiresolution grid data
structure.
• It quantizes the object space into a finite number of cells that form a
grid structure on which all of the operations for clustering are
performed.
• The main advantage of the approach is its fast processing time,
which is typically independent of the number of data objects.
3. Grid-Based Methods: CLIQUE
• CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces.
• CLIQUE partitions each dimension into nonoverlapping intervals,
thereby partitioning the entire embedding space of the data objects
into cells.
• It uses a density threshold to identify dense cells and sparse ones. A
cell is dense if the number of objects mapped to it exceeds the
density threshold.
3. Grid-Based Methods: CLIQUE
• CLIQUE is a density-based and grid-based subspace clustering
algorithm
• Grid-based: It discretizes the data space through a grid and estimates
the density by counting the number of points in a grid cell
• Density-based: A cluster is a maximal set of connected dense units in
a subspace
• A unit is dense if the fraction of total data points contained in the
unit exceeds the input model parameter.
• Subspace clustering: A subspace cluster is a set of neighboring dense
cells in an arbitrary subspace. It also discovers some minimal
descriptions of the clusters
• It automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space, using the Apriori principle
3. Grid-Based Methods: CLIQUE
• Bottom-up approach
3. Grid-Based Methods: CLIQUE
• Bottom-up approach
• Apriori principle: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space
3. Grid-Based Methods: CLIQUE
Major Steps of the CLIQUE Algorithm (a sketch of the first steps follows this list):
• Identify subspaces that contain clusters
• Partition the data space and find the number of points that lie inside each cell of the partition
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
• Determine dense units in all subspaces of interest
• Determine connected dense units in all subspaces of interest
• Generate minimal descriptions for the clusters
• Determine maximal regions that cover a cluster of connected dense units for each cluster
• Determine minimal cover for each cluster
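As a rough illustration of the first steps (count the points in each cell of a uniform one-dimensional grid, keep the dense cells, and use the Apriori principle to generate candidate two-dimensional units), here is a hedged sketch; the grid resolution and the density threshold τ are free parameters, not values prescribed by CLIQUE itself.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units_1d(X, n_intervals, tau):
    """Dense 1-D units: (dimension, interval index) cells holding more than tau * n points."""
    n, d = X.shape
    dense = set()
    for dim in range(d):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        # map each value to one of n_intervals equal-width intervals
        idx = np.minimum(((X[:, dim] - lo) / (hi - lo + 1e-12) * n_intervals).astype(int),
                         n_intervals - 1)
        counts = Counter(idx)
        dense |= {(dim, i) for i, c in counts.items() if c > tau * n}
    return dense

def candidate_2d_units(dense_1d):
    """Apriori step: a 2-D unit can only be dense if both of its 1-D projections are dense."""
    return {((d1, i1), (d2, i2))
            for (d1, i1), (d2, i2) in combinations(sorted(dense_1d), 2) if d1 != d2}
```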
3. Grid-Based Methods: CLIQUE
Strengths
• Automatically finds subspaces of the highest dimensionality as long as high-density clusters exist in those subspaces
• Insensitive to the order of records in input and does not presume some canonical data distribution
• Scales linearly with the size of input and has good scalability as the number of dimensions in the data increases: O(C^k + mk)
• Simple method and interpretability of results
Weaknesses
• As in all grid-based clustering approaches, the quality of the results crucially depends on the appropriate choice of the number and width of the partitions and grid cells
4. Hierarchical
• A hierarchical clustering method works by grouping data objects
into a hierarchy or “tree” of clusters.
4. Hierarchical
Agglomerative versus Divisive Hierarchical Clustering:
• An agglomerative hierarchical clustering method uses a bottom-up
strategy.
• It typically starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until all the
objects are in a single cluster or certain termination conditions are
satisfied.

• A divisive hierarchical clustering method employs a top-down


strategy.
• It starts by placing all objects in one cluster, which is the hierarchy’s
root.
4. Hierarchical
Agglomerative versus Divisive Hierarchical Clustering:
• Agglomerative:
  Initialization: each object is a cluster
  Iteration: merge the two clusters that are most similar to each other
  Until: all objects are merged into a single cluster
• Divisive:
  Initialization: all objects stay in one cluster
  Iteration: select a cluster and split it into two sub-clusters
  Until: each leaf cluster contains only one object
4. Hierarchical
• Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}
• Each node on the tree is a cluster; each leaf node is a singleton cluster
4. Hierarchical
How to Merge Clusters?
• How to measure the distance between clusters?
• Four common measures: single-link, complete-link, average-link, and centroid distance.
• Hint: the distance between clusters is usually defined on the basis of the distance between objects.
4. Hierarchical
How to Define Inter-Cluster Distance
• Single-link:
  d_min(Ci, Cj) = min{ d(p, q) : p ∈ Ci, q ∈ Cj }
• The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.
4. Hierarchical
How to Define Inter-Cluster Distance
• Complete-link:
  d_max(Ci, Cj) = max{ d(p, q) : p ∈ Ci, q ∈ Cj }
• The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters.
4. Hierarchical
How to Define Inter-Cluster Distance
• Average-link:
  d_avg(Ci, Cj) = avg{ d(p, q) : p ∈ Ci, q ∈ Cj }
• The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters.
4. Hierarchical
How to Define Inter-Cluster Distance
• Centroid distance:
  d_mean(Ci, Cj) = d(mi, mj), where mi and mj are the means of Ci and Cj, respectively
• The distance between two clusters is represented by the distance between the means of the clusters.
4. Hierarchical
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature, assume that there are two clusters: C1 = {a, b} and C2 = {c, d, e}.

  Object:   a   b   c   d   e
  Feature:  1   2   4   5   6

1. Calculate the distance matrix:

       a   b   c   d   e
  a    0   1   3   4   5
  b    1   0   2   3   4
  c    3   2   0   1   2
  d    4   3   1   0   1
  e    5   4   2   1   0

2. Calculate the three cluster distances between C1 and C2:

  Single link:
  dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2

  Complete link:
  dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5

  Average:
  dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
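The three cluster distances computed above can be verified in a few lines; this sketch assumes SciPy is available and treats the five values as one-dimensional points.

```python
import numpy as np
from scipy.spatial.distance import cdist

a, b, c, d, e = 1, 2, 4, 5, 6
C1 = np.array([[a], [b]], dtype=float)          # cluster C1 = {a, b}
C2 = np.array([[c], [d], [e]], dtype=float)     # cluster C2 = {c, d, e}

pairwise = cdist(C1, C2)                        # all distances between the two clusters
print(pairwise.min())    # single link   -> 2.0
print(pairwise.max())    # complete link -> 5.0
print(pairwise.mean())   # average link  -> 3.5
```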
4. Hierarchical
Agglomerative Algorithm
• The agglomerative algorithm is carried out in three steps (see the sketch after this list):
1) Convert all object features into a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters):
   ▪ Merge the two closest clusters
   ▪ Update the distance matrix
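These three steps map directly onto standard library routines. The sketch below assumes SciPy; the six 2-D points are an assumed data set, chosen because their single-link merge distances (0.50, 0.71, 1.00, 1.41, 2.50) match the dendrogram discussed on the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Assumed coordinates for objects A..F (they reproduce the merge distances below)
X = np.array([[1.0, 1.0],   # A
              [1.5, 1.5],   # B
              [5.0, 5.0],   # C
              [3.0, 4.0],   # D
              [4.0, 4.0],   # E
              [3.0, 3.5]])  # F

dist = pdist(X, metric='euclidean')        # step 1: distance matrix (condensed form)
Z = linkage(dist, method='single')         # steps 2-3: repeatedly merge the two closest clusters
print(Z[:, 2])                             # merge distances: about 0.50, 0.71, 1.00, 1.41, 2.50
print(fcluster(Z, t=2, criterion='maxclust'))  # cut the tree into two clusters: {A,B} and {C,D,E,F}
```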
4. Hierarchical
• Problem: clustering analysis with the agglomerative algorithm
• (Figure: the data matrix is converted, using Euclidean distance, into a distance matrix.)
4. Hierarchical
• Merge two closest clusters (iteration 1)
4. Hierarchical
• Update distance matrix (iteration 1)
4. Hierarchical
• Merge two closest clusters (iteration 2)
4. Hierarchical
• Update distance matrix (iteration 2)
4. Hierarchical
• Merge two closest clusters/update distance matrix (iteration 3)
4. Hierarchical
• Merge two closest clusters/update distance matrix (iteration 4)
4. Hierarchical
• Final result (meeting termination condition)
4. Hierarchical
• Dendrogram tree representation (vertical axis: lifetime/merge distance; horizontal axis: objects)
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge clusters A and B into (A, B) at distance 0.71
4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects, thus concluding the computation
4. Hierarchical
• Dendrogram tree representation
• For a dendrogram tree, its horizontal axis indexes all objects in a given data set, while its vertical axis expresses the lifetime of all possible cluster formations.
• The lifetime of a cluster (individual cluster) in the dendrogram is defined as the distance interval from the moment that the cluster is created to the moment that it disappears by merging with other clusters.
4. Hierarchical
Which Distance Measure is Better?
• Each method has both advantages and disadvantages; application-
dependent, single-link and complete-link are the most common
methods
• Single-link
• Can find irregular-shaped clusters
• Sensitive to outliers; suffers from the so-called chaining effect
• In order to merge two groups, only one pair of points needs to be close, irrespective of all others. Therefore clusters can be too spread out, and not compact enough
• Average-link, and Centroid distance
• Robust to outliers
• Tend to break large clusters
4. Hierarchical: AGNES
• AGNES : AGglomerative NESting
• Use single-link method
• Merge nodes that have the least dissimilarity
• Eventually all objects belong to the same cluster

• (Figure: three scatter plots showing AGNES progressively merging nearby points into larger clusters.)
4. Hierarchical: BIRCH
• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
• BIRCH uses the notions of clustering feature to summarize a cluster,
and clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed
and scalability in large or even streaming databases, and also make
it effective for incremental and dynamic clustering of incoming
objects.
4. Hierarchical: BIRCH
BIRCH: Key Components
• Clustering Feature (CF)
• Summary of the statistics for a given cluster: the 0-th, 1st and 2nd
moments of the cluster from the statistical point of view
• A CF entry has sufficient information to calculate the centroid,
radius, diameter and many other distance measures
• The additivity theorem allows us to merge sub-clusters incrementally
• CF-Tree
• height-balanced tree
• two parameters:
• number of entries in each node
• The diameter of all entries in a leaf node
• Leaf nodes are connected via prev and next pointers
4. Hierarchical: BIRCH
Clustering Feature
• Clustering Feature (CF): CF = (N, LS, SS)
• N: number of data points
• LS: linear sum of the N points, LS = Σ_{i=1..N} Xi
• SS: square sum of the N points, SS = Σ_{i=1..N} Xi²
4. Hierarchical: BIRCH
Distance Measures:
Given a cluster with N data points X1, …, XN:
• Centroid:
  X0 = (Σ_{i=1..N} Xi) / N = LS / N
• Radius (average distance from member points to the centroid):
  R = ( Σ_{i=1..N} (Xi − X0)² / N )^{1/2}
    = ( (SS − 2·X0·LS + N·X0²) / N )^{1/2}
• Diameter (average pair-wise distance within a cluster):
  D = ( Σ_{i=1..N} Σ_{j=1..N} (Xi − Xj)² / (N(N − 1)) )^{1/2}
• All of these quantities can be computed directly from the CF entry (N, LS, SS).
4. Hierarchical: BIRCH
Clustering Feature:
• Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2).
• If we merge C1 with C2, the CF for the merged cluster C is
  CF = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• Example (2-D points):
  CF1 = ⟨3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)⟩ = ⟨3, (9, 10), (29, 38)⟩
  CF3 = CF1 + CF2 = ⟨3+3, (9+35, 10+36), (29+417, 38+440)⟩ = ⟨6, (44, 46), (446, 478)⟩
• Why CF?
• Summarized info for a single cluster
• Summarized info for two clusters
• Additivity theorem
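A clustering feature and the additivity theorem are easy to capture in code. This is a minimal sketch for scalar (one-dimensional) values, matching the worked example that follows; the radius formula is the CF-based one from the earlier slide.

```python
from dataclasses import dataclass
import math

@dataclass
class CF:
    n: int          # number of points
    ls: float       # linear sum of the points
    ss: float       # square sum of the points

    def __add__(self, other):        # additivity: CF1 + CF2 summarizes the merged cluster
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R = sqrt((SS - 2*x0*LS + N*x0^2) / N), with x0 = LS / N
        x0 = self.centroid()
        return math.sqrt(max(self.ss - 2 * x0 * self.ls + self.n * x0 ** 2, 0.0) / self.n)

cf1 = CF(1, 0.5, 0.25) + CF(1, 0.25, 0.0625)   # Leaf1 after x1 = 0.5 and x2 = 0.25
print(cf1)                                      # CF(n=2, ls=0.75, ss=0.3125)
print(round(cf1.radius(), 3))                   # about 0.125 (the slides report 0.126)
```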
4. Hierarchical: BIRCH
CF-Tree:
• A CF-tree is a height-balanced tree with two parameters: branching factor B and threshold T.
• Branching factor:
• B: each non-leaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, …, B, and childi is a pointer to its i-th child node. CFi is the CF of the sub-cluster represented by childi.
• L: a leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, …, L. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together.
• A leaf node also represents a cluster made up of all the sub-clusters represented by its entries.
• Threshold T:
• The diameter (alternatively, the radius) of all entries in a leaf node is at most T.
• Leaf nodes are connected via prev and next pointers.
4. Hierarchical: BIRCH
Example of a CF-tree (B = 6, L = 5):
• Root node with entries [CF1, child1], [CF2, child2], [CF3, child3], …, [CF6, child6]
• A non-leaf node with entries [CF9, child1], [CF10, child2], [CF11, child3], …, [CF13, child5]
• Leaf nodes (e.g., one holding CF90, CF91, …, CF94 and another holding CF95, CF96, …, CF98), chained together via prev and next pointers
4. Hierarchical: BIRCH
CF Tree Insertion:
• Given entry “Ent”, it proceeds as below:
• Identifying the appropriate leaf
• recursively descending the CF tree and choosing the closest child node
according to a chosen distance metric
• Modifying the leaf:
• When it reaches a leaf node, it finds the closest leaf entry, say Li, and then tests whether Li can absorb "Ent" without violating the threshold. If so, the CF vector for Li is updated to reflect this.
• If not, a new entry for "Ent" is added to the leaf. If there is no room, split the node.
• Modifying the path:
• update CF information up the path.
4. Hierarchical: BIRCH
Example (parameters: T = 0.15, L = 2, B = 2; data: x1 = 0.5, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1, x6 = 1.4, x7 = 1.1):
• Enter the first data value x1 = 0.5.
• The root node is initialized with the CF values of the first data value.
• A new leaf, Leaf1, is created, and BIRCH assigns the first record x1 to Leaf1.
• Leaf1 contains only one record, hence its radius is zero and thus less than T = 0.15.
• Tree: Root node with CF1: n = 1, LS = 0.5, SS = 0.25; Leaf1: R = 0, {x1 = 0.5}.
4. Hierarchical: BIRCH
Example:
• The second data value x2 = 0.25 is entered.
• BIRCH tentatively passes x2 = 0.25 to Leaf1.
• The radius of Leaf1 is now R ≈ 0.126 < T = 0.15, so x2 is assigned to Leaf1.
• Tree: Root node with CF1: n = 2, LS = 0.75, SS = 0.313; Leaf1: R ≈ 0.126, {x1 = 0.5, x2 = 0.25}.
4. Hierarchical: BIRCH
Example:
• The third data value x3 = 0 is entered.
• BIRCH tentatively passes x3 = 0 to Leaf1.
• However, the radius of Leaf1 would now increase to R ≈ 0.205 > T = 0.15. The threshold T = 0.15 is exceeded, so x3 is not assigned to Leaf1.
• Instead, a new leaf, Leaf2, is initialized, containing x3 only. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criterion.
• Tree: Root node with CF1: n = 2, LS = 0.75, SS = 0.313 and CF2: n = 1, LS = 0.0, SS = 0; Leaf1: R ≈ 0.126, {x1 = 0.5, x2 = 0.25}; Leaf2: R = 0, {x3 = 0}.
4. Hierarchical: BIRCH
Example:
• The fourth data value x4 = 0.65 is entered. BIRCH compares x4 to the locations of CF1 and CF2.
• The location is measured by x̄ = LS/n. We have x̄_CF1 = 0.75/2 = 0.375 and x̄_CF2 = 0/1 = 0.
• The data point x4 = 0.65 is thus closer to CF1 than to CF2, so BIRCH tentatively passes x4 to CF1.
• The radius of CF1 would now increase to R ≈ 0.166 > T = 0.15. The threshold T = 0.15 is exceeded, so x4 is not assigned to CF1. Instead, we would like to initialize a new leaf.
• However, B = 2 means that we cannot have three leaves branching off the root node. We must therefore split the root node into (i) Node1, which has as its children Leaf1 and Leaf2, and (ii) Node2, whose only leaf, Leaf3, contains only x4.
4. Hierarchical: BIRCH
Example:
• Tree after x4 (T = 0.15, L = 2, B = 2):
  Root node: CF12: n = 3, LS = 0.75, SS = 0.313 (→ Node1); CF3: n = 1, LS = 0.65, SS = 0.423 (→ Node2)
  Node1: CF1: n = 2, LS = 0.75, SS = 0.313 → Leaf1: R ≈ 0.126, {x1 = 0.5, x2 = 0.25}; CF2: n = 1, LS = 0.0, SS = 0 → Leaf2: R = 0, {x3 = 0}
  Node2: CF3: n = 1, LS = 0.65, SS = 0.423 → Leaf3: R = 0, {x4 = 0.65}
4. Hierarchical: BIRCH
Example:
• The fifth data value x5 = 1 is entered.
• BIRCH compares x5 = 1 with the locations of CF12 and CF3.
• We have x̄_CF12 = 0.75/3 = 0.25 and x̄_CF3 = 0.65/1 = 0.65. The data point x5 = 1 is thus closer to CF3 than to CF12.
• BIRCH passes x5 to CF3. The radius of CF3 would now increase to R = 0.175 > T = 0.15, so x5 cannot be assigned to CF3.
• Instead, a new leaf node, Leaf4, is initialized, with CF entry CF4 containing x5 only. The summary statistics for CF34 are updated.
4. Hierarchical: BIRCH
Example:
• Tree after x5 (T = 0.15, L = 2, B = 2):
  Root node: CF12: n = 3, LS = 0.75, SS = 0.313 (→ Node1); CF34: n = 2, LS = 1.65, SS = 1.423 (→ Node2)
  Node1: CF1: n = 2, LS = 0.75, SS = 0.313 → Leaf1: R ≈ 0.126, {x1 = 0.5, x2 = 0.25}; CF2: n = 1, LS = 0.0, SS = 0 → Leaf2: R = 0, {x3 = 0}
  Node2: CF3: n = 1, LS = 0.65, SS = 0.423 → Leaf3: R = 0, {x4 = 0.65}; CF4: n = 1, LS = 1, SS = 1 → Leaf4: R = 0, {x5 = 1}
4. Hierarchical: BIRCH
Example:
• The sixth data value x6 = 1.4 is entered. At the root node, BIRCH compares x6 = 1.4 with the locations of CF12 and CF34.
• We have x̄_CF12 = 0.75/3 = 0.25 and x̄_CF34 = 1.65/2 = 0.825. The data point x6 = 1.4 is thus closer to CF34, and BIRCH passes x6 to CF34.
• The record descends to Node2, and BIRCH compares x6 = 1.4 with the locations of CF3 and CF4.
• We have x̄_CF3 = 0.65 and x̄_CF4 = 1. The data point x6 = 1.4 is thus closer to CF4 than to CF3. BIRCH tentatively passes x6 to CF4.
• The radius of CF4 would now increase to R = 0.2 > T = 0.15. The threshold T = 0.15 is exceeded, so x6 is not assigned to CF4. But the branching factor B = 2 means that we may have at most two leaf nodes branching off of any non-leaf node. Therefore, we need a new set of non-leaf nodes, Node2.1 and Node2.2, branching off from Node2.
• Node2.1 contains CF3 and CF4, while Node2.2 contains the desired new CF5 and the new leaf node Leaf5 as its only child, containing only the information from x6.
4. Hierarchical: BIRCH
Example:
• Tree after x6 (T = 0.15, L = 2, B = 2):
  Root node: CF12: n = 3, LS = 0.75, SS = 0.313 (→ Node1); CF345: n = 3, LS = 3.05, SS = 3.393 (→ Node2)
  Node1: CF1: n = 2, LS = 0.75, SS = 0.313 → Leaf1: R ≈ 0.126, {x1 = 0.5, x2 = 0.25}; CF2: n = 1, LS = 0.0, SS = 0 → Leaf2: R = 0, {x3 = 0}
  Node2: CF34: n = 2, LS = 1.65, SS = 1.423 (→ Node2.1); CF5: n = 1, LS = 1.4, SS = 1.96 (→ Node2.2)
  Node2.1: CF3: n = 1, LS = 0.65, SS = 0.423 → Leaf3: R = 0, {x4 = 0.65}; CF4: n = 1, LS = 1, SS = 1 → Leaf4: R = 0, {x5 = 1}
  Node2.2: CF5: n = 1, LS = 1.4, SS = 1.96 → Leaf5: R = 0, {x6 = 1.4}
4. Hierarchical: BIRCH
Example:
• Finally, the seventh data value x7 = 1.1 is entered. It descends via CF345 and CF34 to Node2.1, where it is absorbed into CF4/Leaf4 together with x5, since the radius of Leaf4 only grows to R = 0.05 < T = 0.15.
• Final tree (T = 0.15, L = 2, B = 2):
  Root node: CF12: n = 3, LS = 0.75, SS = 0.313 (→ Node1); CF345: n = 4, LS = 4.15, SS = 4.212 (→ Node2)
  Node1: CF1: n = 2, LS = 0.75, SS = 0.313 → Leaf1: R ≈ 0.126, {x1 = 0.5, x2 = 0.25}; CF2: n = 1, LS = 0.0, SS = 0 → Leaf2: R = 0, {x3 = 0}
  Node2: CF34: n = 3, LS = 2.75, SS = 2.252 (→ Node2.1); CF5: n = 1, LS = 1.4, SS = 1.96 (→ Node2.2)
  Node2.1: CF3: n = 1, LS = 0.65, SS = 0.423 → Leaf3: R = 0, {x4 = 0.65}; CF4: n = 2, LS = 2.1, SS = 2.21 → Leaf4: R = 0.05, {x5 = 1, x7 = 1.1}
  Node2.2: CF5: n = 1, LS = 1.4, SS = 1.96 → Leaf5: R = 0, {x6 = 1.4}
Summary
• Data sets are made up of data objects. A data object represents an
entity. Data objects are described by attributes. Attributes can be
nominal, binary, ordinal, or numeric.
• The values of a nominal (or categorical) attribute are symbols or names
of things, where each value represents some kind of category, code, or
state.
• Binary attributes are nominal attributes with only two possible states
(such as 1 and 0 or true and false). If the two states are equally
important, the attribute is symmetric; otherwise it is asymmetric.
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Summary
• A numeric attribute is quantitative (i.e., it is a measurable quantity)
represented in integer or real values. Numeric attribute types can be
interval-scaled or ratio-scaled. The values of an interval-scaled attribute
are measured in fixed and equal units. Ratio-scaled attributes are
numeric attributes with an inherent zero-point. Measurements are
ratio-scaled in that we can speak of values as being an order of
magnitude larger than the unit of measurement.
• Measures of object similarity and dissimilarity are used in data mining
applications such as clustering, outlier analysis, and nearest-neighbor
classification. Such measures of proximity can be computed for each
attribute type, or for combinations of such attributes. Examples include the Jaccard coefficient for asymmetric binary attributes and the Euclidean, Manhattan, and Minkowski distances. For applications involving sparse numeric data vectors, such as term-frequency vectors, the cosine measure is often used.
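The proximity measures named in this summary can each be written in a couple of lines; a sketch follows (the Jaccard coefficient here is the asymmetric-binary variant that ignores 0-0 matches).

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(x, y):
    """Jaccard coefficient for asymmetric binary vectors (0/1): ignores 0-0 matches."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    return (x & y).sum() / (x | y).sum()

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])
print(minkowski(x, y, 1), minkowski(x, y, 2), cosine_similarity(x, y))
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))   # 2 shared 1s out of 4 attributes present -> 0.5
```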
