Clustering
Introduction
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering is unsupervised classification:
no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data
distribution
– As a preprocessing step for other algorithms
Examples of Clustering
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Financial applications: We might wish to find clusters of
companies that have similar financial performance
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• Medical Application: We might wish to find clusters of
patients with similar symptoms.
• Document retrieval: We might wish to find clusters of
documents with related content
Good Clustering
• A good clustering method will produce clusters with
– High intra-class similarity
– Low inter-class similarity
• Precise definition of clustering quality is difficult
– Application-dependent
– Ultimately subjective
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate
them by some criterion. Examples: k-means, k-medoids
• Hierarchical: Create a hierarchical decomposition of the set
of objects using some criterion. Example: agglomerative clustering
• Model-based: Hypothesize a model for each cluster and
find the best fit of the models to the data. Example: Expectation
Maximization (EM)
• Density-based: Guided by connectivity and density
functions. Examples: DBSCAN, OPTICS, DenClue
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of
n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented
by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman
& Rousseeuw, 1987): Each cluster is represented by one
of the objects in the cluster
K-Means Clustering: The Concept of the Center (Centroid)
Assuming that we are using Euclidean distance or something
similar as a measure, we can define the centroid of a cluster
to be the point for which each attribute value is the average
of the values of the corresponding attribute for all the points
in the cluster.
For example, the centroid of four points with six attributes is the
point whose six values are the attribute-wise averages of those points.
K-Means Clustering: The Concept of the Center (Centroid)
• The centroid of a cluster will sometimes be one of the
points in the cluster, but frequently, as in the above
example, it will be an ‘imaginary’ point, not part of the
cluster itself, which we can take as marking its center.
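As a rough illustration (the four points below are hypothetical values, not the ones from the original slide's table), the centroid can be computed as the attribute-wise mean:

```python
import numpy as np

# Hypothetical cluster of four points, each with six attributes.
cluster = np.array([
    [8.0, 7.2, 0.3, 23.1, 11.1, 6.5],
    [2.0, 3.4, 0.8, 24.2, 18.3, 4.7],
    [4.2, 4.1, 0.1, 20.7, 12.9, 6.1],
    [6.1, 5.0, 0.4, 22.0, 17.0, 5.9],
])

# The centroid is the column-wise (attribute-wise) mean: one value per attribute.
centroid = cluster.mean(axis=0)
print(centroid)  # a six-value point that is usually not one of the cluster's own objects
```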
K-Means Clustering
• Given k, the k-means algorithm consists of four steps:
– Select initial centroids at random.
– Assign each object to the cluster with the nearest
centroid.
– Compute each centroid as the mean of the objects
assigned to it.
– Repeat the previous two steps until nothing changes (see the sketch below).
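A minimal sketch of these four steps (hypothetical code, assuming the objects are rows of a NumPy array and ignoring the empty-cluster corner case):

```python
import numpy as np

def kmeans(points, k, seed=0):
    """Minimal k-means sketch: random initial centroids, then assign/recompute."""
    rng = np.random.default_rng(seed)
    # Step 1: select k of the objects at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the objects assigned to it.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

# Example use on 16 hypothetical 2-D objects:
# labels, centroids = kmeans(np.random.default_rng(1).random((16, 2)), k=3)
```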
K-Means Clustering
Distance Measurement
The distance between an object and a centroid is typically measured
with Euclidean distance: for two points p and q with m attributes,
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pm - qm)^2)
Example: K-Means Clustering
• We will illustrate the
k-means algorithm
by using it to cluster
the 16 objects with
two attributes x and
y, as shown.
• These points are shown in a
two-dimensional plane on the
next slide.
Example: K-Means Clustering
Example: K-Means Clustering
Three of the points shown in the
table have been surrounded by
small circles. We will assume that
we have chosen k = 3 and that these
three points have been selected to
be the locations of the initial three
centroids. This initial (fairly
arbitrary) choice is shown in Figure
on the previous slide.
Example: K-Means Clustering
Example: K-Means Clustering
The columns headed d1,
d2 and d3 in this table
show the Euclidean
distance of each of the 16
points from the three
centroids.
The column headed
‘cluster’ indicates the
centroid closest to each
point and thus the cluster
to which it should be
assigned.
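A rough sketch of how such a distance table can be computed (the coordinates below are hypothetical placeholders; the real 16 objects and initial centroids are those of the worked example):

```python
import numpy as np

# Hypothetical objects and three initial centroids (illustrative values only).
points = np.array([[1.0, 2.0], [8.0, 9.0], [2.5, 1.5], [7.5, 8.0]])
centroids = np.array([[1.5, 2.0], [8.0, 8.5], [5.0, 5.0]])

# d1, d2, d3: Euclidean distance from each object to each of the three centroids.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# 'cluster' column: the nearest centroid for each object (1-based, as in the table).
cluster = dists.argmin(axis=1) + 1

for p, d, c in zip(points, dists, cluster):
    print(p, d.round(2), c)
```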
Example: K-Means Clustering
The resulting clusters are shown in the figure. The centroids used so far
are actual objects within the clusters, so the previous centroids were not
true centroids of the clusters formed.
Example: K-Means Clustering
We next calculate the centroids of the three clusters using the
x and y values of the objects currently assigned to each one.
The three centroids have all been moved by the assignment
process, but the movement of the third one is appreciably less
than for the other two.
Example: K-Means Clustering
The next step is to reassign the 16 objects to the three clusters
by determining which centroid is closest to each one. This
gives the revised set of clusters as shown below. However, the
new centroids are not real objects (not actual data points). The object
at (8.3, 6.9) has moved from cluster 2 to cluster 1.
Example: K-Means Clustering
We next recalculate the positions of the three centroids, giving
the set of new centroids, as shown below. The first two
centroids have moved a little, but the third has not moved at
all.
Example: K-Means Clustering
The next step is to assign the 16 objects to clusters again.
These are the same clusters as before. Their centroids will be
the same as those from which the clusters were generated.
Hence the stopping criterion has been met.
Example: K-Means Clustering
The three clusters for the initially (randomly) chosen three
centroids have now been formed.
It is now clear that the formation of the clusters depends heavily
on the number of centroids (k) as well as on their initial
placement.
Choosing the best possible k
k-means has no built-in preference for the right number of
clusters; the following are some of the common ways k can be
selected.
1. Domain knowledge – If the problem requires or prefers a
certain number or range of clusters, that can be used to
select k. For instance, a business may prefer three customer
segments (H/M/L).
2. Rule of thumb – A very rough rule of thumb is k ≈ sqrt(n/2),
where n is the number of data points, but in reality this is
rarely useful. For the 16-object example above, this rule gives
k ≈ sqrt(16/2) ≈ 3.
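A tiny check of that arithmetic (assuming the sqrt(n/2) rule stated above):

```python
import math

n = 16                       # number of objects in the worked example
k = round(math.sqrt(n / 2))  # sqrt(16/2) is about 2.83, so the rule suggests k = 3
print(k)
```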
Choosing the best possible k
3. Cluster Quality using Silhouette Coefficient
The silhouette coefficient is a measure of the compactness and
separation of the clusters. It increases as the quality of the
clusters increases; it is large for compact clusters that are far
from each other and small for large, overlapping clusters.
The silhouette coefficient is calculated per instance; for a set of
instances, it is calculated as the mean of the individual samples'
scores. The silhouette coefficient for an instance is calculated
with the following equation:
Si = (bi - ai) / max(ai, bi)
where ai and bi are defined on the next slide.
Silhouette Coefficient
For the ith object in a cluster A, calculate its average distance
to all other objects in its cluster. This gives us ai.
For the ith object in cluster A and every other cluster B, calculate
the mean (average) distance from the ith object to all the objects
in cluster B. Find the minimum such value over all other clusters;
call this bi.
The silhouette coefficient for the ith object is then
Si = (bi - ai) / max(ai, bi)
The average of all the silhouette coefficients gives the quality
of the clustering.
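As a rough illustration of computing the mean silhouette coefficient for a clustering (a sketch assuming scikit-learn is available; the data below are hypothetical synthetic points):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D data drawn around three centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

# Cluster with k-means and report the mean silhouette coefficient
# (ranges from -1 to 1; higher means more compact, better-separated clusters).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))
```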
Choosing the best possible k
Elbow-Method (using Within Cluster Sum of Squares)
1. Run the clustering algorithm (e.g., k-means) for different
values of k, for instance varying k from 1 to 10.
2. For each k, calculate the total within-cluster sum of squares (WCSS).
3. Plot the curve of WCSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered
an indicator of the appropriate number of clusters (see the sketch below).
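A rough sketch of these steps (assuming scikit-learn and matplotlib, with hypothetical synthetic data; scikit-learn exposes the WCSS of a fitted model as inertia_):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical 2-D data drawn around three centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.6, size=(30, 2)) for loc in ([0, 0], [6, 1], [3, 7])])

# Steps 1-2: run k-means for k = 1..10 and record the WCSS for each k.
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Steps 3-4: plot WCSS against k and look for the bend ("elbow").
plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```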
Advantages of K-Means Clustering
k-means clustering is popular and widely adopted due to
its simplicity and ease of implementation.
It is also efficient: its time complexity is O(ikn), where n is the
number of data points, k is the number of clusters, and i is the
number of iterations.
Disadvantages of K-Means Clustering
The value of k is always a user input.
The algorithm is applicable only when a mean can be computed;
for categorical data the centroids have to be replaced by
something like the most frequent values (modes).
The clusters identified are very sensitive to the initially
chosen centers.
k-means is very sensitive to outliers.
K-means variations
• K-medoids – instead of the mean, each cluster is represented
by a medoid, an actual object near the middle of the cluster;
the motivation is the same as preferring the median to the mean:
– Mean of 1, 3, 5, 7, 9 is 5
– Mean of 1, 3, 5, 7, 1009 is 205
– Median of 1, 3, 5, 7, 1009 is 5
– Median advantage: not affected by extreme
values
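A quick check of those numbers:

```python
import numpy as np

values = np.array([1, 3, 5, 7, 1009])
print(np.mean(values))    # 205.0 -- dragged upward by the extreme value
print(np.median(values))  # 5.0   -- unaffected by the extreme value
```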
k-Medoids (PAM) Algorithm
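A rough PAM-style sketch (a simplified greedy-swap version for illustration, not necessarily PAM's exact original procedure): each cluster is represented by a medoid, and a medoid is swapped with a non-medoid whenever the swap lowers the total distance of all objects to their nearest medoid.

```python
import numpy as np

def pam(points, k, seed=0):
    """Simplified PAM-style k-medoids: greedily swap medoids while the
    total distance of objects to their nearest medoid keeps decreasing."""
    rng = np.random.default_rng(seed)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    def cost(medoids):
        # Total distance of every object to its nearest medoid.
        return dist[:, medoids].min(axis=1).sum()

    # Start from k randomly chosen objects as the medoids.
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = candidate
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    labels = dist[:, medoids].argmin(axis=1)
    return labels, points[medoids]
```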
Problem with PAM
• PAM is more robust than k-means in the presence of
noise and outliers
• PAM works efficiently for small data sets but does not
scale well to large data sets
Sampling-based method:
CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications)
• CLARA (Kaufman and Rousseeuw, 1990)
• It draws multiple samples of the data set, applies PAM on
each sample, and returns the best clustering as the output
(a sketch of the sampling loop follows the list below)
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
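A rough sketch of that sampling loop (hypothetical; it reuses the pam() sketch shown with the k-medoids slide above and scores each candidate set of medoids against the whole data set):

```python
import numpy as np

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run PAM on several random samples and keep the
    medoids that give the lowest total cost on the WHOLE data set."""
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for s in range(n_samples):
        idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
        _, medoids = pam(points[idx], k, seed=seed + s)  # pam() as sketched earlier
        # Score the candidate medoids against every object, not just the sample.
        cost = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = np.linalg.norm(points[:, None, :] - best_medoids[None, :, :], axis=2).argmin(axis=1)
    return labels, best_medoids
```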