CHAPTER IV: CLUSTERING
In this chapter we introduce a model based on unsupervised learning: clustering.
4.1 INTRODUCTION TO CLUSTERING:
In general, clustering is the task of grouping, in an unsupervised way (i.e. without the
prior help of an expert), a set of objects or, more broadly, data, in such a way that the objects in
the same group, called a cluster, are closer (in the sense of a chosen (dis)similarity criterion)
to each other than to those in other groups. It is a key task in exploratory data mining and a
statistical data analysis technique widely used in many fields, including machine learning,
pattern recognition, signal and image processing, and information retrieval.
4.2 DEFINITION OF CLUSTERING:
Let D = {Xi ∈ R^d, i = 1, ..., N} be a set of observations described by d attributes. R^d is the
representation space: each observation Xi is represented by a vector of dimension d.
Definition : Clustering.
Clustering, also known as unsupervised automatic classification, is an exploratory data
analysis technique aimed at structuring data into homogeneous classes (clusters): the aim is
to group the data in the set D into clusters (classes) such that the data within a cluster are as
similar as possible.
Figure 4.1 illustrates the purpose of clustering: exploring unlabelled data in order to form
clusters.
Fig 4.1 Aim of clustering: grouping data into clusters (left: unlabelled data; right: clustered data)
4.3 WHAT DATA?
The data processed in automatic classification can be images, signals, texts, etc. Often,
the data are multidimensional: each datum is made up of several variables (descriptors), as in
the representation of a colour. In the case of standard multidimensional data, each datum is of
dimension d (i.e. a point in the space R^d) and may be labelled if its class is known. The n data
items can therefore be modelled as an n × d matrix. In the case of a colour image with n rows
and m columns, there are n × m colour pixels, each pixel being made up of three RGB
components.
Another example of data that can be used for clustering is Quinlan's "weather" dataset
(see section 1.8). In this case, the data are numeric (Temperature and Humidity), nominal
(Sky) or Boolean (Wind).
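Before such mixed data can be clustered, each record usually has to be converted into a numeric vector so that a distance can be computed. Here is a minimal sketch of one common encoding (the attribute names and the list of Sky values are illustrative assumptions, not taken from the original dataset): the nominal attribute is one-hot encoded and the Boolean one is mapped to 0/1.

# Minimal sketch: encoding one "weather"-style record as a numeric vector.
# The attribute names and the SKY_VALUES list below are illustrative assumptions.
SKY_VALUES = ["sunny", "overcast", "rainy"]

def encode(record):
    """Turn a mixed-type record (dict) into a list of numbers usable by a distance."""
    vec = [record["temperature"], record["humidity"]]                 # numeric attributes kept as-is
    vec += [1.0 if record["sky"] == v else 0.0 for v in SKY_VALUES]   # nominal -> one-hot
    vec.append(1.0 if record["wind"] else 0.0)                        # Boolean -> 0/1
    return vec

print(encode({"temperature": 21, "humidity": 65, "sky": "sunny", "wind": False}))
# [21, 65, 1.0, 0.0, 0.0, 0.0]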
4.4 WHAT IS A CLUSTER?
Here is a definition of a cluster.
Definition: Cluster.
A cluster (group or class) is a set of homogeneous data, in other words data between which
there is a certain similarity.
Clusters can be constructed using a similarity function or a dissimilarity function. In the first
case, clusters are formed by elements between which there is greater similarity. In the second
case, the elements in a cluster are less dissimilar.
Often, dissimilarity measures are used to create clusters. These measures can be distances,
densities or probabilities.
4.5 CLUSTERING APPLICATIONS
Table 4.1 gives examples of clustering applications. For each field, it specifies the data
processed and the meaning of the targeted clusters.
Table 4.1 Clustering applications

Field              | Data                                     | Targeted clusters
Text Mining        | Texts, mails                             | Related texts; automatic document classification
Web Mining         | Texts or images                          | Related web pages
Bioinformatics     | Genes                                    | Similar genes
Marketing          | Customer information, products purchased | Customer segmentation
Image segmentation | Images                                   | Homogeneous zones in the image
4.6 HOW MANY CLUSTERS ?
The number of clusters (k) can be assumed to be fixed (given by the user). This is the
case, for example, if we are interested in classifying images of handwritten numbers. We
know that number of classes is 10 (numbers: 0, 1, ..., 9), or handwritten letters (number of
classes = number of characters in the alphabet), etc. However, there are criteria, known as
model selection criteria, which can be used to choose the optimum number of classes in a
dataset [11].
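As an illustration of the idea (not of a specific criterion from [11]), a simple heuristic is the so-called elbow method: run the clustering for several values of k and keep the value beyond which the intra-cluster inertia stops decreasing sharply. Here is a minimal sketch, assuming scikit-learn is available; the three-group synthetic data are made up only for the example.

# Minimal sketch of the "elbow" heuristic for choosing k (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])          # three synthetic groups

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))   # inertia_ = total squared distance to the centres
# The inertia typically drops sharply up to the true number of groups (3 here),
# then flattens; the "elbow" of this curve is the chosen k.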
4.7 CONCEPTS OF SIMILARITY, DISSIMILARITY AND DISTANCE:
Two pieces of data are said to be similar if a certain measure can be found that assesses their
closeness (similarity).
Let SM denote the similarity measure: the similarity between the data increases as the SM
value increases.
Data can also be compared using a dissimilarity measure (denoted DM): the similarity
between the data increases as the DM value decreases. In practice, DM is usually a distance
measure.
Before presenting the main distance measures used in research, recall that each datum
Xi ∈ R^d is represented by a vector of dimension d.
Several distance measures can be found in the literature to calculate the distance between
two data points Xi and Xj. We will limit ourselves to the Minkowski, Euclidean and
Manhattan distances.
Minkowski distance: it is defined by the following expression, where q is a
parameter. It is a generalisation of the Euclidean, Manhattan and Chebyshev distances.

$d_q(X_i, X_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^q \right)^{1/q}$   (1)

Euclidean distance: it is obtained from the Minkowski distance by taking q = 2.

$d_2(X_i, X_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}$   (2)

Manhattan distance: it is obtained from the Minkowski distance by taking q = 1.

$d_1(X_i, X_j) = \sum_{l=1}^{d} |x_{il} - x_{jl}|$   (3)
The Euclidean distance is by far the most widely used in the literature.
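To make expressions (1) to (3) concrete, here is a minimal sketch in plain Python (the function names are ours); the example reuses two points, A and C, from Table 4.2 further below.

# Minimal sketch of the Minkowski, Euclidean and Manhattan distances (expressions 1-3).
def minkowski(x, y, q):
    """Minkowski distance of order q between two d-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)   # special case q = 2

def manhattan(x, y):
    return minkowski(x, y, 1)   # special case q = 1

A, C = (1, 1), (4, 3)           # two points of R^2 (products A and C of Table 4.2)
print(euclidean(A, C))          # 3.605... (the value of about 3.61 used in section 4.9.2)
print(manhattan(A, C))          # 5.0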
4.8 CLUSTERING QUALITY EVALUATION:
Several measures have been proposed to evaluate the quality of a clustering. These include
intra-cluster inertia and inter-cluster inertia.
4.8.1 Intra-cluster inertia:
Intra-cluster inertia measures the concentration of the points of each cluster around the
cluster's centre of gravity. It is desirable to minimise intra-cluster inertia.
Each cluster Ck is characterised by its centre of gravity µk and its inertia Ik.
Centre of gravity:

$\mu_k = \frac{1}{n_k} \sum_{X_i \in C_k} X_i$   (4)

with $n_k = \mathrm{cardinality}(C_k)$.
Inertia Ik:

$I_k = \frac{1}{n_k} \sum_{X_i \in C_k} d^2(X_i, \mu_k)$   (5)

The inertia of a cluster is therefore equal to the variance of the points in that cluster.
Intra-cluster inertia is expressed as:

$I_{\mathrm{intra}} = \sum_{k=1}^{K} I_k$   (6)

Intra-cluster inertia is therefore equal to the sum of the inertias of all the clusters.
4.8.2 Inter-cluster inertia:
Inter-cluster inertia measures how far apart the cluster centres are from one another. It is
desirable to maximise inter-cluster inertia.
Let µ be the centre of gravity of the entire dataset:

$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$   (7)

where N is the total number of data points.
The centres of gravity of the clusters also form a cloud of points, characterised by the inter-
cluster inertia defined by the following expression:

$I_{\mathrm{inter}} = \sum_{k=1}^{K} \frac{n_k}{N}\, d^2(\mu_k, \mu)$   (8)
According to what we have just presented, in order to have good quality clustering, we need
to minimise intra-cluster inertia and maximise inter-cluster inertia.
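To make expressions (4) to (8) concrete, here is a minimal NumPy sketch (our own code, using the normalised definitions given above) that computes the intra- and inter-cluster inertia of a given partition; it is run on the small dataset of Table 4.2 used later in this chapter.

# Minimal sketch: intra- and inter-cluster inertia (expressions 4 to 8) with NumPy.
import numpy as np

def inertias(X, labels):
    """X: (N, d) array of data; labels: cluster index of each point."""
    mu = X.mean(axis=0)                                    # global centre of gravity (7)
    intra, inter, N = 0.0, 0.0, len(X)
    for k in np.unique(labels):
        Ck = X[labels == k]
        mu_k = Ck.mean(axis=0)                             # cluster centre of gravity (4)
        intra += ((Ck - mu_k) ** 2).sum() / len(Ck)        # cluster inertia (5), summed (6)
        inter += len(Ck) / N * ((mu_k - mu) ** 2).sum()    # inter-cluster contribution (8)
    return intra, inter

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # the data of Table 4.2
print(inertias(X, np.array([0, 0, 1, 1])))   # (0.75, 3.8125): low intra, high inter
print(inertias(X, np.array([0, 1, 0, 1])))   # (7.75, 0.3125): high intra, low inter

The first partition ({A, B} and {C, D}) has low intra-cluster inertia and high inter-cluster inertia; the mixed partition behaves the other way round, which is exactly the quality criterion stated above.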
Example: Figure 4.2 below shows two clusterings obtained on the same dataset. The
clustering in panel B is better than that in panel A: it has lower intra-cluster inertia and
higher inter-cluster inertia.
Fig 4.2 Intra-cluster inertia and inter-cluster inertia
4.9 CLUSTERING ALGORITHMS:
There are several clustering algorithms, using different approaches. The most important
categories include hierarchical clustering, density-based clustering and partition-based
clustering.
Hierarchical clustering: builds a multi-level hierarchy of clusters by creating a tree of
clusters, called a dendrogram.
Density-based clustering: groups points that lie close together in areas of high density.
Algorithms based on this approach include DBSCAN (Density-Based Spatial Clustering
of Applications with Noise).
Partition-based clustering: the data are partitioned into k distinct clusters according to
their distance to the cluster centroids. Algorithms based on this approach include
Kmeans, the algorithm presented in the next section.
4.9.1 Kmeans algorithm
Kmeans (also known as the moving centres algorithm) is one of the clustering algorithms
based on partitioning.
This algorithm was designed in 1957 at Bell Laboratories by Stuart P. Lloyd, but was not
presented to the general public until 1982. In 1965, Edward W. Forgy had already published
an essentially similar algorithm, which is why Kmeans is often referred to as the Lloyd-Forgy
algorithm.
Its principle is shown in the pseudocode below.
Algorithm 4.1 Kmeans
Algorithm Kmeans
Input: data set D; k: desired number of clusters.
Output: k formed clusters.
Begin
  Randomly initialise the cluster centres µ1, µ2, ..., µk.
  Repeat
    Assign each point to its nearest cluster:
      Cl ← Xi where l = arg min_j distance(Xi, µj)
    Recalculate the centre µk of each cluster:
      µk = (1/nk) Σ_{Xi ∈ Ck} Xi, with nk = cardinality(Ck)
  Until the centres of gravity (µ1, µ2, ..., µk) become stable
End.
The algorithm takes as input the data to be processed and the number of clusters k. The output
is k clusters.
Initially, the algorithm randomly initialises the centres of the k clusters. It then iteratively
calculates the distance between each data item in the dataset and each of the k centres of
gravity, and assigns the data item to the cluster whose centre is closest. At the end of each
iteration, all the centres of gravity are recalculated.
The algorithm ends when the centres of gravity are stable.
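As an illustration, here is a minimal NumPy sketch of Algorithm 4.1 (our own implementation of the scheme above, using the Euclidean distance); the optional init argument lets us force the initial centres, which will be convenient for checking the worked example of section 4.9.2.

# Minimal sketch of the Kmeans algorithm (Algorithm 4.1) with NumPy.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0, init=None):
    """X: (N, d) data array; k: number of clusters. Returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    if init is None:
        centres = X[rng.choice(len(X), size=k, replace=False)]   # random initial centres
    else:
        centres = np.array(init, dtype=float)                    # user-supplied initial centres
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of the points assigned to it.
        # (This simple sketch assumes no cluster ever becomes empty.)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                    # centres stable: stop
            break
        centres = new_centres
    return centres, labels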
4.9.2 Example of clustering with Kmeans
We will apply the Kmeans algorithm to the following example. The table below shows data
on 4 products (A, B, C and D) described by 2 attributes, X and Y.
Table 4.2 Kmeans application example data

Article | X | Y
A       | 1 | 1
B       | 2 | 1
C       | 4 | 3
D       | 5 | 4
Task: Plot the data and apply Kmeans with k=2.
µ1 and µ2 are the centres of clusters C1 and C2.
Let's randomly take µ1=A and µ2=B as the initial centres. Here is the graph representing the
initial state.
Fig 4.3. Application of Kmeans: random selection of the initial cluster centres of gravity
The first iteration:
Calculation of the Euclidean distance of each datum (point) from the centres of gravity.
Dist(A, µ1) = √((1−1)² + (1−1)²) = 0
Dist(A, µ2) = √((1−2)² + (1−1)²) = 1
Dist(B, µ1) = √((2−1)² + (1−1)²) = 1
Dist(B, µ2) = √((2−2)² + (1−1)²) = 0
Dist(C, µ1) = √((4−1)² + (3−1)²) = √13 ≈ 3.61
Dist(C, µ2) = √((4−2)² + (3−1)²) = √8 ≈ 2.83
Dist(D, µ1) = √((5−1)² + (4−1)²) = √25 = 5
Dist(D, µ2) = √((5−2)² + (4−1)²) = √18 ≈ 4.24
Based on these calculations, we assign:
A to cluster C1,
B to cluster C2,
C to cluster C2,
D to cluster C2.
Here is the composition of the two clusters:
C1 = {A} and C2 = {B, C, D}
The following figure shows the temporary clusters obtained after this first iteration.
Fig 4.4. Kmeans application: after the first iteration.
Recalculating the cluster centres of gravity:
µ1 = (1, 1)
µ2 = ((2+4+5)/3, (1+3+4)/3) ≈ (3.67, 2.67)
The second iteration:
Calculate the Euclidean distance of each datum (point) from the centres of gravity.
Dist(A, µ1) = √((1−1)² + (1−1)²) = 0
Dist(A, µ2) = √((1−3.67)² + (1−2.67)²) ≈ 3.14
Dist(B, µ1) = √((2−1)² + (1−1)²) = 1
Dist(B, µ2) = √((2−3.67)² + (1−2.67)²) ≈ 2.36
Dist(C, µ1) = √((4−1)² + (3−1)²) ≈ 3.61
Dist(C, µ2) = √((4−3.67)² + (3−2.67)²) ≈ 0.47
Dist(D, µ1) = √((5−1)² + (4−1)²) = 5
Dist(D, µ2) = √((5−3.67)² + (4−2.67)²) ≈ 1.89
According to these calculations:
A remains in cluster C1,
B changes cluster and is assigned to C1,
C and D remain in cluster C2.
This gives the following configuration:
C1 = {A, B} and C2 = {C, D}
Recalculating the cluster centres of gravity:
µ1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
µ2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
Fig 4.5. Kmeans application: after the second iteration.
Third iteration:
Calculating the distances:
Dist(A, µ1) = √((1−1.5)² + (1−1)²) = 0.50
Dist(A, µ2) = √((1−4.5)² + (1−3.5)²) ≈ 4.30
Dist(B, µ1) = √((2−1.5)² + (1−1)²) = 0.50
Dist(B, µ2) = √((2−4.5)² + (1−3.5)²) ≈ 3.54
Dist(C, µ1) = √((4−1.5)² + (3−1)²) ≈ 3.20
Dist(C, µ2) = √((4−4.5)² + (3−3.5)²) ≈ 0.71
Dist(D, µ1) = √((5−1.5)² + (4−1)²) ≈ 4.61
Dist(D, µ2) = √((5−4.5)² + (4−3.5)²) ≈ 0.71
According to these calculations, there is no change in the clusters:
C1 = {A, B} and C2 = {C, D}
The centres of gravity are stable (they remain unchanged). The algorithm stops.
Fig 4.6. Kmeans application: final situation.
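As a check, the kmeans sketch given in section 4.9.1 reproduces this result when it is forced to start from the same initial centres A and B.

# Re-running the sketch of section 4.9.1 on the data of Table 4.2, starting from A and B.
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)    # A, B, C, D
centres, labels = kmeans(X, k=2, init=[[1, 1], [2, 1]])        # initial centres = A and B
print(labels)     # [0 0 1 1]  ->  C1 = {A, B}, C2 = {C, D}
print(centres)    # final centres of gravity: (1.5, 1) and (4.5, 3.5)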