Clustering
Gilles Gasso
INSA Rouen - ASI Department
LITIS Laboratory
September 18, 2019
Plan
1 Introduction
Notion of dissimilarity
Quality of clusters
2 Methods of clustering
Hierarchical clustering
Principle and algorithm
K-means
Principle and algorithm
Variations
Introduction
$\mathcal{D} = \{x_i \in \mathbb{R}^d\}_{i=1}^N$: a set of training samples
Goal: structure the data into homogeneous categories
Group the samples into clusters so that the data within a cluster are as similar as possible
Clustering ≡ unsupervised learning.
Clustering images
https://courses.cs.washington.edu/courses/cse416/18sp/slides/L11_kmeans.pdf
Applications
Field              | Data type                          | Clusters
Text mining        | Texts                              | Close texts
                   | E-mails                            | Automatic folders
Graph mining       | Graphs                             | Social sub-networks
BioInformatics     | Genes                              | Resembling genes
Marketing          | Client profile, purchased products | Customer segmentation
Image segmentation | Images                             | Homogeneous areas in an image
Web log analysis   | Clickstream                        | User profiles
http://images2.programmersought.com/267/fc/fc00092c0966ec1d4b726f60880f9703.png
What is clustering?
How do we define similarity or dissimilarity between samples?
How do we characterize a cluster?
How many clusters?
Which clustering algorithms?
How do we assess a clustering result?
https://image.slidesharecdn.com/k-means-130411020903-phpapp01/95/k-means-clustering-algorithm-4-638.jpg?cb=1365646184
Dissimilarity measure (1)
Concept of dissimilarity
A dissimilarity is a function of the pair $(x_1, x_2)$ with values in $\mathbb{R}^+$ such that
$D(x_1, x_2) = D(x_2, x_1) \geq 0$ and $D(x_1, x_2) = 0 \Rightarrow x_1 = x_2$
Measure of dissimilarity: distance $D(x_1, x_2)$ between two points $x_1$ and $x_2$
Minkowski's distance: $D(x_1, x_2)^q = \sum_{j=1}^{d} |x_{1,j} - x_{2,j}|^q$
Euclidean distance corresponds to $q = 2$: $D(x_1, x_2)^2 = \sum_{j=1}^{d} (x_{1,j} - x_{2,j})^2 = (x_1 - x_2)^\top (x_1 - x_2)$
Manhattan distance ($q = 1$): $D(x_1, x_2) = \sum_{j=1}^{d} |x_{1,j} - x_{2,j}|$
Metric linked to a positive definite matrix $W$: $D^2(x_1, x_2) = (x_1 - x_2)^\top W (x_1 - x_2)$
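A minimal numpy sketch of these distances; the vectors and the positive definite matrix $W$ below are arbitrary examples, not taken from the slides:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 3.5])

q = 3
minkowski = (np.abs(x1 - x2) ** q).sum() ** (1 / q)  # Minkowski: D = (sum |.|^q)^(1/q)
euclidean = np.sqrt(((x1 - x2) ** 2).sum())          # q = 2
manhattan = np.abs(x1 - x2).sum()                    # q = 1

# Metric induced by a positive definite matrix W (hypothetical choice)
W = np.diag([1.0, 0.5, 2.0])
d2_W = (x1 - x2) @ W @ (x1 - x2)                     # squared distance D^2
```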
Dissimilarity measure (2)
The d-dimensional samples $x_1$ and $x_2$ take discrete values
Compute the contingency matrix $A(x_1, x_2) \in \mathbb{R}^{m \times m}$, where $m$ is the number of distinct values and the entry $a_{uv}$ counts the indices $j$ such that $x_{1,j} = u$ and $x_{2,j} = v$
$x_1 = (0\ 1\ 2\ 1\ 2\ 1)^\top$ and $x_2 = (1\ 0\ 2\ 1\ 0\ 1)^\top$
$A(x_1, x_2) = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}$
Hamming's distance: the number of indices where the two vectors differ, i.e. the sum of the off-diagonal entries
$D(x_1, x_2) = \sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} a_{ij}$
Example: $D(x_1, x_2) = 3$
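A small sketch of this computation, assuming integer-coded values:

```python
import numpy as np

x1 = np.array([0, 1, 2, 1, 2, 1])
x2 = np.array([1, 0, 2, 1, 0, 1])

m = max(x1.max(), x2.max()) + 1   # number of distinct values
A = np.zeros((m, m), dtype=int)
for u, v in zip(x1, x2):          # a_uv counts indices with x1 = u and x2 = v
    A[u, v] += 1

hamming = A.sum() - np.trace(A)   # off-diagonal sum = number of differing indices
print(A)                          # [[0 1 0] [1 2 0] [1 0 1]]
print(hamming)                    # 3
```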
Dissimilarity between clusters (1)
Distance $D(C_1, C_2)$ between two clusters $C_1$ and $C_2$
Minimum diameter (nearest neighbor): $D_{min}(C_1, C_2) = \min \{D(x_i, x_j),\ x_i \in C_1,\ x_j \in C_2\}$
Maximum diameter: $D_{max}(C_1, C_2) = \max \{D(x_i, x_j),\ x_i \in C_1,\ x_j \in C_2\}$
[Figure: minimum diameter vs. maximum diameter between two clusters]
Dissimilarity between clusters (2)
Distance $D(C_1, C_2)$ between two clusters $C_1$ and $C_2$
Average distance: $D_{avg}(C_1, C_2) = \frac{\sum_{x_i \in C_1} \sum_{x_j \in C_2} D(x_i, x_j)}{n_1 n_2}$
Ward's distance (between centers): $D_{Ward}(C_1, C_2) = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\, D(\mu_1, \mu_2)$
[Figure: average distance vs. distance between centroids]
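A sketch of these four cluster-level distances with numpy and scipy, assuming the Euclidean point distance and toy clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

C1 = np.random.randn(10, 2)        # two toy clusters
C2 = np.random.randn(15, 2) + 3.0

D = cdist(C1, C2)                  # pairwise Euclidean distances
d_min = D.min()                    # minimum diameter
d_max = D.max()                    # maximum diameter
d_avg = D.mean()                   # average distance
n1, n2 = len(C1), len(C2)
mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)
d_ward = np.sqrt(n1 * n2 / (n1 + n2)) * np.linalg.norm(mu1 - mu2)  # Ward's distance
```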
Characteristics of a good clustering
Every cluster $C_\ell$ is characterized by:
Its centroid: $\mu_\ell = \frac{1}{N_\ell} \sum_{i \in C_\ell} x_i$ with $N_\ell = \mathrm{card}(C_\ell)$
Its inertia: $J_\ell = \sum_{i \in C_\ell} D^2(x_i, \mu_\ell)$
measures how close the points are to $\mu_\ell$. The lower $J_\ell$, the smaller the dispersion of the points around $\mu_\ell$
Within-cluster distance: $J_w = \sum_\ell \sum_{i \in C_\ell} D^2(x_i, \mu_\ell) = \sum_\ell J_\ell$
Let $\mu$ be the mean of all the samples: $\mu = \frac{1}{N} \sum_i x_i$
Between-cluster distance: $J_b = \sum_\ell N_\ell\, D^2(\mu_\ell, \mu)$
measures the "distance" between the clusters. The greater $J_b$, the better separated the clusters
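A sketch computing both criteria, assuming squared Euclidean distances and labels in {0, ..., K-1}:

```python
import numpy as np

def within_between(X, labels):
    """Within-cluster inertia J_w and between-cluster inertia J_b."""
    mu = X.mean(axis=0)                     # global mean of the samples
    Jw, Jb = 0.0, 0.0
    for l in np.unique(labels):
        Xl = X[labels == l]
        mul = Xl.mean(axis=0)               # centroid of cluster l
        Jw += ((Xl - mul) ** 2).sum()       # J_l accumulated into J_w
        Jb += len(Xl) * ((mul - mu) ** 2).sum()
    return Jw, Jb
```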
Illustration
[Figure: four clusters C1–C4 with centroids g1–g4 and the global centroid g]
Total inertia of the points = within-cluster inertia + between-cluster inertia
A good clustering...
is one that minimizes the within-cluster distance and maximizes the between-cluster distance
Approaches of clustering
Hierarchical clustering
K-means clustering
Hierarchical clustering: principle
Principle: bottom-up approach
Each point or cluster is gradually "agglomerated" into the nearest cluster.
Algorithm
Initialization:
Each sample represents a cluster,
Compute the pairwise distance matrix M with $M_{ij} = D(x_i, x_j)$
Repeat
Select from M the two closest clusters $C_I$ and $C_J$
Merge $C_I$ and $C_J$ into a cluster $C_G$
Update M by computing the distance between $C_G$ and the remaining clusters
Until all samples are merged into one cluster
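This agglomeration is what scipy's hierarchy module implements; a minimal sketch on synthetic data, with Ward linkage chosen as one example:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.randn(20, 2)                       # toy samples

Z = linkage(X, method='ward')                    # the successive merges
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the hierarchy into 3 clusters

dendrogram(Z)                                    # plot the merge tree
plt.show()
```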
Hierarchical clustering: illustration
[Figure: samples, indexed dendrogram over the successive clusters C1–C10, and the resulting clustering]
Dendrogram = representation of the successive merges
Height of a cluster in the dendrogram = distance between the two clusters before they merge
Linkage criterion
Common metrics used for merging two clusters
Single linkage (minimum), based on $D_{min}(C_1, C_2)$
tends to produce a few large clusters (chaining effect)
sensitive to noisy data
Complete linkage (maximum), based on $D_{max}(C_1, C_2)$
tends to produce compact clusters (only very close clusters are merged together)
sensitive to noisy data
Average linkage, based on $D_{avg}(C_1, C_2)$
tends to produce clusters with close variance
Ward's distance $D_{Ward}(C_1, C_2)$
tends to minimize the within-cluster variance of the clusters being merged
Influence of linkage criterion (1)
[Figure: six points A–F in the plane (Euclidean metric) and the resulting dendrograms for single linkage (merge heights 0.5, 0.7, 0.9, 1.1) and complete linkage (merge heights 0.5, 1.7, 2.8, 4.0); the two criteria also order the leaves differently]
The clustering result may change with the selected linkage criterion, as the sketch below illustrates
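A quick way to see this, sketched with scikit-learn on toy data (cluster count fixed at 2 for illustration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.randn(30, 2)
X[15:] += 4.0                     # two loose groups

for link in ('single', 'complete', 'average', 'ward'):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, labels)           # the partitions may differ across linkages
```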
Clustering by data partitioning
Goal
$\mathcal{D} = \{x_i \in \mathbb{R}^d\}_{i=1}^N$: a set of training samples
Search for a partition into K clusters (with $K < N$)
Direct approach
Build all possible partitions
Retain the best partition among them
NP-hard problem
The number of possible partitions increases exponentially:
$\#\text{partitions} = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N$
For $N = 10$ and $K = 4$, there are 34105 possible partitions!
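A one-off check of this formula (it is the Stirling number of the second kind):

```python
from math import comb, factorial

def n_partitions(N, K):
    """Number of ways to partition N samples into K non-empty clusters."""
    s = sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1))
    return s // factorial(K)

print(n_partitions(10, 4))  # 34105
```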
Data partitioning
Workaround solution
Determine the K clusters $\{C_\ell\}_{\ell=1}^K$ and their centers $\{\mu_\ell\}_{\ell=1}^K$ that minimize the within-cluster distance $J_w$:
$$\min_{\{C_\ell\}_{\ell=1}^K,\ \{\mu_\ell\}_{\ell=1}^K} \sum_{\ell=1}^{K} \sum_{i \in C_\ell} \|x_i - \mu_\ell\|^2$$
Global solution: NP-hard problem
A local solution (not necessarily the optimal partition) can be attained using a simple algorithm: K-means
K-means clustering
A popular clustering algorithm
Principle
Assume the centroids $\mu_\ell$, $\ell = 1, \cdots, K$ are fixed:
assign each point $x_i$ to one and only one cluster
$x_i$ is assigned to the closest cluster $C_\ell$ (according to the distance between $x_i$ and the cluster's center $\mu_\ell$)
Given the clusters $C_\ell$, $\ell = 1, \cdots, K$:
estimate their centers $\mu_\ell$, $\ell = 1, \cdots, K$
Repeat the previous steps until convergence
K-Means: illustration
Clustering into K = 2 clusters
[Figure: ground-truth data, the initialization, and the clusters obtained at iterations 1, 2, 3, and 5]
K-Means: Lloyd's algorithm
Initialize the centers $\mu_1, \cdots, \mu_K$
Repeat
Assign each point $x_i$ to the closest cluster:
$\forall i \in \{1, \cdots, N\} \quad s_i \leftarrow \arg\min_\ell \|x_i - \mu_\ell\|^2 \quad \text{and} \quad C_\ell = \{i : s_i = \ell\}$
Compute the center $\mu_\ell$ of each cluster:
$\mu_\ell = \frac{1}{N_\ell} \sum_{i \in C_\ell} x_i \quad \text{with} \quad N_\ell = \mathrm{card}(C_\ell)$
Until convergence
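A compact numpy sketch of this loop, with random initialization among the samples and convergence tested on the assignments:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # init among samples
    s = None
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared dists
        s_new = d2.argmin(axis=1)              # s_i <- argmin_l ||x_i - mu_l||^2
        if s is not None and np.array_equal(s_new, s):
            break                              # assignments stable: converged
        s = s_new
        for l in range(K):                     # mu_l <- mean of cluster C_l
            if np.any(s == l):
                mu[l] = X[s == l].mean(axis=0)
    return mu, s
```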
K-Means: example
[Figure: a K-means run; the initial centers are shown as plain yellow squares]
K-Means: example
⇒ Different initializations lead to different partitions!
K-Means: remarks and limitations
The criterion $J_w$ decreases at each iteration.
The algorithm converges to (at least) a local minimum of $J_w$
Initialization of $\mu_k$:
select randomly within the range of definition of the $x_i$
select randomly among the $x_i$
Different initializations can lead to different clusters (convergence to a local minimum); see the sketch below
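One way to observe this, sketched with scikit-learn (one random initialization per run, so different local minima can be reached):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(200, 2)
X[:100] += 3.0

for seed in range(3):
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)      # J_w can differ across initializations
```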
K-Means: some issues
Number of clusters
Difficult to assess the number of clusters because of the lack of a ground truth to compare with
Fixed a priori (e.g., we want to split customers into K segments)
Use the "elbow trick" on the variation of $J_w(K)$ w.r.t. K
Use ad-hoc metrics such as the silhouette score (both heuristics are sketched below)
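A sketch of both heuristics with scikit-learn, scanning K over a small range on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.randn(300, 2)
X[:150] += 4.0

for K in range(2, 8):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(K, km.inertia_, sil)   # look for the elbow in J_w(K) and the silhouette peak
```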
Sequential K-Means
Adaptation of K-means to streaming samples
Initialize $\mu_1, \cdots, \mu_K$
Repeat
acquire a new sample $x$
assign $x$ to the nearest cluster:
$C_\ell \leftarrow C_\ell \cup \{x\}$ with $\ell = \arg\min_k \|x - \mu_k\|^2$
increment $n_\ell$
update the center $\mu_\ell$ of this cluster:
$\mu_\ell \leftarrow \mu_\ell + \frac{1}{n_\ell}(x - \mu_\ell)$
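A minimal sketch of this online update, with the centers assumed already initialized:

```python
import numpy as np

class SequentialKMeans:
    """Online K-means: one incremental centroid update per incoming sample."""
    def __init__(self, init_centers):
        self.mu = np.asarray(init_centers, dtype=float)
        self.n = np.zeros(len(self.mu), dtype=int)       # samples seen per cluster

    def update(self, x):
        l = np.argmin(((self.mu - x) ** 2).sum(axis=1))  # nearest center
        self.n[l] += 1
        self.mu[l] += (x - self.mu[l]) / self.n[l]       # running-mean update
        return l
```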
Conclusion
Clustering: unsupervised learning
Grouping of data into homogeneous clusters
The number of clusters is a hyper-parameter that depends on the application, but it can be chosen using metrics such as the silhouette score
Several algorithms: hierarchical clustering, K-means, but also DBSCAN, spectral clustering, ...