F OUNDATIONS OF M ACHINE L EARNING
M.S C . IN D ATA S CIENCES AND B USINESS A NALYTICS
C ENTRALE S UP ÉLEC
Lab 7: k-Means and Spectral Clustering Algorithms
Instructor: Fragkiskos Malliaros
TA: Benjamin Maheu
December 2, 2021
1 Description
In this lab, we will study unsupervised learning techniques, focusing on two well-known clustering
algorithms: (i) k-Means and (ii) Spectral Clustering. Initially, we discuss the basic characteristics of each
algorithm and then we examine how they can be applied to identify the underlying clustering structure
of a dataset.
2 k-Means Algorithm
k-means is one of the simplest unsupervised learning algorithms that solves the well known clustering
problem. The goal of a clustering algorithm is to split the instances of the dataset into k clusters, where
each instance is assigned to the closest cluster as defined by a distance function. The main idea is to
define k centers (centroids), one for each cluster. The algorithm defines an iterative process, where the
following two steps take part at each iteration: (i) take each instance belonging to the dataset and assign
it to the nearest centroid, and (ii) re-calculate the centroids of each of the k clusters. Thus, the k centroids
change their location step by step until no more changes are done.
More formally, suppose that we are given a dataset X = {x1 , x2 , . . . , xm }, where each xi ∈ Rn .
The goal of the k-means algorithm is to group the data into k cohesive clusters, where k in as input
parameter of the algorithm. Algorithm 1 gives the pseudocode of k-means.
In the algorithm above, k is a parameter of the algorithm and corresponds to the number of clusters
we want to find; the cluster centroids µj represent our current guesses for the positions of the centers of
the clusters. To initialize the cluster centroids (in step 1 of the algorithm), we could choose k training ex-
amples randomly, and set the cluster centroids to be equal to the values of these k examples. Of course,
other initialization methods are also possible, such as the kmeans++ technique1 . To find the closest
centroid, a distance (or similarity) function should be defined, and typically the Euclidean distance is
used.
Based on this notion of similarity, the problem of clustering can be reduced to the problem of finding
appropriate centroids. This, in turn, can be expressed as the task of minimizing the following objective
1 Wikipedias lemma for k-means++: http://en.wikipedia.org/wiki/K-means++.
1
Algorithm 1 k-Means Clustering Algorithm
Input: Dataset X = {x1 , x2 , . . . , xm }, where each xi ∈ Rn and parameter k
Output: Clusters C1 , C2 , . . . , Ck (i.e., cluster assignments of each instance C = {c1 , c2 , . . . , cm })
1: Initialize cluster centroids µ1 , µ2 , . . . , µk by choosing k instances of X randomly
2: repeat
3: Assign each instance xi ∈ X to the closest centroid, i.e., ci = arg minj kxi − µj k
1 P
4: Re-compute the centroids µ1 , µ2 , . . . , µk of each cluster based on µj = x, where Cj , j =
nj x∈Cj
1, . . . , k the j-th cluster and nj the size of the j-th cluster
5: until Centroids do not change (convergence)
function:
k
X X
E(k) = kxi − µj k. (1)
j=1 xi ∈Cj
Thus, minimizing Eq. (1) is to determine suitable centroids µj such that, if the data is partitioned into
corresponding clusters Cj , distances between data points and their closest cluster centroid become as
small as possible.
The convergence of k-means algorithm is highly dependent on the initialization of the centroids.
Although the algorithm can converge, this may be to a local minimum of the objective function of Eq.
(1). One way to overcome this problem is by executing the algorithm several times, with different
initializations of the centroids.
Another issue is how to set parameter k, i.e., how to determine the number of clusters of the dataset.
Intuitively, increasing k without penalty, will always reduce the amount of error in the resulting cluster-
ing, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals
the number of data points, m). One such method is known as the elbow rule2 . The idea is to examine and
compare the sum of squared error (SSE) given in Eq. (1) for a number of cluster solutions. In general,
as the number of clusters increases, the SSE should decrease because clusters are, by definition, smaller.
A plot of the SSE against a series of sequential cluster levels (i.e., different values) can be helpful here.
That is, an appropriate cluster solution could be defined as the one where the reduction in SSE slows
dramatically. This produces an ”elbow” in the plot of SSE against the different values of k.
2.1 Pipeline of the task
Vizualization of the dataset
Next we describe the pipeline of task contained in the kmeans/main.py 2.0
1.5
Python script. The goal here is to apply k-means on two datasets. The 1.0
0.5
first one is an artificial dataset where the data points form four dis-
2nd dimension
0.0
tinct clusters, similar to the one shown in Fig. 1. The second one is 0.5
1.0
the MNIST handwritten digits dataset that has been also used in the 1.5
2.0
supervised learning labs. The basic difference here is that we do not
2.5
3 2 1 0 1 2 3
take into account the class labels. We use a modified version of the 1st dimension
dataset (similar to the one used in Lab 5). We have applied PCA on Figure 1: Example of artificial
the data and we keep the first 8 principal components. We also keep a dataset.
sample of the data consisting of 1000 instances. Thus, the size of dataset X is 1000 × 8. Our goal is to
apply k-means clustering on X.
2 Description of the elbow rule can be found in http://www.mattpeeples.net/kmeans.html.
2
For both datasets, initially we load the data and for illustration purposes, we visualize them. For
example, in the case of the MNIST dataset we perform the steps shown below.
# Number o f i n s t a c e s and number o f p r i n c i p a l c o m p o n e n t s ( f e a t u r e s )
n i n s t a n c e s = 1000
pca features = 8
# Get t h e l a b e l s o f e a c h d i g i t
images , l a b e l s m n i s t = r e a d d a t a s e t ( n i n s t a n c e s , p c a f e a t u r e s ) ;
# Create the d a t a s e t ( data mnist ) t h a t w i l l be used in c l u s t e r i n g
# l o a d t h e PCA f e a t u r e s o f t h e t e s t d a t a s e t
d a t a m n i s t = a r r a y ( l i s t ( csv . r e a d e r ( open ( ” t e s t d a t a . csv ” , ” rb ” ) , d e l i m i t e r = ’ , ’ ) ) ) . a s t y p e ( ’ f l o a t ’ )
data mnist = data mnist [ : n instances , : pca features ] # only 8 f i r s t f e a t u r e s a r e k e p t
Then, we run k-means algorithm for different values of k. We also plot the data (two dimensions) based
on clustering results produced by the algorithm.
# Run k−means a l g o r i t h m f o r d i f f e r e n t v a l u e s o f k
k = 10
l a b e l s p r e d m n i s t = kmeans ( da t a m n is t , k )
2.2 Tasks to be done
• Fill in the code of the kmeans() function in the kmeans.py file, based on Algorithm 1.
• Run the k-means algorithm for different values of k for the two datasets and examine the quality
of the produced clusters.
• Use the elbow rule described above to determine the number of clusters k by examining the sum
of squared error (SSE) for the clusters produced for different values of k.
3 Spectral Clustering
Spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the
data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix
is provided as an input and consists of a quantitative assessment of the relative similarity of each pair
of points in the dataset.
Given a set of data points x1 , . . . , xm , ∀xi ∈ Rn and some notion of similarity sij between all pairs of
data points xi and xj , the intuitive goal of clustering is to divide the data points into several groups such
that points in the same group are similar and points in different groups are dissimilar to each other. If
we do not have more information than similarities between data points, a nice way of representing the
data is in form of the similarity graph G = (V, E). Each vertex vi in this graph represents a data point
xi . Two vertices are connected if the similarity sij between the corresponding data points xi and xj is
positive or larger than a certain threshold, and the edge is weighted by sij . The problem of clustering
can now be reformulated using the similarity graph: we want to find a partition of the graph such
that the edges between different groups have very low weights (which means that points in different
clusters are dissimilar from each other) and the edges within a group have high weights (which means
that points within the same cluster are similar to each other).
3
How to create a similarity graph
There are several popular constructions to transform a given set x1 , . . . , xm , ∀xi ∈ Rn of data points with
pairwise similarities sij or pairwise distances dij into a graph. When constructing similarity graphs the
goal is to model the local neighborhood relationships between the data points.
• k-Nearest Neighbors graph. Here the goal is to connect vertex vi with vertex vj if vj is among the
k-nearest neighbors of vi . However, this definition leads to a directed graph, as the neighborhood
relationship is not symmetric. The most common way to deal with this, is to simply ignore the
directions of the edges; that is, we connect vi and vj with an undirected edge if vi is among the
k-nearest neighbors of vj or if vj is among the k-nearest neighbors of vi . The resulting graph is
what is usually called the k-nearest neighbors graph.
• The fully connected graph. Here we simply connect all points with positive similarity with
each other, and we weight all edges by sij . As the graph should represent the local neighbor-
hood relationships, this construction is only useful if the similarity function itself models local
neighborhoods. An example for such a similarity function is the Gaussian similarity function
s(xi , xj ) = exp(kxi − xj k2 )/(2σ 2 ), where the parameter σ controls the width of the neighbor-
hoods.
The algorithm
Next we describe the pseudocode of the spectral clustering algorithm.
Algorithm 2 Spectral Clustering
Input: Dataset X = {x1 , x2 , . . . , xm }, where each xi ∈ Rn and parameter k
Output: Clusters C1 , C2 , . . . , Ck (i.e., cluster assignments of each instance C = {c1 , c2 , . . . , cm })
1: Construct the similarity graph G using one of the ways described above. Let W be the adjacency
matrix of this graph.
2: Compute the Laplacian matrix L = D − W. Matrix D corresponds to the diagonal degree matrix of
graph G (i.e., degree of each node vi (= number of neighbors) in the main diagonal).
3: Apply eigenvalue decomposition to the Laplacian matrix L and compute the eigenvectors that cor-
respond to k smallest eigenvalues. Let U = [u1 |u2 | . . . |uk ] ∈ Rm×k be the matrix containing these
eigenvectors as columns.
4: For i = 1, . . . , m, let yi ∈ Rk be the vector corresponding to the i-th row of U. Apply k-means to the
points (yi )i=1,...,m (i.e., the rows of U) and find clusters C1 , C2 , . . . , Ck .
In spectral clustering, the data is projected into a lower-dimensional space (the spectral/eigenvector
domain) where they are easily separable, say using k-means. 6
So, what is the reason to apply spectral clustering (in the
4
similarity data matrix) and not applying directly k-means
to the initial data? Typically, k-means algorithm is inter- 2
esting in finding compact clusters of convex shape, while 0
x2
on the other hand spectral clustering methods are trying
2
to identify connectivity patterns in the similarity graph. In
4
many cases, we are interested in finding clusters that are
non-convex and in this case the k-means algorithm does not 6
6 4 2 0 2 4 6
x1
behave well. Figure 2 shows an example of a dataset where
the ”natural” clusters in R2 do not correspond to convex Figure 2: Example of a dataset.
4
compact regions. Applying k-means to this dataset will extract the clusters shown in Fig. 3 (a). On the
other hand, as shown in Fig. 3 (b), applying spectral clustering, we are able to find non-convex clusters
with good connectivity properties.
6 6
4 4
2 2
0 0
x2
x2
2 2
4 4
6 6
6 4 2 0 2 4 6 6 4 2 0 2 4 6
x1 x1
(a) k-means (b) Spectral clustering
Figure 3: Results using k-means and spectral clustering algorithms.
3.1 Pipeline of the task
Next we describe the pipeline of the task contained in the spectral clustering/main.py file. The
goal is to apply spectral clustering on the artificial data shown in Fig. 2. Initially, we call the function
generateData() contained in the generateData.py file, to create the artificial data. Then, we use
the build-in implementation of Python for the k-means algorithm to cluster this dataset (you can also
use your own implementation) and we plot the results.
# Number o f c l u s t e r s
k = 3
# C l u s t e r u s i n g kmeans
c e n t r o i d s , l a b e l s = kmeans2 ( data , k )
Then, we create the similarity graph, finding the N closest neighbors of each data instance, using the Eu-
clidean distance. Notice that, after finding the closest neighbors using the findClosestNeighbours()
function, we can directly form the adjacency matrix W of the graph. Here we create a binary (0 or 1)
adjacency matrix (if two points are neighbors, we add the corresponding edge with weight 1).
N = 10
c l o s e s t N e i g h b o u r s = f i n d C l o s e s t N e i g h b o u r s ( data , N)
# Create adjacency matrix
W = z e r o s ( ( data . shape [ 0 ] , data . shape [ 0 ] ) )
f o r i in range ( data . shape [ 0 ] ) :
f o r j in range ( n ) :
W[ i , c l o s e s t N e i g h b o u r s [ i , j ] ] = 1
W[ c l o s e s t N e i g h b o u r s [ i , j ] , i ] = 1
Having the similarity graph (described by the adjacency matrix W), we can apply the spectral clustering
algorithm, finding the underlying clustering structure.
# Perform s p e c t r a l c l u s t e r i n g
5
l a b e l s = s p e c t r a l C l u s t e r i n g (W, k )
3.2 Tasks to be done
• Fill in the code of the spectralClustering() function in the spectralClustering.py file
to implement the spectral clustering algorithm as described in Algorithm 2. Note that, the adja-
cency matrix W (step 1 of the algorithm) has been already created.
• Run the algorithm and reproduce the clustering results shown in Fig. 3 (b).
References
[1] Christopher M. Bishop. ”Pattern Recognition and Machine Learning”. Springer-Verlag New York,
Inc., 2006.
[2] Tom M. Mitchell. ”Machine learning”. Burr Ridge, IL: McGraw Hill 45, 1997.
[3] Ulrike Von Luxburg. ”A tutorial on spectral clustering”. Statistics and computing, Springer, 2007.