CSE3506 - Essentials of Data Analytics
Facilitator: Dr Sathiya Narayanan S
Assistant Professor (Senior)
School of Electronics Engineering (SENSE), VIT-Chennai
Email: sathiyanarayanan.s@vit.ac.in
Handphone No.: +91-9944226963
Winter Semester 2020-21
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 1 / 17
Suggested Readings
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani,
“An Introduction to Statistical Learning with Applications in R”,
Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).
Alpaydin Ethem, “ Introduction to Machine Learning”, 3rd Edition,
PHI Learning Private Limited, 2019.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 2 / 17
Contents
1 Module 3: Clustering
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 3 / 17
Module 3: Clustering
Topics to be covered in Module-3
Introduction to Clustering
K -Means Clustering
K -Medoids Clustering
Hierarchical Clustering
Applications of Clustering
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 4 / 17
Module 3: Clustering
Introduction to Clustering
Clustering algorithms group samples/data points/features/objects
into clusters by natural association according to some similarity
measures (say Euclidean distance).
Clustering serves two purposes: (i) understanding the structure in the
data (i.e. data exploration) and (ii) finding the similarities between
instances (data points) and thus grouping them.
After grouping the data points, the groups can be named and their
attributes can be defined (using domain knowledge). This paves the
way for supervised learning. In this case, clustering becomes a part of
preprocessing stage.
In most cases, labelling the data is costly and therefore, preceding a
supervised learning (regression or classification) with unsupervised
learning (clustering) is advantageous.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 5 / 17
Module 5: Clustering
K -Means Clustering
Partitioning based clustering methods group data points based on
their similarity and characteristics.
K -means and K -medoids (partitioning around medoids) are the two
popular partitioning based methods.
K -means algorithm partitions the data points into K clusters and the
value of K should be known apriori (i.e. it needs to be specified
beforehand).
K -means algorithm is based on the minimization of the sum of
squared distances.
The K -means procedure attempts to form clusters such that the
intracluster similarity is high and intercluster similarity is low. The
similarity of a cluster is determined based on the centroid (i.e. mean
value) of the data points in it.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 6 / 17
Module 3: Clustering
K -Means Clustering - Procedure
Step 1: Initialize the iteration count: n = 1; and arbitrarily choose K
samples as initial cluster centres: z1 (n), z2 (n), ..., and zK (n).
Step 2: Distribute the pattern samples x among the K clusters according
to the following rule:
x ∈ Gi (n) if kx − zi (n)k < kx − zj (n)k for j = 1, 2, ..., K ; j 6= i.
Step 3: Compute zi (n + 1) for i = 1, 2, .., K :
1 X
zi (n + 1) = x
Ni
x∈Gi (n)
where Ni is the number of pattern samples assigned to class Gi (n).
Step 4: If ∀i zi (n + 1) = zi (n), then STOP. Otherwise go to Step 2.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 7 / 17
Module 3: Clustering
K -Means Clustering
Advantages of K -means clustering: (i) simple and efficient ; and (ii)
low computational complexity.
Drawbacks: (i) K must be known/decided; and (ii) final clusters
usually depend on the order of presentation of training samples and
the initial cluster centres.
Question 3.1
Apply K -means clustering to cluster the following samples/data points:
(0,0), (0,1), (1,0), (3,3), (5,6), (8,9), (9,8) and (9,9).
Fix K = 2 and choose (0,0) and (5,6) as the initial cluster centres.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 8 / 17
Module 3: Clustering
K -Means Clustering
The ‘elbow method’ (a heuristic approach) for determining K : Plot the
‘explained variation’ (say, the ‘distortion’) as a function of K , and pick the
‘elbow’ or ’knee’ of the curve as the value of K (as shown in Figure 1).
Figure 1: Elbow method using distortion.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 9 / 17
Module 5: Clustering
K -Medoids Clustering
In K -medoids clustering, each cluster is represented by a cluster
medoid which is one among the data points in the cluster.
The medoid of a cluster is defined as a data point in the cluster
whose average dissimilarity to all the other data points in the cluster
is minimal. As ‘medoid’ is the most centrally located point in the
cluster, the cluster representatives can be interpreted in a better way
(compared to K -means).
In K -medoids can use arbitrary dissimilarity measures, whereas
K -means generally requires Euclidean distance for better
performance. In general, K -medoids use Manhattan distance and
minimizes the sum of pairwise dissimilarities.
As in the case of K -means, the value of K needs to be specified
beforehand. An heuristic approach, the ‘silhouette method’ , can be
used for determining the optimal value of K .
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 10 / 17
Module 5: Clustering
K -Medoids Clustering
As the K -medoids clustering problem is NP-hard to solve exactly,
many heuristic approaches/solutions exist. The most common
approach is the Partitioning Around Medoids (PAM) algorithm.
PAM algorithm has 2 phases: BUILD phase and SWAP phase.
The BUILD phase greedily selects K data points from the available
data points and initialize them as cluster medoids.
The SWAP phase associates each data point to the closest medoid
and SWAPS a cluster medoid with a non-medoid data point in the
cluster if the cost of the configuration decreases.
PAM is faster than exhaustive search and being a greedy search, it
may not find the optimum solution.
K -medoids clustering is more robust (i.e. less sensitive to outliers and
noise) compared to K -means.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 11 / 17
Module 3: Clustering
Hierarchical Clustering
Hierarchical clustering procedure groups similar data points into
clusters. It can be performed with either raw data (i.e. data points)
or a similarity matrix. When raw data is provided, a similarity matrix
S should be computed:
1
Si,j =
di,j + 1
where di,j is the Euclidean distance between data points i and j. In
the recent years, many other distance metrics have been developed.
An agglomerative hierarchical clustering is an iterative, 2-step
procedure that starts by considering each data point as a separate
cluster. In each iteration, it executes the following steps: (i)
identifying the two clusters that are closest together, and (ii) merging
the two most similar clusters. This iterative process continues until all
the clusters are merged together.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 12 / 17
Module 3: Clustering
Hierarchical Clustering
Click here for more details (particularly an illustration) on hierachical
clustering.
The dendrogram obtained at the end of hierarchical clustering shows
the hierarchical relationship between the clusters.
After completing the merging step, it is necessary to update the
similarity matrix. The updation can be based on (i) the two most
similar parts of a cluster (single-linkage), (ii) the two least similar bits
of a cluster (complete-linkage), or (iii) the center of the clusters
(mean or average-linkage). Refer Figure 2.
The choice of similarity or distance metric and the choice of linkage
criteria are always application-dependent.
Hierarchical clustering can also be done by initially treating all data
points as one cluster, and then successively splitting them. This
approach is called the divisive hierarchical clustering.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 13 / 17
Module 3: Clustering
Hierarchical Clustering
Figure 2: Three linkage types used in hierarchical clustering. Source:
https://www.dexlabanalytics.com/blog/hierarchical-clustering-foundational-
concepts-and-example-of-agglomerative-clustering
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 14 / 17
Module 3: Clustering
Question 3.2
Consider the similarity matrix given below.
Determine the hierarchy of clusters created by
(a) the single-linkage clustering algorithm, and
(b) the complete-linkage clustering algorithm.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 15 / 17
Module 3: Clustering
Applications of Clustering
Clustering analysis finds its application in market research, image
processing, etc.
Consider Customer Relationship Management for example.
Assume the customers of a company are defined in terms of their
demographic attributes and transactions with the company. If those
customers are grouped into K clusters, then a better understanding of
customer base is possible. Based on this understanding the company
shall adapt different strategies for different types of customers. The
company shall also identify unique customers (those who don’t fall in
any large group) and develop strategies for them. For example,
’churning’ customers who require immediate attention.
In Image Segmentation, clustering can be used to group the pixels
that ‘belong together’ (say, foreground and background pixels).
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 16 / 17
Module 3: Clustering
Module-3 Summary
Clustering: grouping data points based on similarity measures
Partitioning based methods: K -means and K -medoids
The elbow method and the silhouette method: heuristic approaches
for deciding the optimal value of K for partition based methods
Hierarchical clustering: single-linkage, complete-linkage, etc.
Dendogram: shows the hierarchical relationship between the clusters
Applications of clustering: customer relationship management, image
segmentation, etc.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 17 / 17