Clustering Algorithms
Spring 2025, BSSE F221
Dr. Umara Zahid
Introduction
• Unsupervised Learning Approach
• A cluster is a subset of data which are similar
• Clustering divides the dataset into groups such as:
• The members of each group are as similar as possible to one another
• Different groups are as dissimilar as possible from one another
• Clustering can uncover previously undetected relationships in a
dataset
• Applications of cluster analysis:
• In businesses discover and characterize customers’ segments for marketing
• In biosciences, cluster analysis is used for distinguishing plants and animals
Clustering
Clustering Algorithms
Hierarchical Partitive
Self Organizing
Agglomerative Divisive K-means
Maps
Requirements of a Good Clustering
Method
• The ability to discover some or all of the hidden clusters
• Within cluster similarity and between cluster dissimilarity
• Ability to deal with various types of attributes
• Can deal with noise and outliers
• Can handle high dimensionality
• Scalable, Interpretable, and usable
Similarity Measuring Methods
• To determine similarities among
objects/ points distance functions are
used (as we used in Knn)
• A distance function returns a lower
value for pairs of objects that are more
similar to one another
Hierarchical Clustering
• Clusters with predetermined ordering from top to bottom
• Two kinds: Divisive and Agglomerative
Intuitive Explanation
Divisive
• Assign all the observations to a single
cluster and then partition the cluster into
two least similar clusters
• Proceed recursively on each cluster until
there is one cluster for each observation
Agglomerative
• Assign each observation to its own cluster
• Compute similarity among individual
clusters and join them based on small
distance
• Repeat until all data points are joined in a
single cluster
Proximity Matrix
• Before clustering is performed, it is required to determine the
proximity matrix containing the distance between each point using a
distance function
• Then, the matrix is updated to display the distance between each
cluster
• The following methods differ in how the distance between each
cluster is measured:
• Single Linkage
• Complete Linkage
• Average Linkage
• Minimum Variance (Ward’s method)
• Centroid Method
Single Linkage
• The distance between two clusters is defined as the shortest distance
between two points in each cluster
• The distance between cluster “Blue” and “Red” is equal to the length
of the arrow between their two closest points
Complete Linkage
• The distance between two clusters is defined as the Longest distance
between two points in each cluster
• The distance between cluster “r” and “s” is equal to the length of the
arrow between their two furthest points
Average Linkage
• The distance between two clusters is defined as the average distance
between each point in one cluster to every point in another cluster
• The distance between clusters “r” and “s” is equal to the average
length of each arrow connecting the points of one cluster to the other
Ward’ and Centroid Methods
• Ward’s Method (Minimum Variance)
• The error sum of squares between the two clusters over all of the data points
is used as the distance
• Centroid Method
• The center of the data points is used to determine the average distance
between clusters
K-means Clustering
• The k-means algorithm is an unsupervised machine learning technique used for
clustering similar data points together. The algorithm partitions a dataset into k
clusters, where each cluster represents a group of data points that are similar to each
other in some way.
• The intuitive working of the k-means algorithm can be summarized in the following
steps:
1. Initialization: Choose the number of clusters (k) and randomly initialize the centroids
of each cluster. A centroid is simply the mean of all the data points in the cluster.
2. Assignment: Assign each data point to the closest centroid based on their distance
to the centroid. The distance can be calculated using a variety of distance measures,
such as Euclidean distance or Manhattan distance.
3. Recalculation: Recalculate the centroids of each cluster based on the mean of all the
data points assigned to it.
4. Repeat: Repeat steps 2 and 3 until the centroids no longer move or a specified
number of iterations is reached.
A working Example
• To illustrate this algorithm, consider the following example of a
dataset consisting of points in a two-dimensional space:
[(2, 3), (3, 5), (2, 4), (7, 6), (8, 7), (9, 6), (11, 8), (12, 9)]
• Let’s say we want to cluster these points into two groups (k=2). We
can start by randomly initializing the centroids of each cluster, for
example:
• centroid1 = (2, 3)
• centroid2 = (7, 6)
Continued…
• Next, we assign each data point to the closest centroid based on their distance. In this
case, we can calculate the Euclidean distance between each point and each centroid:
• (2, 3) centroid1: 0, centroid2: 5.8
• (3, 5) centroid1: 2.2, centroid2: 4.1
• (2, 4) centroid1: 1, centroid2: 5.3
• (7, 6) centroid1: 5.8, centroid2: 0
• (8, 7) centroid1: 7.2, centroid2: 1.4
• (9, 6) centroid1: 7.6, centroid2: 2
• (11, 8) centroid1: 10.29, centroid2: 4.4
• (12, 9) centroid1: 11.6, centroid2: 5.8
• Based on these distances, we can assign each point to the closest centroid:
• Cluster 1: [(2, 3), (3, 5), (2, 4)]
• Cluster 2: [(7, 6), (8, 7), (9, 6), (11, 8), (12, 9)]
Continued…
• Next, we recalculate the centroids of each cluster based on the mean
of all the data points assigned to it:
• centroid1 = (2.33, 4)
• centroid2 = (9.4, 7.2)
• We then repeat the assignment and recalculation steps until the
centroids no longer move or a specified number of iterations is
reached.
• In k-means clustering, the centroid of a cluster is the mean of all the
data points assigned to that cluster. It is entirely possible that, after
recalculating the centroid, the new centroid does not exist in the
original dataset.
Is this a problem?
• No, this is expected behavior in standard k-means. The centroid does
not need to be an actual data point; it simply represents the average
location of the cluster. The algorithm still works by iterating until the
centroids stabilize.
Why does this happen?
• Centroid is an average: The centroid is computed as the mean of all
points in a cluster. The mean of multiple points may result in a value
that is not present in the dataset.
• Continuous feature space: If your data points are in a continuous
space (e.g., real numbers), the mean will almost always be a new
point not present in the original dataset.
• Non-uniform distributions: If data points are widely scattered, the
centroid may lie in an empty region of the space.
What if we want centroids to be
data points?
• k-Medoids (PAM - Partitioning Around Medoids): Instead of taking
the mean, this selects a representative data point as the cluster
center.
• k-Modes (for categorical data): Uses the most frequent categorical
value instead of the mean.
• Modified k-means (K-means++): You can modify centroid selection by
restricting it to the nearest actual data point.