Unsupervised Learning
Dr. Dinesh Kumar Vishwakarma
Professor,
Department of Information Technology, Delhi Technological University,
Delhi-110042
Outline
◼ Basic concepts
◼ K-means algorithm
◼ Representation of clusters
◼ Hierarchical clustering
◼ Which clustering algorithm to use?
◼ Cluster evaluation
◼ Summary
Dinesh K. Vishwakarma 2
Supervised vs. unsupervised learning
◼ Supervised learning: discover patterns in
the data that relate data attributes with a
target (class) attribute.
✓ These patterns are then utilized to predict the values
of the target attribute in future data instances.
◼ Unsupervised learning: The data have no
target attribute.
✓ We want to explore the data to find some intrinsic
structures in them.
Dinesh K. Vishwakarma 3
Clustering
◼ Clustering is a technique for finding similarity
groups in data, called clusters, i.e.,
❑ it groups data instances that are similar to (near) each other into
one cluster and data instances that are very different (far
away) from each other into different clusters.
◼ Clustering is often called an unsupervised learning
task because no class values denoting an a priori
grouping of the data instances are given, as they are
in supervised learning.
◼ In fact, association rule mining is also
unsupervised.
Dinesh K. Vishwakarma 4
What are Clusters?
◼ The organization of unlabeled
data into similarity groups
called ‘clusters’.
◼ A cluster is a collection of data
items that are “similar” to one
another and “dissimilar” to
data items in other clusters.
◼ The data set has three natural
groups of data points, i.e., 3
natural clusters.
Dinesh K. Vishwakarma 5
What is clustering for?
◼ Example 1: group people of similar sizes
together to make “small”, “medium” and
“large” T-shirts.
❑ Tailor-made for each person: too expensive
❑ One-size-fits-all: does not fit all.
◼ Example 2: In marketing, segment
customers according to their similarities
❑ To do targeted marketing.
❑ Residential Area
Dinesh K. Vishwakarma 6
What is clustering for? (cont…)
◼ Example 3: Given a collection of text
documents, we want to organize them
according to their content similarities,
❑ To produce a topic hierarchy
◼ In fact, clustering is one of the most utilized
data mining techniques.
❑ It has a long history and has been used in almost every
field, e.g., medicine, psychology, botany, sociology,
biology, archeology, marketing, insurance, libraries, etc.
❑ In recent years, due to the rapid increase of online
documents, text clustering has become important.
Dinesh K. Vishwakarma 7
What is clustering for? (cont…)
◼ Computer Vision
Dinesh K. Vishwakarma 8
What do we need for Clustering?
◼ Proximity measures between two data
points (𝒙𝒊 , 𝒙𝒋 ).
❑ Similarity measure $s(\mathbf{x}_i, \mathbf{x}_j)$: large if $\mathbf{x}_i$ and $\mathbf{x}_j$ are similar.
❑ Dissimilarity measure (distance) $d(\mathbf{x}_i, \mathbf{x}_j)$: small if $\mathbf{x}_i$ and $\mathbf{x}_j$ are similar.
(Figure: a dissimilar pair has large d and small s; a similar pair has small d and large s.)
◼ A criterion function to evaluate a clustering.
(Figure: a good clustering vs. a bad clustering.)
Dinesh K. Vishwakarma 9
Distance Measurement
◼ Consider two data points $\mathbf{x}_i$ and $\mathbf{x}_j$, each with $m$ attributes.
◼ The distance between two points can be defined by the Minkowski distance:
❑ $d_p(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|^{p} \right)^{1/p}$, where $p$ is a positive integer.
◼ Euclidean distance ($p = 2$):
❑ $d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}}$; translation invariant.
◼ Manhattan (city-block) distance ($p = 1$):
❑ $d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|$; cheaper to compute.
Dinesh K. Vishwakarma 10
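To make the formulas concrete, here is a minimal Python/NumPy sketch of the three distances; the function names are illustrative and not part of the slides.

```python
import numpy as np

def minkowski(xi, xj, p):
    """Minkowski distance d_p between two equal-length vectors."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    return float(np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))

def euclidean(xi, xj):
    """Special case p = 2 (translation invariant)."""
    return minkowski(xi, xj, 2)

def manhattan(xi, xj):
    """Special case p = 1 (city-block distance, cheaper to compute)."""
    return minkowski(xi, xj, 1)

print(euclidean([168, 60], [185, 72]))   # ~20.81
print(manhattan([168, 60], [185, 72]))   # 29.0
```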
Clustering E.g.
◼ Consider a data table with nine instances $X_1, X_2, \dots, X_9$.
◼ Clustering groups them into three clusters:
❑ C1 = {X7, X8}
❑ C2 = {X1, X3, X5, X6}
❑ C3 = {X2, X4, X9}
Dinesh K. Vishwakarma 11
Aspects of Clustering
◼ A clustering algorithm
❑ Partitional clustering
❑ Hierarchical clustering
◼ A distance (similarity, or dissimilarity)
function
◼ Clustering quality
❑ Inter-cluster distance maximized
❑ Intra-cluster distance minimized
◼ The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
Dinesh K. Vishwakarma 12
Hierarchical vs Partitional Clustering
◼ A distinction among different types of clustering
is whether the set of clusters is ‘nested’ or
‘unnested’.
◼ A partitional clustering is simply a division of
the set of data objects into non-overlapping
subsets (clusters) such that each data object is
in exactly one subset.
◼ A hierarchical clustering is a set of nested
clusters that are organized as a tree.
Dinesh K. Vishwakarma 13
K-means clustering
◼ K-means is a partitional clustering
algorithm.
◼ Let the set of data points (or instances) D be
{x1, x2, …, xn},
❑ where xi = (xi1, xi2, …, xir) is a vector in a real-valued
space $X \subseteq \mathbb{R}^r$, and r is the number of attributes
(dimensions) in the data.
◼ The k-means algorithm partitions the given
data into k clusters.
❑ Each cluster has a cluster center, called centroid.
❑ k is specified by the user.
Dinesh K. Vishwakarma 14
K-means algorithm
◼ Given k, the k-means algorithm works as
follows:
1) Randomly choose k data points (seeds) to be the
initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster
memberships.
4) If a convergence criterion is not met, go to 2).
Dinesh K. Vishwakarma 15
K-means algorithm – (cont …)
◼ Algorithm k-means(k, D)
1 Choose k data points as the initial centroids (cluster centres)
2 repeat
3 for each data point 𝒙 ∈ 𝑫 do
4 compute the distance from x to each centroid;
5 assign 𝒙 to the closest centroid // a centroid represents a cluster
6 endfor
7 re-compute the centroids using the current cluster memberships;
8 until the stopping criterion is met
Dinesh K. Vishwakarma 16
Stopping/convergence criterion
1. No (or minimum) re-assignments of data points
to different clusters,
2. No (or minimum) change of centroids, or
3. Minimum decrease in the sum of squared error
(SSE),
$SSE = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} dist(\mathbf{x}, \mathbf{m}_j)^{2}$  (1)
❑ Cj is the jth cluster, mj is the centroid of cluster Cj
(the mean vector of all the data points in Cj), and
dist(x, mj) is the distance between data point x and
centroid mj.
Dinesh K. Vishwakarma 17
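Putting the algorithm and the stopping criterion together, here is a minimal NumPy sketch (function name, defaults, and the empty-cluster handling are my own choices, not from the slides); it uses the SSE of Eq. (1) as the convergence test.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Basic k-means on an (n, r) array X, following the algorithm above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1: random seeds
    prev_sse = np.inf
    for _ in range(max_iter):
        # step 2: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: re-compute centroids from the current memberships
        # (an empty cluster keeps its old centroid to avoid NaNs)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
        # step 4: stop when the SSE of Eq. (1) no longer decreases noticeably
        sse = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centroids, sse
```

Calling `kmeans(data, k=3)` on an (n, 2) array would produce a flat clustering like the ones illustrated on the following slides.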
An example
(Figure) Iteration 1: (A) random selection of k centres, (B) cluster assignment, (C) re-compute centroids.
Iteration 2: (D) cluster assignment, (E) re-compute centroids. Iteration 3: (F) cluster assignment.
Dinesh K. Vishwakarma 18
Example
◼ Step 1: Randomly initialize cluster centres.
◼ Step 2: Determine the cluster membership of each input.
Dinesh K. Vishwakarma 19
Example…
◼ Step 3: Re-estimate cluster centres.
◼ Result of the first iteration.
Dinesh K. Vishwakarma 20
Example…
◼ Second iteration.
◼ Result of the second iteration.
Dinesh K. Vishwakarma 21
Example of K-Means Clustering
◼ Consider a data table of 12 points:

Point  Height (X)  Weight (Y)
1      185         72
2      170         56
3      168         60
4      179         68
5      182         72
6      188         77
7      180         71
8      180         70
9      183         84
10     180         88
11     180         67
12     177         76

◼ Consider two clusters (K1, K2) with initial centroids K1 = (185, 72) and K2 = (170, 56).
◼ Euclidean distances for data point 3, (168, 60):
❑ to K1: $\sqrt{(168-185)^2 + (60-72)^2} \approx 20.81$
❑ to K2: $\sqrt{(168-170)^2 + (60-56)^2} \approx 4.47$
❑ Hence, data point 3 belongs to K2.
◼ New centroid for K2: $\left(\frac{170+168}{2}, \frac{60+56}{2}\right) = (169, 58)$
◼ After assigning all points: K2 = {2, 3}, K1 = {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
Dinesh K. Vishwakarma 22
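The arithmetic on this slide can be checked with a few lines of Python, using the data and initial centroids given above.

```python
import numpy as np

data = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
                 [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]], float)
k1, k2 = np.array([185.0, 72.0]), np.array([170.0, 56.0])   # initial centroids

p3 = data[2]                                                 # data point 3: (168, 60)
print(np.linalg.norm(p3 - k1), np.linalg.norm(p3 - k2))     # ~20.81 and ~4.47 -> K2

# Assign every point to the nearer centroid
labels = ['K1' if np.linalg.norm(p - k1) <= np.linalg.norm(p - k2) else 'K2' for p in data]
print(labels)   # K2 for points 2 and 3, K1 for all the others
```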
Strengths of k-means
◼ Strengths:
❑ Simple: easy to understand and to implement
❑ Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
❑ Since both k and t are usually small, k-means is considered a linear
algorithm.
◼ K-means is the most popular clustering algorithm.
◼ Note that: it terminates at a local optimum if SSE is
used. The global optimum is hard to find due to
complexity.
Dinesh K. Vishwakarma 23
Weaknesses of k-means
◼ The algorithm is only applicable if the mean
is defined.
❑ For categorical data, k-modes is used: the centroid is
represented by the most frequent values.
◼ The user needs to specify k.
◼ The algorithm is sensitive to outliers
❑ Outliers are data points that are very far away from
other data points.
❑ Outliers could be errors in the data recording or
some special data points with very different values.
Dinesh K. Vishwakarma 24
Weaknesses of k-means…
◼ Problems with outliers
(Figure) (A) An undesirable clustering caused by an outlier; (B) the ideal clustering.
Dinesh K. Vishwakarma 25
Weaknesses of k-means...
◼ To deal with outliers
◼ One method is to remove some data points in the
clustering process that are much further away from
the centroids than other data points.
❑ To be safe, we may want to monitor these possible outliers
over a few iterations and then decide to remove them.
◼ Another method is to perform random sampling.
Since in sampling we only choose a small subset
of the data points, the chance of selecting an
outlier is very small.
❑ Assign the rest of the data points to the clusters by distance or
similarity comparison, or classification
Dinesh K. Vishwakarma 26
Weaknesses of k-means (cont …)
◼ The algorithm is sensitive to initial seeds: different
seeds can lead to very different (good or poor) results.
Dinesh K. Vishwakarma 27
Weaknesses of k-means (cont …)
◼ Special Data Structures: The k-means
algorithm is not suitable for discovering clusters
that are not hyper-ellipsoids (or hyper-spheres).
(Figure) Two natural clusters vs. the clusters found by k-means.
Dinesh K. Vishwakarma 28
K-means summary
◼ Despite its weaknesses, k-means is still the
most popular algorithm due to its simplicity and
efficiency.
❑ Other clustering algorithms have their own lists of
weaknesses.
◼ No clear evidence that any other clustering
algorithm performs better in general
❑ although they may be more suitable for some
specific types of data or applications.
◼ Comparing different clustering algorithms is
a difficult task. No one knows the correct
clusters!
Dinesh K. Vishwakarma 29
Outline
◼ Basic concepts
◼ K-means algorithm
◼ Representation of clusters
◼ Hierarchical clustering
◼ Which clustering algorithm to use?
◼ Cluster evaluation
◼ Summary
Dinesh K. Vishwakarma 30
Common ways to represent clusters
◼ Use the centroid of each cluster to represent
the cluster.
❑ Compute the radius and the standard deviation of the
cluster to determine its spread in each dimension.
❑ The centroid representation alone works well if the
clusters are of hyper-spherical shape.
❑ If clusters are elongated or are of other shapes,
centroids are not sufficient.
Dinesh K. Vishwakarma 31
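As a small sketch of this representation (the function name and the choice of "radius" as the maximum distance to the centroid are assumptions, not from the slides), given the points of one cluster:

```python
import numpy as np

def describe_cluster(points):
    """Centroid, radius, and per-dimension standard deviation of one cluster.
    'Radius' is taken here as the maximum distance to the centroid (one common choice)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    radius = float(np.linalg.norm(points - centroid, axis=1).max())
    spread = points.std(axis=0)                 # spread in each dimension
    return centroid, radius, spread
```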
Using classification model
◼ All the data points in a
cluster are regarded as
having the same class
label, e.g., the cluster
ID.
❑ Run a supervised learning
algorithm on the data to
find a classification model, e.g.:
$x \le 2 \rightarrow$ cluster 1
$x > 2, y > 1.5 \rightarrow$ cluster 2
$x > 2, y \le 1.5 \rightarrow$ cluster 3
Dinesh K. Vishwakarma 32
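A sketch of this idea using scikit-learn (assuming it is installed; the toy points and cluster IDs below are made up to mimic the slide's figure, not taken from it):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy points laid out roughly like the slide's figure, labelled with their cluster IDs
X = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 1.5],     # cluster 1 (x <= 2)
              [3.0, 2.0], [3.5, 2.5], [4.0, 3.0],     # cluster 2 (x > 2, y > 1.5)
              [3.0, 1.0], [3.5, 0.5], [4.0, 1.0]])    # cluster 3 (x > 2, y <= 1.5)
cluster_ids = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

tree = DecisionTreeClassifier(max_depth=2).fit(X, cluster_ids)
print(export_text(tree, feature_names=["x", "y"]))    # recovers rules like those on the slide
```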
Use frequent values to represent cluster
◼ This method is mainly for clustering of
categorical data (e.g., k-modes clustering).
◼ Main method used in text clustering, where a
small set of frequent words in each cluster is
selected to represent the cluster.
Dinesh K. Vishwakarma 33
Clusters of arbitrary shapes
◼ Hyper-elliptical and hyper-
spherical clusters are usually
easy to represent, using their
centroid together with spreads.
◼ Irregular shape clusters are hard
to represent. They may not be
useful in some applications.
❑ Using centroids is not suitable (upper
figure) in general.
❑ K-means clusters may be more useful
(lower figure), e.g., for making two sizes of T-
shirts.
Dinesh K. Vishwakarma 34
Outline
◼ Basic concepts
◼ K-means algorithm
◼ Representation of clusters
◼ Hierarchical clustering
◼ Which clustering algorithm to use?
◼ Cluster evaluation
◼ Summary
Dinesh K. Vishwakarma 35
Hierarchical Clustering
◼ So far, a ‘flat’ clustering has been considered.
◼ For some data, hierarchical clustering is
more appropriate than a flat one.
Hierarchical
Clustering
Dinesh K. Vishwakarma 36
Why Hierarchical Clustering?
◼ It does not assume a particular value of 𝑘, as
needed by 𝑘-means clustering.
◼ The generated tree may correspond to a
meaningful taxonomy.
◼ Only a distance or “proximity” matrix is
needed to compute the hierarchical clustering.
Dinesh K. Vishwakarma 37
Why Hierarchical Clustering? E.g.
Dinesh K. Vishwakarma 38
Types of hierarchical clustering
◼ Agglomerative (bottom up) clustering: It builds the
dendrogram (tree) from the bottom level, and
❑ merges the most similar (or nearest) pair of clusters
❑ stops when all the data points are merged into a single cluster
(i.e., the root cluster).
◼ Divisive (top down) clustering: It starts with all data
points in one cluster, the root.
❑ Splits the root into a set of child clusters. Each child cluster is
recursively divided further
❑ stops when only singleton clusters of individual data points
remain, i.e., each cluster with only a single point
Dinesh K. Vishwakarma 39
Agglomerative approach
Initialization: each object is a cluster.
Iteration: merge the two clusters that are most similar to each other,
until all objects are merged into a single cluster (bottom-up).
(Figure) Steps 0-4: {a}, {b}, {c}, {d}, {e} → {a, b}, {d, e} → {c, d, e} → {a, b, c, d, e}.
Dinesh K. Vishwakarma 40
Divisive Approaches
Initialization: all objects stay in one cluster.
Iteration: select a cluster and split it into two sub-clusters,
until each leaf cluster contains only one object (top-down).
(Figure) Steps 0-4: {a, b, c, d, e} → {a, b}, {c, d, e} → {d, e} → singletons {a}, {b}, {c}, {d}, {e}.
Dinesh K. Vishwakarma 41
Dendrogram
◼ A binary tree that shows how clusters are
merged/split hierarchically
◼ Each node on the tree is a cluster; each leaf
node is a singleton cluster
(Figure) A dendrogram; cutting it at different heights gives different numbers of clusters, e.g., 2 or 3.
Dinesh K. Vishwakarma 42
Agglomerative Algorithms
Dinesh K. Vishwakarma 43
How to Merge Clusters?
◼ How to measure the distance between
clusters?
◼ Single-link
◼ Complete-link
Distance?
◼ Average-link
◼ Centroid distance
Hint: Distance between clusters is usually
defined on the basis of distance between
objects.
Dinesh K. Vishwakarma 44
How to Define Inter-Cluster Distance
◼ Single-link: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
❑ The distance between two clusters is represented by the
distance of the closest pair of data objects belonging to
different clusters.
Dinesh K. Vishwakarma 45
How to Define Inter-Cluster Distance
◼ Complete-link: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
❑ The distance between two clusters is represented by the
distance of the farthest pair of data objects belonging to
different clusters.
Dinesh K. Vishwakarma 46
How to Define Inter-Cluster Distance
◼ Average-link: $d_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\, q \in C_j} d(p, q)$
❑ The distance between two clusters is represented by the
average distance of all pairs of data objects belonging to
different clusters.
Dinesh K. Vishwakarma 47
How to Define Inter-Cluster Distance
◼ Centroid distance: $d_{mean}(C_i, C_j) = d(\mathbf{m}_i, \mathbf{m}_j)$, where $\mathbf{m}_i$ and $\mathbf{m}_j$ are the means of $C_i$ and $C_j$.
❑ The distance between two clusters is represented by the
distance between the means of the clusters.
Dinesh K. Vishwakarma 48
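The four inter-cluster distances can be written directly from their definitions; a brute-force NumPy sketch (A and B are arrays holding the points of two clusters; the function names are illustrative):

```python
import numpy as np

def pairwise(A, B):
    """All pairwise Euclidean distances between points of clusters A and B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):    return pairwise(A, B).min()    # distance of the closest pair
def complete_link(A, B):  return pairwise(A, B).max()    # distance of the farthest pair
def average_link(A, B):   return pairwise(A, B).mean()   # average over all pairs
def centroid_dist(A, B):                                 # distance between the means
    return np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0))
```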
E.g. Agglomerative Hierarchical Clustering
◼ For the following data set, we will get
different clustering results with the single-
link and complete-link algorithms.
(Figure) Result of the single-link algorithm on six points, with dendrogram leaf order 1, 3, 4, 5, 2, 6.
(Figure) Result of the complete-link algorithm on the same points, with dendrogram leaf order 1, 3, 2, 4, 5, 6.
Dinesh K. Vishwakarma 49
Hierarchical Clustering: Comparison
(Figure) Comparison of the clusters produced by single-link, complete-link, average-link, and centroid distance on the same six points.
Dinesh K. Vishwakarma
Comparison of Dendrograms
(Figure) The corresponding dendrograms for single-link, complete-link, average-link, and centroid distance.
Dinesh K. Vishwakarma 51
E.G. of Clustering
Dinesh K. Vishwakarma 52
E.G. of Clustering using Single Link
Dinesh K. Vishwakarma 53
E.G. of Clustering using Single Link
Dinesh K. Vishwakarma 54
E.G. of Clustering using Single Link
Dinesh K. Vishwakarma 55
E.G. of Clustering using Single Link
Dinesh K. Vishwakarma 56
Agglomerative clustering
It is more popular than divisive methods.
◼ At the beginning, each data point forms a
cluster (also called a node).
◼ Merge nodes/clusters that have the least
distance.
◼ Go on merging
◼ Eventually all nodes belong to one cluster
Dinesh K. Vishwakarma 57
Agglomerative clustering algorithm
Dinesh K. Vishwakarma 58
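In practice, agglomerative clustering is rarely coded by hand; a minimal sketch using SciPy (assuming scipy is available; the toy data below are made up, and the commented plot needs matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Made-up toy data: three well-separated groups of two points each
X = np.array([[1.0, 1.0], [1.2, 1.1],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 1.0], [9.2, 1.1]])

Z = linkage(X, method='single')                    # also: 'complete', 'average', 'centroid'
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 clusters
print(labels)                                      # e.g. [1 1 2 2 3 3]

# dendrogram(Z)   # draws the tree
```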
The complexity
◼ All these algorithms are at least O(n²), where n is the
number of data points.
◼ Single link can be done in O(n²).
◼ Complete and average links can be done in
O(n² log n).
◼ Due to this complexity, they are hard to use for large
data sets.
❑ Sampling
❑ Scale-up methods (e.g., BIRCH).
Dinesh K. Vishwakarma 59
Outline
◼ Basic concepts
◼ K-means algorithm
◼ Representation of clusters
◼ Hierarchical clustering
◼ Which clustering algorithm to use?
◼ Cluster evaluation
◼ Summary
Dinesh K. Vishwakarma 60
How to choose a clustering algorithm
◼ Clustering research has a long history. A vast
collection of algorithms are available.
❑ We only introduced a few key algorithms.
◼ Choosing the “best” algorithm is a challenge.
❑ Every algorithm has limitations and works well with certain
data distributions.
❑ It is very hard, if not impossible, to know what distribution the
application data follow. The data may not fully follow any “ideal”
structure or distribution required by the algorithms.
❑ One also needs to decide how to standardize the data, to
choose a suitable distance function and to select other
parameter values.
Dinesh K. Vishwakarma 61
Choose a clustering algorithm (cont …)
◼ Due to these complexities, the common practice is
to
❑ run several algorithms using different distance functions and
parameter settings, and
❑ then carefully analyze and compare the results.
◼ The interpretation of the results must be based on
insight into the meaning of the original data
together with knowledge of the algorithms used.
◼ Clustering is highly application dependent and to
certain extent subjective (personal preferences).
Dinesh K. Vishwakarma 62
Outline
◼ Basic concepts
◼ K-means algorithm
◼ Representation of clusters
◼ Hierarchical clustering
◼ Which clustering algorithm to use?
◼ Cluster evaluation
◼ Summary
Dinesh K. Vishwakarma 63
Cluster Evaluation: hard problem
◼ The quality of a clustering is very hard to
evaluate because
❑ We do not know the correct clusters
◼ Some methods are used:
❑ User inspection
◼ Study centroids, and spreads
◼ Rules from a decision tree.
◼ For text documents, one can read some documents in
clusters.
Dinesh K. Vishwakarma 64
Cluster evaluation: ground truth
◼ We use some labeled data (for
classification)
◼ Assumption: Each class is a cluster.
◼ After clustering, a confusion matrix is
constructed. From the matrix, we compute
various measurements, entropy, purity,
precision, recall and F-score.
❑ Let the classes in the data D be C = (c1, c2, …, ck).
The clustering method produces k clusters, which
divides D into k disjoint subsets, D1, D2, …, Dk.
Dinesh K. Vishwakarma 65
Evaluation measures: Entropy
Dinesh K. Vishwakarma 66
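A commonly used definition, consistent with the setup on the previous slide (classes $c_1, \dots, c_k$ and clusters $D_1, \dots, D_k$):

$entropy(D_i) = -\sum_{j=1}^{k} P_i(c_j)\, \log_2 P_i(c_j)$, $\qquad entropy_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\, entropy(D_i)$,

where $P_i(c_j)$ is the proportion of class $c_j$ data points in cluster $D_i$. Lower entropy indicates purer clusters.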
Evaluation measures: purity
Dinesh K. Vishwakarma 67
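Similarly, a commonly used definition of purity:

$purity(D_i) = \max_{j} P_i(c_j)$, $\qquad purity_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\, purity(D_i)$.

Higher purity indicates that each cluster is dominated by a single class.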
An example
◼ Assume we have a text collection D of 900 documents from three
topics (or three classes), Science, Sports, and Politics. Each class has
300 documents, and each document in D is labeled with one of the
topics (classes). We use this collection to perform clustering to find
three clusters. Class/topic labels are not used in clustering. After
clustering, we want to measure the effectiveness of the clustering
algorithm.
Dinesh K. Vishwakarma 68
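With the definitions above, the evaluation reduces to a few lines of Python. The confusion-matrix counts below are hypothetical (they sum to 900 like the collection described above, but are not taken from the slide):

```python
import numpy as np

# Rows: clusters 1..3; columns: document counts for Science, Sports, Politics (hypothetical)
conf = np.array([[250,  20,  10],
                 [ 30, 260,  40],
                 [ 20,  20, 250]])

p = conf / conf.sum(axis=1, keepdims=True)      # P_i(c_j) for each cluster
weights = conf.sum(axis=1) / conf.sum()         # |D_i| / |D|

entropy = -np.sum(np.where(p > 0, p * np.log2(p), 0.0), axis=1)
purity = p.max(axis=1)

print("total entropy:", float(weights @ entropy))
print("total purity:", float(weights @ purity))
```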
A remark about ground truth evaluation
◼ Commonly used to compare different clustering
algorithms.
◼ A real-life data set for clustering has no class
labels.
❑ Thus, although an algorithm may perform very well on some
labeled data sets, there is no guarantee that it will perform well on
the actual application data at hand.
◼ The fact that it performs well on some labeled data
sets does give us some confidence in the quality of
the algorithm.
◼ This evaluation method is said to be based on
external data or information.
Dinesh K. Vishwakarma 69
Evaluation based on internal information
◼ Intra-cluster cohesion (compactness):
❑ Cohesion measures how near the data points in a
cluster are to the cluster centroid.
❑ Sum of squared error (SSE) is a commonly used
measure.
◼ Inter-cluster separation (isolation):
❑ Separation means that different cluster centroids
should be far away from one another.
◼ In most applications, expert judgments are
still the key.
Dinesh K. Vishwakarma 70
Indirect evaluation
◼ In some applications, clustering is not the primary
task, but used to help perform another task.
◼ We can use the performance on the primary task to
compare clustering methods.
◼ For instance, in an application, the primary task is
to provide recommendations on book purchasing
to online shoppers.
❑ If we can cluster books according to their features, we might
be able to provide better recommendations.
❑ We can evaluate different clustering algorithms based on how
well they help with the recommendation task.
❑ Here, we assume that the recommendation can be reliably
evaluated.
Dinesh K. Vishwakarma 71
An example
Dinesh K. Vishwakarma 72
Summary
◼ Clustering has a long history and is still active
❑ There are a huge number of clustering algorithms
❑ More are still coming every year.
◼ We only introduced several main algorithms. There
are many others, e.g.,
❑ density-based algorithms, subspace clustering, scale-up
methods, neural network based methods, fuzzy clustering,
co-clustering, etc.
◼ Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being
devised every year.
◼ Clustering is highly application dependent and to
some extent subjective.
Dinesh K. Vishwakarma 73