Artificial Intelligence (CSC9YE)
Hierarchical Clustering
Gabriela Ochoa
gabriela.ochoa@stir.ac.uk
Hierarchical Clustering
- K-means clustering requires us to pre-specify the number of clusters K. This can be a disadvantage.
- Hierarchical clustering does not require that we commit to a particular choice of K.
- Hierarchical clustering algorithms produce a hierarchy, which can be visualised as a dendrogram (tree).
- The dendrogram helps us to decide on the number of clusters K.
- Two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down).
Types of Hierarchical Clustering
- Agglomerative (bottom-up, agglomerative nesting)
  - Each example starts as a single-element cluster (leaf).
  - At each step, the two most similar clusters are combined into a new, bigger cluster (node).
  - Iterate until all examples are members of a single big cluster (root).
- Divisive (top-down, divisive analysis)
  - Start with a single cluster (root) containing all examples.
  - At each step, the most heterogeneous cluster is divided into two.
  - Iterate until every example is in its own cluster.
Measuring Similarity
- We need a way of measuring the similarity between two clusters.
- As we discussed for K-means, we can measure the (dis)similarity of observations using a distance measure, such as the Euclidean distance.
- How do we measure the dissimilarity between two clusters of observations?
- A number of different cluster agglomeration methods (i.e., linkage methods) have been proposed:
  - Minimum or single linkage
  - Maximum or complete linkage
  - Mean or average linkage
  - Inter-cluster variance (Ward)
Linkage Methods
- The linkage method determines the metric used for merging clusters: at each step, the two clusters that are closest under that metric are merged.
- All pairwise distances between the observations in cluster A and the observations in cluster B are computed.
- Single: minimal inter-cluster distance. The shortest distance between an observation in one cluster and an observation in the other.
- Complete: maximal inter-cluster distance. The largest distance between an observation in one cluster and an observation in the other.
- Average: mean inter-cluster distance. The average of all pairwise distances between observations in the two clusters.
- Ward: inter-cluster variance. Merges the pair of clusters that gives the smallest increase in the total within-cluster sum of squared differences.
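To make the four criteria concrete, here is a minimal sketch assuming NumPy and Euclidean distances; the example clusters A and B are invented for illustration, and the Ward function computes the increase in within-cluster sum of squares caused by a merge, which is one common formulation of Ward's criterion.

import numpy as np

def pairwise(A, B):
    # all Euclidean distances between observations in cluster A and cluster B
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single(A, B):         # minimal inter-cluster distance
    return pairwise(A, B).min()

def complete(A, B):       # maximal inter-cluster distance
    return pairwise(A, B).max()

def average(A, B):        # mean inter-cluster distance
    return pairwise(A, B).mean()

def ward_increase(A, B):  # growth in within-cluster sum of squares if A and B were merged
    sse = lambda X: ((np.asarray(X, float) - np.mean(X, axis=0)) ** 2).sum()
    return sse(list(A) + list(B)) - sse(A) - sse(B)

A = [[0, 0], [1, 0]]      # two made-up 2-D clusters
B = [[4, 0], [5, 0]]
print(single(A, B), complete(A, B), average(A, B), ward_increase(A, B))  # 3.0 5.0 4.0 16.0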
Agglomerative Clustering
Pseudo-code
Input: set of examples without labels
- Let each example form one cluster. For N examples, this means creating N clusters, each containing a single example.
- Find the pair of clusters with the smallest cluster-to-cluster distance.
- Merge these two clusters into one, reducing the total number of clusters by one (N − 1 after the first merge).
- Iterate until all examples are members of a single big cluster (as sketched below).
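As a rough sketch of this loop (not code from the module), the function below runs naive single-linkage agglomeration on a precomputed N x N distance matrix and records every merge; the quadratic search at each step keeps it simple rather than efficient.

def agglomerate_single(D):
    # D: symmetric N x N matrix (list of lists or array) of pairwise distances
    N = len(D)
    clusters = [[i] for i in range(N)]   # each example starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]            # merge the two closest clusters
        merges.append((d, clusters[a], clusters[b]))  # record the merge and its height
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges  # list of (merge distance, cluster, cluster), in merge order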
Agglomerative Clustering
Intuitive simple example
[Figure: a small example showing the raw data (left) and the corresponding dendrogram (right).]
Real-world Dataset
Violent Crime Rates by US State
Statistics, in arrests per 100,000 residents, in each of the 50 US
states in 1973.
- Murder: Murder arrests (per 100,000)
- Assault: Assault arrests (per 100,000)
- UrbanPop: Percent of urban population
- Rape: Rape arrests (per 100,000)
Dendrogram after Agglomerative Clustering
Violent Crime Rates by US State
- Each leaf corresponds to one observation.
- As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused higher up.
- The vertical axis indicates dissimilarity: the greater the height at which two branches fuse, the less similar the observations are.
How many clusters?
Violent Crime Rates by US State
- The height at which we cut the dendrogram controls the number of clusters obtained.
- It plays the same role as K in K-means clustering.
- In order to identify sub-groups (i.e. clusters), we need to cut the dendrogram. Here is an example of cutting the tree into 4 groups (a code sketch follows).
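As a minimal sketch of this cut, assuming SciPy and that the crime statistics have been loaded into a pandas DataFrame called crime with one row per state and the four columns above (the file name USArrests.csv is hypothetical, and the choice of complete linkage is an assumption, not fixed by the slides):

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

crime = pd.read_csv("USArrests.csv", index_col=0)   # Murder, Assault, UrbanPop, Rape

Z = linkage(crime.values, method="complete")         # agglomerative clustering
groups = fcluster(Z, t=4, criterion="maxclust")      # cut the tree into 4 groups

dendrogram(Z, labels=crime.index.to_list())          # leaves labelled with state names
plt.ylabel("Height (dissimilarity)")
plt.show()

print(dict(zip(crime.index, groups)))                # state -> group id (1 to 4)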
Agglomerative Clustering
Synthetic small dataset
[Figure: scatter plot of the ten synthetic observations, labelled 1 to 10.]
Hierarchical Clustering
Synthetic small dataset
Distance Matrix
1 2 3 4 5 6 7 8 9 10
1 0.0
2 1.0 0.0
3 3.6 2.8 0.0
4 4.0 3.0 2.2 0.0
5 6.0 5.0 3.6 2.0 0.0
6 6.1 5.1 4.2 2.2 1.0 0.0
7 10.0 9.2 6.4 7.2 6.3 7.3 0.0
8 8.6 7.8 5.0 5.8 5.1 6.1 1.4 0.0
9 8.6 8.1 5.4 7.1 7.1 8.1 3.2 2.8 0.0
10 5.4 5.1 3.2 5.4 6.4 7.2 6.1 5.0 3.6 0.0
Clusters: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}
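To check the worked example by machine, here is a sketch assuming SciPy that feeds this distance matrix to single-linkage clustering; the merge heights it reports should match the d values in the steps that follow.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([
    [ 0.0,  1.0,  3.6,  4.0,  6.0,  6.1, 10.0,  8.6,  8.6,  5.4],
    [ 1.0,  0.0,  2.8,  3.0,  5.0,  5.1,  9.2,  7.8,  8.1,  5.1],
    [ 3.6,  2.8,  0.0,  2.2,  3.6,  4.2,  6.4,  5.0,  5.4,  3.2],
    [ 4.0,  3.0,  2.2,  0.0,  2.0,  2.2,  7.2,  5.8,  7.1,  5.4],
    [ 6.0,  5.0,  3.6,  2.0,  0.0,  1.0,  6.3,  5.1,  7.1,  6.4],
    [ 6.1,  5.1,  4.2,  2.2,  1.0,  0.0,  7.3,  6.1,  8.1,  7.2],
    [10.0,  9.2,  6.4,  7.2,  6.3,  7.3,  0.0,  1.4,  3.2,  6.1],
    [ 8.6,  7.8,  5.0,  5.8,  5.1,  6.1,  1.4,  0.0,  2.8,  5.0],
    [ 8.6,  8.1,  5.4,  7.1,  7.1,  8.1,  3.2,  2.8,  0.0,  3.6],
    [ 5.4,  5.1,  3.2,  5.4,  6.4,  7.2,  6.1,  5.0,  3.6,  0.0],
])  # the symmetric 10 x 10 version of the matrix above

Z = linkage(squareform(D), method="single")   # condensed distances, single linkage
print(Z[:, 2])   # merge heights, expected: 1.0 1.0 1.4 2.0 2.2 2.8 2.8 3.2 3.6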
Hierarchical Clustering
Single (minimal) distance
[Figure: scatter plot of the ten observations.]
Hierarchical Clustering
The merge history so far (d: merge distance, k: number of clusters):
d = 0.0, k = 10: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}. Start with each observation as its own cluster.
d = 1.0, k = 8: {1, 2}, {3}, {4}, {5, 6}, {7}, {8}, {9}, {10}. Merge {1} with {2} and {5} with {6}, since these are the closest pairs: d(1,2) = 1.0 and d(5,6) = 1.0.
Hierarchical Clustering
Distance Matrix: unchanged (the pairwise distances between the original observations stay the same throughout).
Clusters: {1, 2}, {3}, {4}, {5, 6}, {7}, {8}, {9}, {10}
Hierarchical Clustering
[Figure: scatter plot of the ten observations at this step of the clustering.]
Hierarchical Clustering
Next step in the merge history (earlier rows unchanged):
d = 1.4, k = 7: {1, 2}, {3}, {4}, {5, 6}, {7, 8}, {9}, {10}. Merge {7} and {8} since they are the closest: d(7,8) = 1.4.
Hierarchical Clustering
Distance Matrix: unchanged (see above).
Clusters: {1, 2}, {3}, {4}, {5, 6}, {7, 8}, {9}, {10}
Hierarchical Clustering
[Figure: scatter plot of the ten observations at this step of the clustering.]
Hierarchical Clustering
Next step in the merge history (earlier rows unchanged):
d = 2.0, k = 6: {1, 2}, {3}, {4, 5, 6}, {7, 8}, {9}, {10}. Merge {4} and {5, 6} since 4 and 5 are the closest: d(4,5) = 2.0.
Hierarchical Clustering
Distance Matrix: unchanged (see above).
Clusters: {1, 2}, {3}, {4, 5, 6}, {7, 8}, {9}, {10}
Hierarchical Clustering
[Figure: scatter plot of the ten observations at this step of the clustering.]
Hierarchical Clustering
Next step in the merge history (earlier rows unchanged):
d = 2.2, k = 5: {1, 2}, {3, 4, 5, 6}, {7, 8}, {9}, {10}. Merge {3} and {4, 5, 6} since 3 and 4 are the closest: d(3,4) = 2.2.
Hierarchical Clustering
Distance Matrix: unchanged (see above).
Clusters: {1, 2}, {3, 4, 5, 6}, {7, 8}, {9}, {10}
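The next merge involves two multi-element clusters, so it is a convenient place to illustrate the single-linkage rule directly. This snippet reuses the matrix D from the earlier sketch (observations numbered from 1) and checks that the closest pair of current clusters is {1, 2} and {3, 4, 5, 6}, tied with {7, 8} and {9}, both at distance 2.8.

def single_distance(D, A, B):
    # smallest distance between any observation in A and any observation in B (1-based labels)
    return min(D[i - 1][j - 1] for i in A for j in B)

print(single_distance(D, {1, 2}, {3, 4, 5, 6}))   # 2.8, via d(2,3)
print(single_distance(D, {7, 8}, {9}))            # 2.8, via d(8,9)
print(single_distance(D, {3, 4, 5, 6}, {10}))     # 3.2, so this pair is not merged yet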
Hierarchical Clustering
[Figure: scatter plot of the ten observations at this step of the clustering.]
Hierarchical Clustering
Next step in the merge history (earlier rows unchanged):
d = 2.8, k = 3: {1, 2, 3, 4, 5, 6}, {7, 8, 9}, {10}. Merge {1, 2} with {3, 4, 5, 6} and {7, 8} with {9}, since 2 and 3, and 8 and 9, are the closest pairs: d(2,3) = 2.8 and d(8,9) = 2.8.
Hierarchical Clustering
Distance Matrix: unchanged (see above).
Clusters: {1, 2, 3, 4, 5, 6}, {7, 8, 9}, {10}
Hierarchical Clustering
[Figure: scatter plot of the ten observations at this step of the clustering.]
Hierarchical Clustering
Next step in the merge history (earlier rows unchanged):
d = 3.2, k = 2: {1, 2, 3, 4, 5, 6, 10}, {7, 8, 9}. Merge {1, 2, 3, 4, 5, 6} and {10} since 3 and 10 are the closest: d(3,10) = 3.2.
Hierarchical Clustering
Distance Matrix: unchanged (see above).
Clusters: {1, 2, 3, 4, 5, 6, 10}, {7, 8, 9}
Hierarchical Clustering
[Figure: scatter plot of the ten observations at this step of the clustering.]
Hierarchical Clustering
The complete merge history (d: merge distance, k: number of clusters):
d = 0.0, k = 10: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}. Start with each observation as its own cluster.
d = 1.0, k = 8: {1, 2}, {3}, {4}, {5, 6}, {7}, {8}, {9}, {10}. Merge {1} with {2} and {5} with {6}: d(1,2) = 1.0 and d(5,6) = 1.0.
d = 1.4, k = 7: {1, 2}, {3}, {4}, {5, 6}, {7, 8}, {9}, {10}. Merge {7} and {8}: d(7,8) = 1.4.
d = 2.0, k = 6: {1, 2}, {3}, {4, 5, 6}, {7, 8}, {9}, {10}. Merge {4} and {5, 6}: d(4,5) = 2.0.
d = 2.2, k = 5: {1, 2}, {3, 4, 5, 6}, {7, 8}, {9}, {10}. Merge {3} and {4, 5, 6}: d(3,4) = 2.2.
d = 2.8, k = 3: {1, 2, 3, 4, 5, 6}, {7, 8, 9}, {10}. Merge {1, 2} with {3, 4, 5, 6} and {7, 8} with {9}: d(2,3) = 2.8 and d(8,9) = 2.8.
d = 3.2, k = 2: {1, 2, 3, 4, 5, 6, 10}, {7, 8, 9}. Merge {1, 2, 3, 4, 5, 6} and {10}: d(3,10) = 3.2.
d = 3.6, k = 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Merge the remaining two clusters: d(9,10) = 3.6.
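Any row of this history can be recovered from the SciPy sketch given with the first distance matrix, reusing the linkage matrix Z built there; for example, cutting the tree into k = 3 clusters should reproduce the partition found at d = 2.8.

from scipy.cluster.hierarchy import fcluster

labels = fcluster(Z, t=3, criterion="maxclust")   # one cluster id per observation, in input order 1..10
print(labels)   # observations 1-6, 7-9 and 10 should receive three different ids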
Hierarchical Clustering
Single Linkage Cluster Dendrogram
[Figure: single-linkage dendrogram of the ten observations, with the merge heights on the vertical axis.]
Hierarchical Clustering
[Figure: dendrograms of the same ten observations under single, complete and average linkage; the merge heights on the vertical axes differ between the three methods.]
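The comparison can be reproduced with SciPy and Matplotlib, reusing the 10 x 10 matrix D entered earlier; this is an illustrative sketch, not the deck's original figure.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

condensed = squareform(D)                      # D: the distance matrix from the worked example
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, method in zip(axes, ["single", "complete", "average"]):
    dendrogram(linkage(condensed, method=method), labels=list(range(1, 11)), ax=ax)
    ax.set_title(method.capitalize() + " Linkage")
    ax.set_ylabel("Height")
plt.tight_layout()
plt.show()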
Conclusions
- Hierarchical clustering seeks to build a hierarchy of clusters.
- We can decide on the number of clusters after exploring the dendrogram.
- Unsupervised learning is important for understanding the variation and grouping structure of a set of unlabeled data, and can be a useful pre-processor for supervised learning.
- It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test set accuracy).