Clustering 2
Clustering
Clustering depends on the selected features: x ∈ ℜ^n, where n is the number of features.
[Figure: processing pipeline, Real World → Sensing → Preprocessing → Feature Extraction (Computation) → Clustering]
Notion of a Cluster can be Ambiguous
Clustering algorithms
o Clustering algorithms come in 2 basic flavors:
• Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree

Clustering Methods
• Hierarchical
• Partitioning
• Density-based
Partitioning Clustering
K-means Clustering
Limitations of K-means: Non-globular Shapes
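Below is a minimal sketch (not from the original slides) illustrating this limitation with scikit-learn; the two-moons dataset and all parameter values are assumptions chosen for the example.

# Sketch: k-means on a non-globular ("two moons") dataset.
# K-means prefers roughly spherical clusters, so it cuts each moon in half.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Agreement with the true moon labels is poor because the clusters are not globular.
print("Adjusted Rand Index:", adjusted_rand_score(y_true, labels))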
Hierarchical Clustering
• Produces a set of nested clusters
organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the
sequences of merges or splits
[Figure: example dendrogram over six points; the vertical axis gives the merge distance (0 to 0.2)]
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or
there are k clusters)
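As a rough sketch (not part of the original slides), agglomerative clustering and the dendrogram can be produced with SciPy; the toy data and the choice of average linkage are assumptions.

# Sketch: agglomerative hierarchical clustering with SciPy.
# linkage() starts from single points and repeatedly merges the closest pair of clusters;
# dendrogram() draws the tree of merges; fcluster() cuts the tree into k flat clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
               rng.normal(loc=3.0, scale=0.3, size=(10, 2))])

Z = linkage(X, method="average")                  # the sequence of merges
flat = fcluster(Z, t=2, criterion="maxclust")     # stop at k = 2 clusters

dendrogram(Z)
plt.show()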
How do we define inter-cluster similarity?
[Figure: proximity matrix over points p1…p5 and the corresponding dendrogram; after some merges we have two clusters and must measure their similarity]
• How to measure the distance between clusters is relevant (MIN, MAX, mean, etc.)
o MIN
o MAX
o Group Average
o Distance Between Centroids
Other methods are driven by an objective function, e.g. Ward linkage:
• Similarity based on the increase in SSE when two clusters are merged
• Similar to group average (uses all points)
• The hierarchical-clustering analogue of k-means (both minimize SSE)
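A brief sketch (not from the slides): the options above map onto the linkage parameter of scikit-learn's AgglomerativeClustering ("single" = MIN, "complete" = MAX, "average" = group average, plus "ward"); the blob data and the silhouette comparison are assumptions.

# Sketch: comparing linkage criteria on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for linkage in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, round(silhouette_score(X, labels), 3))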
Density-based Clustering
DBSCAN
Density Definition
[Figure: ε-neighborhoods of the points p and q]
• ε-Neighborhood of p, ε-Neighborhood of q
• Density of p is “high” (MinPts = 4)
• Density of q is “low” (MinPts = 4)
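A small sketch (not in the original) of this density test: count the points inside each ε-neighborhood with a radius query; ε, MinPts and the random data are assumptions.

# Sketch: a point is a core (high-density) point if its eps-neighborhood
# contains at least MinPts points (the point itself included).
import numpy as np
from sklearn.neighbors import NearestNeighbors

eps, min_pts = 0.5, 4
rng = np.random.default_rng(0)
X = rng.random((50, 2))

nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)

is_core = np.array([len(idx) >= min_pts for idx in neighborhoods])
print("core points:", int(is_core.sum()), "of", len(X))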
Example
ε = 10, MinPts = 4
Density-reachability
• Directly density-reachable
• An object q is directly density-reachable from object p
if p is a core object and q is in p's ε-neighborhood.
Density-reachability
[Figure: chain of points q, p1, p2, p with MinPts = 7]
• p is (indirectly) density-reachable from q
• q is not density-reachable from p
DBSCAN Algorithm: Example
• Parameters: ε = 2 cm, MinPts = 3
DBSCAN Algorithm
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
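The same procedure is available as scikit-learn's DBSCAN; below is a compact sketch (not from the slides), where the two-moons data, eps and min_samples are assumptions.

# Sketch: DBSCAN in scikit-learn. Core objects grow clusters of density-reachable
# points; objects not density-reachable from any core object are labelled -1 (NOISE).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))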
Clustering Validation
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e.,
distinguishing whether non-random structure actually exists
in the data.
2. Comparing the results of a cluster analysis to externally
known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the
data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster
analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
Clustering Validation
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
External Index: Used to measure the extent to which cluster
labels match externally supplied class labels.
Entropy
Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
Sum of Squared Error (SSE)
Relative Index: Used to compare two different clusterings or
clusters.
Often an external or internal index is used for this function, e.g., SSE or
entropy
Sometimes these are referred to as criteria instead of indices
However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
A good clustering: intra-cluster distances are minimized, inter-cluster distances are maximized.
Internal Measures: Silhouette Coefficient
The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates to a standard concept of a
cluster.
Selecting the number of clusters with silhouette analysis on KMeans clustering: in this example the silhouette analysis is used to choose an optimal value for n_clusters.
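A short sketch of that analysis (not part of the original text): compute the mean silhouette score for several candidate values of n_clusters and keep the best; the blob data and the candidate range are assumptions.

# Sketch: using the mean silhouette score to choose n_clusters for k-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, random_state=1, n_init=10).fit_predict(X)
    score = silhouette_score(X, labels)   # in [-1, 1]; values near 0 suggest overlapping clusters
    print(n_clusters, round(score, 3))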
Clustering Validation
Calinski-Harabasz Index (if the ground truth labels are not known)

If the ground truth labels are not known, the Calinski-Harabasz index (sklearn.metrics.calinski_harabasz_score), also known as the Variance Ratio Criterion, can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with better defined clusters.

For k clusters, the Calinski-Harabasz score s is given as the ratio of the between-clusters dispersion mean and the within-cluster dispersion:

    s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{N - k}{k - 1}

where B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined by:

    W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T
    B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^T

with N the number of points in our data, C_q the set of points in cluster q, c_q the center of cluster q, c the center of the data, and n_q the number of points in cluster q.

In normal usage, the Calinski-Harabasz index is applied to the results of a cluster analysis:

>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabasz_score(X, labels)
561.62...
Advantages
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
• The score is fast to compute.
Drawbacks
• The Calinski-Harabasz index is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.

References
• Caliński, T., & Harabasz, J. (1974). "A dendrite method for cluster analysis". Communications in Statistics - Theory and Methods 3: 1-27. doi:10.1080/03610926.2011.560741.
Clustering Validation
Davies-Bouldin Index (if the ground truth labels are not known)

If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used to evaluate the model, where a lower Davies-Bouldin index relates to a model with better separation between the clusters. Zero is the lowest possible score; values closer to zero indicate a better partition.

The index is defined as the average similarity between each cluster C_i (for i = 1, ..., k) and its most similar one C_j. In the context of this index, similarity is defined as a measure R_ij that trades off:
• s_i, the average distance between each point of cluster i and the centroid of that cluster, also known as the cluster diameter
• d_ij, the distance between cluster centroids i and j

A simple choice to construct R_ij so that it is nonnegative and symmetric is:

    R_{ij} = \frac{s_i + s_j}{d_{ij}}

The Davies-Bouldin index is then the average, over all clusters, of the similarity to the most similar cluster:

    DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}

In normal usage, the Davies-Bouldin index is applied to the results of a cluster analysis as follows:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davies_bouldin_score
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans.labels_
>>> davies_bouldin_score(X, labels)
0.6619...
Clustering Validation
Internal Measures: SSE (if the ground truth labels are not known)

Evaluating partitions by the Sum of Squared Error (SSE)
• Internal Index: used to measure the goodness of a clustering structure without respect to external information.

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

where x is a data point in cluster Ci and mi is the representative point for cluster Ci.

• SSE is good for comparing two clusterings or two clusters (average SSE).
  o Given two partitions, we can choose the one with the smallest error.
• Clusters in more complicated figures aren't well separated.
• SSE can also be used to estimate the number of clusters:
  o For deciding the cluster number, look at the knee in the SSE-vs-K curve.
  o Still: a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.

[Figure: an example dataset and its SSE plotted against the number of clusters K; the knee of the curve suggests the number of clusters]
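A minimal sketch of the knee heuristic (not from the slides): with Euclidean distance and m_i the cluster centroid, k-means' inertia_ attribute is exactly the SSE above; the blob data and the range of K are assumptions.

# Sketch: SSE ("inertia" in scikit-learn) as a function of K; look for the knee.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    # inertia_ = sum of squared distances of samples to their closest centroid (the SSE)
    print(k, round(km.inertia_, 1))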
Fowlkes-Mallows Index (FMI): Drawbacks
Contrary to inertia, FMI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
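For completeness, a tiny sketch (not in the original) of the FMI computation this refers to, assuming ground-truth labels are available (here the iris species labels):

# Sketch: the Fowlkes-Mallows index compares a clustering against known class labels.
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

iris = datasets.load_iris()
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(iris.data)

print("FMI:", round(fowlkes_mallows_score(iris.target, labels), 3))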
References
• E. B. Fowlkes and C. L. Mallows, 1983. "A method for comparing two hierarchical clusterings". Journal of the American Statistical Association. http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
• Wikipedia entry for the Fowlkes-Mallows Index
Internal Measures: Silhouette Coefficient
• The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
• We can calculate the average silhouette width for a single cluster or for a whole clustering.
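A short sketch (not from the slides) of that per-cluster average using scikit-learn's silhouette_samples; the blob data are an assumption.

# Sketch: per-sample silhouette values, averaged per cluster and over the whole clustering.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

s = silhouette_samples(X, labels)
for c in np.unique(labels):
    print("cluster", c, "average silhouette width:", round(float(s[labels == c].mean()), 3))
print("overall silhouette coefficient:", round(float(s.mean()), 3))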
Statistical testing of validity indices (Theodoridis and Koutroubas, 1999)
[Figure 3: confidence interval for (a) a two-tailed index, (b) a right-tailed index, (c) a left-tailed index, where q_ρ^p is the ρ proportion of q under hypothesis H0]
Assuming that this shape is right-tailed (figure 3b) and that we have generated the scatter-plot using r values of the index q, called q_i, in order to accept or reject the Null Hypothesis H0 we examine the following conditions (Theodoridis and Koutroubas, 1999):