Clustering 2

This document surveys clustering algorithms. It begins by explaining that clustering depends on the selected features and on the chosen distance and validity measures. It then describes the two main types of clustering algorithm: partitional clustering, which divides the data into non-overlapping clusters, and hierarchical clustering, which produces nested clusters organized as a tree. It focuses on k-means clustering, describing how it works and its limitations with clusters of differing sizes, differing densities, and non-globular shapes, and on hierarchical clustering, explaining agglomerative and divisive approaches and how traditional algorithms use a similarity matrix to merge or split clusters. Later sections introduce density-based clustering (DBSCAN) and measures for validating clustering results.


Clustering

Clustering depends on the selected features.

(Pipeline figure: real world, sensing (signal s), pre-processing (p), feature extraction/computation (x ∈ ℝⁿ), clustering; n = number of features.)

After feature selection, clustering depends on distances & validity measures.

(Same pipeline figure: real world, sensing (s), pre-processing (p), feature extraction/computation (x ∈ ℝⁿ), clustering; n = number of features.)

Notion of a Cluster can be Ambiguous

(Figure: the same set of points interpreted as two clusters, four clusters, or six clusters. How many clusters?)


Clustering Methods

Clustering algorithms come in two basic flavors:
• Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

(Figure: taxonomy of clustering methods, with partitioning, hierarchical and density-based branches.)


Partitioning Clustering

K-means Clustering
o Partitional clustering approach
o Number of clusters, K, must be specified


K-means Clustering – Details
o Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another.
o The centroid is (typically) the mean of the points in the cluster.
o 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
o Most of the convergence happens in the first few iterations.
  • Often the stopping condition is changed to 'until relatively few points change clusters'.
o Complexity is O(n · K · I · d)
  • n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
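The details above translate almost line for line into code. The following is a minimal NumPy sketch of the basic algorithm (random initial centroids, assignment by Euclidean distance, centroid update, stop when no point changes cluster); the data and parameter values are made up for illustration and are not part of the original slides.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    # Minimal k-means sketch; complexity is O(n * K * I * d) as noted above.
    rng = np.random.default_rng(seed)
    # Initial centroids chosen randomly -> clusters vary from one run to another.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed cluster: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of the points assigned to it.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Illustrative run on three Gaussian blobs with K = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])
labels, centroids = kmeans(X, K=3)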


Limitations of K-means: Differing Sizes
(Figure: original points vs. K-means result with 3 clusters.)

Limitations of K-means: Differing Density
(Figure: original points vs. K-means result with 3 clusters.)

Limitations of K-means: Non-globular Shapes
(Figure: original points vs. K-means result with 2 clusters.)


Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

(Figure: a nested clustering of six points and the corresponding dendrogram, with merge heights on the vertical axis.)


Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time

Complexity of hierarchical clustering
• The distance matrix is used for deciding which clusters to merge/split
• At least quadratic in the number of data points
• Not usable for large datasets


Agglomerative Clustering

Agglomerative Clustering Algorithm
o More popular hierarchical clustering technique
o Basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
o Key operation is the computation of the proximity of two clusters

Computation of the proximity matrix
(Figure: proximity matrix over points p1 ... p5; which similarity?)
o How to measure the distance between clusters is relevant (min, max, mean, etc.):
  • MIN
  • MAX
  • Group average
  • Distance between centroids
(Slide: Marta Zorrilla, Universidad de Cantabria, 2006/07.)
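As an illustration of the basic algorithm above, the following sketch uses SciPy's hierarchical clustering routines: the proximity matrix is computed once, linkage() repeatedly merges the two closest clusters, and the merge sequence can be drawn as a dendrogram. The toy points and the choice of 'single' (MIN) linkage are assumptions for the example, not part of the original slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# Six toy 2-D points (illustrative only).
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Step 1: proximity (distance) matrix, in condensed form.
D = pdist(X, metric="euclidean")

# Steps 2-6: start from singleton clusters and repeatedly merge the two
# closest clusters; method="single" corresponds to MIN linkage.
Z = linkage(D, method="single")

# Cut the tree into, e.g., two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# The merge sequence can be drawn as a dendrogram (plotting needs matplotlib).
dendrogram(Z)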

How to Define Inter-Cluster Similarity
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function, e.g. Ward linkage:
  • Similarity based on the increase in SSE when two clusters are merged
  • Similar to group average (uses all points)
  • Hierarchical clustering analogue of k-means
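The linkage choices listed above map onto scikit-learn's AgglomerativeClustering estimator; a brief sketch follows. The name mapping and the generated blob data are assumptions for illustration (the estimator offers single, complete, average and Ward linkage, but not centroid linkage).

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# MIN -> "single", MAX -> "complete", group average -> "average", Ward -> "ward".
for linkage in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, np.bincount(labels))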


Density-based Clustering

DBSCAN

Taken from Jing Gao, SUNY Buffalo.


Density Definition
• ε-Neighborhood: objects within a radius of ε from an object.
  N_ε(p) = {q | d(p, q) ≤ ε}
• "High density": the ε-neighborhood of an object contains at least MinPts objects.

(Figure: ε-neighborhoods of p and q; the density of p is "high" and the density of q is "low" for MinPts = 4.)
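A small NumPy sketch of these two definitions, with arbitrary example values for ε and MinPts:

import numpy as np

def eps_neighborhood(X, p, eps):
    # N_eps(p) = {q | d(p, q) <= eps}, returned as row indices of X.
    d = np.linalg.norm(X - X[p], axis=1)
    return np.where(d <= eps)[0]

def is_high_density(X, p, eps, min_pts):
    # "High density": the eps-neighborhood of p contains at least MinPts objects.
    return len(eps_neighborhood(X, p, eps)) >= min_pts

# Illustrative check with arbitrary parameter values.
X = np.random.default_rng(0).random((100, 2))
print(is_high_density(X, p=0, eps=0.15, min_pts=4))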

Core, Border & Outlier

Outlier Given and MinPts,


categorize the objects into
Border three exclusive groups.

A point is a core point if it has more than a


Core specified number of points (MinPts) within
Eps—These are points that are at the
interior of a cluster.

A border point has fewer than MinPts


= 1unit, MinPts = 5 within Eps, but is in the neighborhood
of a core point.

A noise point is any point that is not a


core point nor a border point.
(noise points or outliers)
5


Example
(Figure: original points and their point types, core, border and outliers, for ε = 10, MinPts = 4.)

Density-reachability
• Directly density-reachable: an object q is directly density-reachable from object p if p is a core object and q is in p's ε-neighborhood.
• Density-reachability is asymmetric.

(Figure, MinPts = 4: q is directly density-reachable from p; p is not directly density-reachable from q.)


Density-reachability
• Density-reachable (directly and indirectly):
  – A point p is directly density-reachable from p2
  – p2 is directly density-reachable from p1
  – p1 is directly density-reachable from q
  – p, p2, p1, q form a chain

(Figure, MinPts = 7: p is (indirectly) density-reachable from q; q is not density-reachable from p.)
DBSCAN Algorithm: Example
• Parameters: ε = 2 cm, MinPts = 3

DBSCAN Algorithm

for each o ∈ D do
  if o is not yet classified then
    if o is a core-object then
      collect all objects density-reachable from o
      and assign them to a new cluster
    else
      assign o to NOISE
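For comparison with the pseudocode, the same procedure is available ready-made in scikit-learn; a short sketch with illustrative parameters (eps plays the role of ε, min_samples the role of MinPts, and the two-moons data set is only an example):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-globular data, the kind of shape k-means struggles with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                    # cluster id per point; -1 marks noise
core_idx = db.core_sample_indices_     # indices of the core points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))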


Example of Clustering Comparison

Clustering Validation


• How good was the clustering process?
• Are the clusters valid?
• Which is the right number of clusters?

Notion of a Cluster can be Ambiguous
(Figure: the same set of points interpreted as two clusters, four clusters, or six clusters. How many clusters?)

Clustering Validation
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e.,
distinguishing whether non-random structure actually exists
in the data.
2. Comparing the results of a cluster analysis to externally
known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the
data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster
analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.


Clustering Validation
Measures of Cluster Validity
 Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types.
 External Index: Used to measure the extent to which cluster
labels match externally supplied class labels.
 Entropy
 Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
 Sum of Squared Error (SSE)
 Relative Index: Used to compare two different clusterings or
clusters.
 Often an external or internal index is used for this function, e.g., SSE or
entropy
 Sometimes these are referred to as criteria instead of indices
 However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.

Clustering Validation

What is Cluster Analysis?
o Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.)

The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

2.3.10.5.2. Drawbacks

The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

Examples: https://scikit-learn.org/stable/modules/clustering.html

Selecting the number of clusters with silhouette analysis on KMeans clustering: in this example the silhouette analysis is used to choose an optimal value for n_clusters.
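A condensed sketch of that kind of silhouette analysis (the blob data set and the candidate range for n_clusters are assumptions for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

# Mean Silhouette Coefficient for several candidate numbers of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> chosen n_clusters:", best_k)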

Clustering Validation
If the ground truth labels are not known

2.3.10.6. Calinski-Harabasz Index

If the ground truth labels are not known, the Calinski-Harabasz index (sklearn.metrics.calinski_harabasz_score) - also known as the Variance Ratio Criterion - can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with better defined clusters.

For k clusters, the Calinski-Harabasz score s is given as the ratio of the between-clusters dispersion mean and the within-cluster dispersion:

  s = [tr(B_k) / tr(W_k)] · [(N − k) / (k − 1)]

where B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined by:

  W_k = ∑_{q=1..k} ∑_{x∈C_q} (x − c_q)(x − c_q)^T
  B_k = ∑_{q=1..k} n_q (c_q − c_E)(c_q − c_E)^T

with N the number of points in our data, C_q the set of points in cluster q, c_q the center of cluster q, c_E the center of the data, and n_q the number of points in cluster q.

In normal usage, the Calinski-Harabasz index is applied to the results of a cluster analysis:

>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabasz_score(X, labels)
561.62...

2.3.10.6.1. Advantages

The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
The score is fast to compute.

2.3.10.6.2. Drawbacks

The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

References

Caliński, T., & Harabasz, J. (1974). "A dendrite method for cluster analysis". Communications in Statistics - Theory and Methods 3: 1-27. doi:10.1080/03610926.2011.560741.

Clustering Validation
If the ground truth labels are not known

2.3.10.7. Davies-Bouldin Index

If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used to evaluate the model, where a lower Davies-Bouldin index relates to a model with better separation between the clusters. Zero is the lowest possible score. Values closer to zero indicate a better partition.

The index is defined as the average similarity between each cluster C_i, for i = 1, ..., k, and its most similar one C_j. In the context of this index, similarity is defined as a measure R_ij that trades off:

- s_i, the average distance between each point of cluster i and the centroid of that cluster, also known as the cluster diameter.
- d_ij, the distance between the centroids of clusters i and j.

A simple choice to construct R_ij so that it is nonnegative and symmetric is:

  R_ij = (s_i + s_j) / d_ij

Then the Davies-Bouldin index is defined as:

  DB = (1/k) ∑_{i=1..k} max_{j≠i} R_ij

In normal usage, the Davies-Bouldin index is applied to the results of a cluster analysis as follows:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davies_bouldin_score
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans.labels_
>>> davies_bouldin_score(X, labels)
0.6619...

2.3.10.7.1. Advantages

The computation of Davies-Bouldin is simpler than that of Silhouette scores.
The index is computed using only quantities and features inherent to the dataset.

Clustering Validation
If the ground truth labels are not known

Evaluating partitions by Sum of Squared Error (SSE)

  SSE = ∑_{i=1..K} ∑_{x∈C_i} dist²(m_i, x)

o x is a data point in cluster C_i and m_i is the representative point for cluster C_i
o Given two partitions, we can choose the one with the smallest error
o One easy way to reduce SSE is to increase K, the number of clusters
  • For deciding the cluster number: look at the knee in SSE
  • Still: a good clustering with smaller K can have a lower SSE than a poor clustering with higher K
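A direct NumPy translation of the SSE formula above, taking the representative point m_i to be the centroid of C_i; the data and the two example partitions are made up for illustration:

import numpy as np

def sse(X, labels):
    # SSE = sum over clusters i of sum over x in C_i of dist^2(m_i, x),
    # with m_i taken as the centroid (mean) of cluster C_i.
    total = 0.0
    for i in np.unique(labels):
        C_i = X[labels == i]
        m_i = C_i.mean(axis=0)
        total += np.sum((C_i - m_i) ** 2)
    return total

# Given two partitions of the same data, prefer the one with the smaller SSE.
X = np.random.default_rng(0).random((200, 2))
partition_a = (X[:, 0] > 0.5).astype(int)
partition_b = (X[:, 1] > 0.5).astype(int)
print(sse(X, partition_a), sse(X, partition_b))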


Clustering Validation
If the ground truth labels are not known

Internal Measures: SSE
 Clusters in more complicated figures aren't well separated
 Internal Index: used to measure the goodness of a clustering structure without respect to external information
   SSE
 SSE is good for comparing two clusterings or two clusters (average SSE)
 Can also be used to estimate the number of clusters

(Figure: a ten-cluster point set and the corresponding SSE-versus-K curve; the knee in the curve suggests the number of clusters.)
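A sketch of the knee heuristic mentioned above, using KMeans' inertia_ attribute (which is exactly this SSE) over a range of K; the data set is illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# SSE (inertia) always drops as K grows; look for the knee where the drop flattens.
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))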


2.3.10.4.2. Drawbacks

Contrary to inertia, FMI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).

References

E. B. Fowlkes and C. L. Mallows, 1983. "A method for comparing two hierarchical clusterings". Journal of the American Statistical Association. http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
Wikipedia entry for the Fowlkes-Mallows Index

Internal Measures: Silhouette Coefficient

 Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
 For an individual point, i
   Calculate a = average distance of i to the points in its cluster
   Calculate b = min (average distance of i to points in another cluster)
   The silhouette coefficient for a point is then given by
     s = 1 − a/b  if a < b  (or s = b/a − 1 if a ≥ b, not the usual case)
   Typically between 0 and 1; the closer to 1 the better.
 Can calculate the average silhouette width for a cluster or a clustering.

Clustering Validation
If the ground truth labels are not known

2.3.10.5. Silhouette Coefficient

If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: The mean distance between a sample and all other points in the same class.
b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

  s = (b − a) / max(a, b)

The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.

Figure 3. Confidence interval for (a) two-tailed index, (b) right-tailed index, (c) left-tailed index, where q_ρ is the ρ proportion of q under hypothesis H0. (Theodoridis and Koutroubas, 1999).

Assuming that this shape is right-tailed (figure 3b) and that we have generated the scatter-plot using r values of the index q, called qi, in order to accept or reject the Null Hypothesis H0 we examine the following conditions (Theodoridis and Koutroubas, 1999):

We reject (accept) H0 if q's value for our data set is greater (smaller) than (1 − ρ)·r of the qi values of the respective synthetic data sets Xi.
Assuming that the shape is left-tailed (figure 3c), we reject (accept) H0 if q's value for our data set is smaller (greater) than ρ·r of the qi values.
Assuming that the shape is two-tailed (figure 3a), we accept H0 if q is greater than (ρ/2)·r of the qi values and smaller than (1 − ρ/2)·r of the qi values.
Clustering Validation
If the ground truth labels are known

Based on the external criteria we can work in two different ways. Firstly, we can evaluate the resulting clustering structure C by comparing it to an independent partition of the data P, built according to our intuition about the clustering structure of the data set. Secondly, we can compare the proximity matrix P to the partition P.

4.3.1.2. Comparison of C with partition P (not for hierarchy of clustering). Consider C = {C1 ... Cm} a clustering structure of a data set X and P = {P1 ... Ps} a defined partition of the data. We refer to a pair of points (xv, xu) from the data set using the following terms:

• SS: if both points belong to the same cluster of the clustering structure C and to the same group of partition P. (~TP)
• SD: if points belong to the same cluster of C and to different groups of P.
• DS: if points belong to different clusters of C and to the same group of P.
• DD: if both points belong to different clusters of C and to different groups of P. (~TN)

Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, which is the maximum number of all pairs in the data set (meaning M = N(N − 1)/2, where N is the total number of points in the data set).

Now we can define the following indices to measure the degree of similarity between C and P:

• Rand Statistic: R = (a + d)/M
• Jaccard Coefficient: J = a/(a + b + c)

The above two indices take values between 0 and 1, and are maximized when m = s. Another index is the:

• Folkes and Mallows index:

  FM = a / √(m1 · m2) = √[ (a/(a + b)) · (a/(a + c)) ]        (2)

  where m1 = (a + b), m2 = (a + c).

For the previous three indices it has been proven that high values of the indices indicate great similarity between C and P. The higher the values of these indices are, the more similar C and P are. Other indices are:

• Hubert's Γ statistic:

  Γ = (1/M) ∑_{i=1..N−1} ∑_{j=i+1..N} X(i, j)·Y(i, j)

High values of this index indicate a strong similarity between X and Y.
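A small NumPy sketch that counts the SS, SD, DS and DD pairs directly and evaluates the Rand statistic, Jaccard coefficient and Folkes and Mallows index exactly as defined above; the two example labelings are made up for illustration:

import numpy as np
from itertools import combinations

def pair_counts(C, P):
    # Count the SS (a), SD (b), DS (c) and DD (d) pairs for labelings C and P.
    a = b = c = d = 0
    for i, j in combinations(range(len(C)), 2):
        same_C, same_P = C[i] == C[j], P[i] == P[j]
        if same_C and same_P:
            a += 1          # SS pair (~TP)
        elif same_C:
            b += 1          # SD pair
        elif same_P:
            c += 1          # DS pair
        else:
            d += 1          # DD pair (~TN)
    return a, b, c, d

C = np.array([0, 0, 0, 1, 1, 1, 2, 2])    # clustering structure C
P = np.array([0, 0, 1, 1, 1, 2, 2, 2])    # independent partition P

a, b, c, d = pair_counts(C, P)
M = a + b + c + d                          # = N(N - 1)/2
R = (a + d) / M                            # Rand statistic
J = a / (a + b + c)                        # Jaccard coefficient
FM = a / np.sqrt((a + b) * (a + c))        # Folkes and Mallows index
print(R, J, FM)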
"
