Clustering 2
Clustering
Clustering depends on the selected features: x ∈ ℜ^n, where n is the number of features.
[Figure: processing pipeline, Real World → Sensing → Preprocessing → Feature Extraction (Computation) → Clustering]
Notion of a Cluster can be Ambiguous
Clustering algorithms
o Clustering algorithms come in 2 basic flavors:
• Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree

Clustering Methods
• Hierarchical
• Partitioning
• Density-based
Partitioning Clustering
K-means Clustering
Limitations of K-means: Non-globular Shapes
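Below is a minimal sketch (not from the original slides) illustrating this limitation with scikit-learn; the two-moons dataset and all parameter values are assumptions chosen for the example.

# Sketch: k-means on a non-globular ("two moons") dataset.
# K-means prefers roughly spherical clusters, so it cuts each moon in half.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Agreement with the true moon labels is poor because the clusters are not globular.
print("Adjusted Rand Index:", adjusted_rand_score(y_true, labels))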
Hierarchical Clustering
• Produces a set of nested clusters
organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the
sequences of merges or splits
[Figure: example dendrogram over six points; the vertical axis gives the merge distance (0 to 0.2)]
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or
there are k clusters)
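As a rough sketch (not part of the original slides), agglomerative clustering and the dendrogram can be produced with SciPy; the toy data and the choice of average linkage are assumptions.

# Sketch: agglomerative hierarchical clustering with SciPy.
# linkage() starts from single points and repeatedly merges the closest pair of clusters;
# dendrogram() draws the tree of merges; fcluster() cuts the tree into k flat clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
               rng.normal(loc=3.0, scale=0.3, size=(10, 2))])

Z = linkage(X, method="average")                  # the sequence of merges
flat = fcluster(Z, t=2, criterion="maxclust")     # stop at k = 2 clusters

dendrogram(Z)
plt.show()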
How do we define inter-cluster similarity?
[Figure: proximity matrix over points p1…p5 and the corresponding dendrogram; after some merges we have two clusters and must measure their similarity]
• How to measure the distance between clusters is relevant (MIN, MAX, mean, etc.)
o MIN
o MAX
o Group Average
o Distance Between Centroids
Other methods are driven by an objective function, e.g. Ward linkage:
• Similarity based on the increase in SSE when two clusters are merged
• Similar to group average (uses all points)
• The hierarchical-clustering analogue of k-means (both minimize SSE)
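A brief sketch (not from the slides): the options above map onto the linkage parameter of scikit-learn's AgglomerativeClustering ("single" = MIN, "complete" = MAX, "average" = group average, plus "ward"); the blob data and the silhouette comparison are assumptions.

# Sketch: comparing linkage criteria on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for linkage in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, round(silhouette_score(X, labels), 3))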
Density-based Clustering
DBSCAN
Density Definition
[Figure: ε-neighborhoods of the points p and q]
• ε-Neighborhood of p, ε-Neighborhood of q
• Density of p is “high” (MinPts = 4)
• Density of q is “low” (MinPts = 4)
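A small sketch (not in the original) of this density test: count the points inside each ε-neighborhood with a radius query; ε, MinPts and the random data are assumptions.

# Sketch: a point is a core (high-density) point if its eps-neighborhood
# contains at least MinPts points (the point itself included).
import numpy as np
from sklearn.neighbors import NearestNeighbors

eps, min_pts = 0.5, 4
rng = np.random.default_rng(0)
X = rng.random((50, 2))

nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)

is_core = np.array([len(idx) >= min_pts for idx in neighborhoods])
print("core points:", int(is_core.sum()), "of", len(X))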
Example
ε = 10, MinPts = 4
Density-reachability
• Directly density-reachable
• An object q is directly density-reachable from object p
if p is a core object and q is in p's ε-neighborhood.
Density-reachability
[Figure: chain of points q, p1, p2, p with MinPts = 7]
• p is (indirectly) density-reachable from q
• q is not density-reachable from p
DBSCAN Algorithm: Example
• Parameters: ε = 2 cm, MinPts = 3
DBSCAN Algorithm
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
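The same procedure is available as scikit-learn's DBSCAN; below is a compact sketch (not from the slides), where the two-moons data, eps and min_samples are assumptions.

# Sketch: DBSCAN in scikit-learn. Core objects grow clusters of density-reachable
# points; objects not density-reachable from any core object are labelled -1 (NOISE).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))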
Clustering Validation
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e.,
distinguishing whether non-random structure actually exists
in the data.
2. Comparing the results of a cluster analysis to externally
known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the
data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster
analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
Clustering Validation
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
External Index: Used to measure the extent to which cluster
labels match externally supplied class labels.
Entropy
Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
Sum of Squared Error (SSE)
Relative Index: Used to compare two different clusterings or
clusters.
Often an external or internal index is used for this function, e.g., SSE or
entropy
Sometimes these are referred to as criteria instead of indices
However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
A good clustering: intra-cluster distances are minimized, inter-cluster distances are maximized.
Internal Measures: Silhouette Coefficient
The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates to a standard concept of a
cluster.
Selecting the number of clusters with silhouette analysis on KMeans clustering: in this example the silhouette analysis is used to choose an optimal value for n_clusters.
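A short sketch of that analysis (not part of the original text): compute the mean silhouette score for several candidate values of n_clusters and keep the best; the blob data and the candidate range are assumptions.

# Sketch: using the mean silhouette score to choose n_clusters for k-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, random_state=1, n_init=10).fit_predict(X)
    score = silhouette_score(X, labels)   # in [-1, 1]; values near 0 suggest overlapping clusters
    print(n_clusters, round(score, 3))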
Clustering Validation
Calinski-Harabasz Index (if the ground truth labels are not known)

If the ground truth labels are not known, the Calinski-Harabasz index (sklearn.metrics.calinski_harabasz_score), also known as the Variance Ratio Criterion, can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with better defined clusters.

For k clusters, the Calinski-Harabasz score s is given as the ratio of the between-clusters dispersion mean and the within-cluster dispersion:

    s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{N - k}{k - 1}

where B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined by:

    W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T
    B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^T

with N the number of points in our data, C_q the set of points in cluster q, c_q the center of cluster q, c the center of the data, and n_q the number of points in cluster q.

In normal usage, the Calinski-Harabasz index is applied to the results of a cluster analysis:

>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabasz_score(X, labels)
561.62...
Advantages
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
• The score is fast to compute.
Drawbacks
• The Calinski-Harabasz index is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.

References
• Caliński, T., & Harabasz, J. (1974). "A dendrite method for cluster analysis". Communications in Statistics - Theory and Methods 3: 1-27. doi:10.1080/03610926.2011.560741.
Clustering Validation
Davies-Bouldin Index (if the ground truth labels are not known)

If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used to evaluate the model, where a lower Davies-Bouldin index relates to a model with better separation between the clusters. Zero is the lowest possible score; values closer to zero indicate a better partition.

The index is defined as the average similarity between each cluster C_i (for i = 1, ..., k) and its most similar one C_j. In the context of this index, similarity is defined as a measure R_ij that trades off:
• s_i, the average distance between each point of cluster i and the centroid of that cluster, also known as the cluster diameter
• d_ij, the distance between cluster centroids i and j

A simple choice to construct R_ij so that it is nonnegative and symmetric is:

    R_{ij} = \frac{s_i + s_j}{d_{ij}}

The Davies-Bouldin index is then the average, over all clusters, of the similarity to the most similar cluster:

    DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}

In normal usage, the Davies-Bouldin index is applied to the results of a cluster analysis as follows:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davies_bouldin_score
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans.labels_
>>> davies_bouldin_score(X, labels)
0.6619...
Clustering Validation
Internal Measures: SSE (if the ground truth labels are not known)

Evaluating partitions by the Sum of Squared Error (SSE)
• Internal Index: used to measure the goodness of a clustering structure without respect to external information.

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

where x is a data point in cluster Ci and mi is the representative point for cluster Ci.

• SSE is good for comparing two clusterings or two clusters (average SSE).
  o Given two partitions, we can choose the one with the smallest error.
• Clusters in more complicated figures aren't well separated.
• SSE can also be used to estimate the number of clusters:
  o For deciding the cluster number, look at the knee in the SSE-vs-K curve.
  o Still: a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.

[Figure: an example dataset and its SSE plotted against the number of clusters K; the knee of the curve suggests the number of clusters]
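A minimal sketch of the knee heuristic (not from the slides): with Euclidean distance and m_i the cluster centroid, k-means' inertia_ attribute is exactly the SSE above; the blob data and the range of K are assumptions.

# Sketch: SSE ("inertia" in scikit-learn) as a function of K; look for the knee.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    # inertia_ = sum of squared distances of samples to their closest centroid (the SSE)
    print(k, round(km.inertia_, 1))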
Fowlkes-Mallows Index (FMI): Drawbacks
Contrary to inertia, FMI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
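For completeness, a tiny sketch (not in the original) of the FMI computation this refers to, assuming ground-truth labels are available (here the iris species labels):

# Sketch: the Fowlkes-Mallows index compares a clustering against known class labels.
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

iris = datasets.load_iris()
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(iris.data)

print("FMI:", round(fowlkes_mallows_score(iris.target, labels), 3))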
References
• E. B. Fowlkes and C. L. Mallows, 1983. "A method for comparing two hierarchical clusterings". Journal of the American Statistical Association. http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
• Wikipedia entry for the Fowlkes-Mallows Index
Internal Measures: Silhouette Coefficient
• The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
• We can calculate the average silhouette width for a single cluster or for a whole clustering.
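A short sketch (not from the slides) of that per-cluster average using scikit-learn's silhouette_samples; the blob data are an assumption.

# Sketch: per-sample silhouette values, averaged per cluster and over the whole clustering.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

s = silhouette_samples(X, labels)
for c in np.unique(labels):
    print("cluster", c, "average silhouette width:", round(float(s[labels == c].mean()), 3))
print("overall silhouette coefficient:", round(float(s.mean()), 3))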
Statistical testing of validity indices (Theodoridis and Koutroubas, 1999)
[Figure 3: confidence interval for (a) a two-tailed index, (b) a right-tailed index, (c) a left-tailed index, where q_ρ^p is the ρ proportion of q under hypothesis H0]
Assuming that this shape is right-tailed (figure 3b) and that we have generated the scatter-plot using r values of the index q, called q_i, in order to accept or reject the Null Hypothesis H0 we examine the following conditions (Theodoridis and Koutroubas, 1999):