Clustering Part1

Uploaded by

ankityadav10291
Clustering Techniques

1
What is Cluster Analysis?
• Given a set of objects, place them in groups such that the objects in
a group are similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.

2
Applications of Cluster Analysis

• Understanding
• Group related documents for
browsing, group genes and proteins
that have similar functionality, or
group stocks with similar price
fluctuations

• Summarization
• Reduce the size of large data sets

[Figure: Clustering precipitation in Australia]
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
4
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

5
Notion of a Cluster can be Ambiguous

[Figure: the same set of points grouped as two, four, or six clusters. How many clusters?]

6
Types of Clustering
• A clustering is a set of clusters

• Important distinction between hierarchical and partitional sets of clusters

• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters)

• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree

7
Partitional Clustering

[Figure: the original points and a partitional clustering of them]

8
Hierarchical Clustering

[Figure: a traditional hierarchical clustering with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram]

9
Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
• In non-exclusive clusterings, points may belong to multiple clusters.
• Can belong to multiple classes or could be ‘border’ points
• Fuzzy clustering (one type of non-exclusive)
• In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
• Weights must sum to 1
• Probabilistic clustering has similar characteristics
• Partial versus complete
• In some cases, we only want to cluster some of the data

10
Types of Clusters
• Well-separated clusters

• Prototype-based clusters

• Contiguity-based clusters

• Density-based clusters

• Described by an Objective Function

11
Types of Clusters: Well-Separated
• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer (or more
similar) to every other point in the cluster than to any point not in the
cluster.

3 well-separated clusters

12
Types of Clusters: Prototype-Based
• Prototype-based
• A cluster is a set of objects such that an object in a cluster is closer (more
similar) to the prototype or “center” of a cluster, than to the center of any
other cluster
• The center of a cluster is often a centroid, the average of all the points in the
cluster, or a medoid, the most “representative” point of a cluster
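The difference between a centroid and a medoid can be sketched in a few lines of Python. The sample cluster below is made up for illustration, Euclidean distance is assumed, and the medoid is taken to be the member with the smallest total distance to the other members, which is one common reading of "most representative".

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of the points (need not be a data point)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def medoid(points):
    """The data point with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))

# Hypothetical cluster with one far-away member.
cluster = [(1, 1), (2, 1), (3, 1), (4, 1), (15, 1)]
print(centroid(cluster))  # (5.0, 1.0): pulled toward the far point
print(medoid(cluster))    # (3, 1): always an actual member
```

The far-away member drags the centroid to a location where no object lies, while the medoid stays on a central object.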

4 center-based clusters

13
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or Transitive)
• A cluster is a set of points such that a point in a cluster is closer (or more
similar) to one or more other points in the cluster than to any point not in
the cluster.

8 contiguous clusters

14
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points, which is separated by low-density
regions, from other regions of high density.
• Used when the clusters are irregular or intertwined, and when noise and
outliers are present.

6 density-based clusters

15
Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and evaluate
the 'goodness' of each potential set of clusters by using the given objective
function (this exhaustive search is NP-hard).
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data to a
parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of statistical
distributions.

16
Characteristics of the Input Data Are Important
• Type of proximity or density measure
• Central to clustering
• Depends on data and application

• Data characteristics that affect proximity and/or density are
• Dimensionality
• Sparseness
• Attribute type
• Special relationships in the data
• For example, autocorrelation
• Distribution of the data

• Noise and outliers
• Often interfere with the operation of the clustering algorithm
• Clusters of differing sizes, densities, and shapes

17
What Is Good Clustering?
• A good clustering method will produce high
quality clusters with
• high intra-class similarity
• low inter-class similarity
• The quality of a clustering result depends on
both the similarity measure used by the
method and its implementation.

18
Clustering Algorithms
• K-means and its variants

• Hierarchical clustering

• Density-based clustering

19
Partitioning Clustering Approach
• Partitioning algorithms construct a partition of a database of N objects into
a set of K clusters.
• A partitioning clustering algorithm usually adopts the iterative
optimization paradigm.
• It starts with an initial partition and uses an iterative control strategy.
• It tries swapping data points to see whether the swap improves the quality of
the clustering.
• When no swap yields an improvement, it has found a locally
optimal partitioning.
• In principle, the optimal partition is achieved by minimizing the sum of
squared distances (e.g., Euclidean distance) from each object to the
"representative object" of its cluster.
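Written out, the criterion described above takes roughly the following form. The symbols are assumptions introduced here for illustration: C_k is the k-th cluster, m_k its representative object, and d the chosen distance (e.g., Euclidean).

```latex
\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x \in C_k} d(x, m_k)^2
```

A partition with a smaller SSE keeps objects closer to their representatives; iterative optimization usually reaches only a local minimum of this quantity.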

20
K-means algorithm
• Given the cluster number K, the K-means algorithm
is carried out in three steps after initialization:
• Initialization: set seed points (randomly)
1. Assign each object to the cluster of the nearest seed
point, measured with a specific distance metric
2. Compute new seed points as the centroids of the clusters
of the current partition (the centroid is the centre, i.e.,
the mean point, of the cluster)
3. Go back to Step 1; stop when there are no new assignments
(i.e., membership in each cluster no longer changes)
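The three steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: Euclidean distance is assumed, the seeds are passed in explicitly instead of being drawn at random (so the run is repeatable), and the function name `kmeans` and the sample points are made up.

```python
from math import dist

def kmeans(points, seeds, max_iter=100):
    """Plain k-means; returns (centroids, labels)."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the nearest seed point / centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Step 3: stop when membership no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two small groups, seeded with one point from each.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(pts, seeds=[pts[0], pts[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Random initialization as described on the slide could be obtained with `random.sample(points, k)`; different seeds may lead to different local optima.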

21
K-means - Example
• Problem:
• Suppose we have 4 types of medicines and each has two attributes (pH and
weight index). Our goal is to group these objects into K=2 groups of
medicines.

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4

[Figure: scatter plot of medicines A, B, C, and D]

22
K-means - Example
• Step 1: Use initial seed points for partitioning

Assign each object to the cluster with the nearest seed point

23
K-means - Example
• Step 2: Compute new centroids of the current partition
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.

24
K-means - Example
• Step 2: Renew membership based on new centroids

Compute the distance of all objects to the new centroids

Assign the membership to objects

25
K-means - Example
• Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.

26
K-means - Example
• Step 3: Repeat the first two steps until convergence

Compute the distance of all objects to the new centroids

Stop because there is no new assignment: membership in each cluster no
longer changes
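The medicine example can be traced with a short script. The coordinates come from the table of medicines A to D; the choice of A and B as initial seed points is an assumption made here, because the seeds were shown only in a figure, and Euclidean distance is likewise assumed.

```python
from math import dist

# The two attribute values of each medicine, as listed in the example table.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [medicines["A"], medicines["B"]]  # assumed seeds, not given in the text

membership = None
iteration = 0
while True:
    iteration += 1
    # Assign every medicine to its nearest centroid (Euclidean distance).
    new_membership = {
        name: min((0, 1), key=lambda k: dist(xy, centroids[k]))
        for name, xy in medicines.items()
    }
    if new_membership == membership:
        break  # no new assignment: converged
    membership = new_membership
    # Recompute each centroid as the mean of its current members.
    for k in (0, 1):
        pts = [medicines[n] for n, g in membership.items() if g == k]
        centroids[k] = tuple(sum(c) / len(pts) for c in zip(*pts))
    print(iteration, membership, centroids)
```

With these seeds, the first pass groups B, C, and D together, the second pass moves B over to A, and the third pass changes nothing, so the run ends with the groups {A, B} and {C, D} and centroids (1.5, 1.0) and (4.5, 3.5).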

27
Strengths of k-means
• Strengths:
• Simple: easy to understand and to implement
• Efficient: Time complexity: O(tkn), where n is the
number of data points, k is the number of clusters,
and t is the number of iterations.

28
Weaknesses of k-means
• The algorithm is only applicable if the mean is
defined.
• The user needs to specify k.
• The algorithm is sensitive to outliers
• Outliers are data points that are very far away
from other data points.
• Outliers could be errors in the data recording or
some special data points with very different
values.

29
Weaknesses of k-means: Problems with outliers

30
Weaknesses of k-means: To deal with outliers
• One method is to remove some data points
in the clustering process that are much
further away from the centroids than other
data points.
• To be safe, we may want to monitor these
possible outliers over a few iterations and then
decide to remove them.
• Another method is to perform random
sampling. Since in sampling we only choose
a small subset of the data points, the
chance of selecting an outlier is very small.
• Assign the rest of the data points to the clusters
by distance or similarity comparison, or
classification

31
Weaknesses of k-means
• The algorithm is sensitive to initial seeds.

32
Weaknesses of k-means
• If we use different seeds: good results
There are some
methods to help
choose good seeds

33
Weaknesses of k-means
• The k-means algorithm is not suitable for discovering clusters that
are not hyper-ellipsoids (or hyper-spheres).

34
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids)
• The algorithm is intended to find a sequence of objects
called medoids that are centrally located in clusters
• The goal of the algorithm is to minimize the average
dissimilarity of objects to their closest selected object.
• PAM works effectively for small data sets, but does not
scale well for large data sets
• CLARA
• CLARANS

35
PAM Partition Around Medoids
1) Pick a number, k, of random data items as medoids
2) Calculate TCmn, the total cost of swapping, for the pair (m, n) of
medoid/non-medoid with the smallest impact on clustering quality
3) If TCmn < 0, replace m by n and go back to 2
4) Assign every item to its nearest medoid

36
Swapping Cost
• For each pair of a medoid m and a non-medoid object
h, measure whether h is better than m as a medoid
• For example, we can use the squared-error criterion

• Compute Eh-Em
• Negative: swapping brings benefit
• Choose the minimum swapping cost
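A compact sketch of this swap-based search is shown below. It makes several assumptions beyond the slides: the cost of a medoid set is the total Euclidean distance from every object to its closest medoid (a squared-error criterion would work the same way), the swap with the most negative cost E_h - E_m is applied at every round, and the function names and sample points are invented for the example.

```python
from math import dist

def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid (by index)."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, initial_medoids):
    """Apply the best cost-reducing swap until no swap has negative cost."""
    medoids = list(initial_medoids)  # indices into points
    while True:
        current = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        for i in range(len(medoids)):           # medoid m = medoids[i]
            for h in range(len(points)):        # non-medoid candidate h
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                delta = total_cost(points, candidate) - current  # E_h - E_m
                if delta < best_delta:          # negative: swapping brings benefit
                    best_delta, best_swap = delta, (i, h)
        if best_swap is None:
            return sorted(medoids)
        i, h = best_swap
        medoids[i] = h

# Hypothetical data: two groups; both initial medoids sit in the first group.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
result = pam(pts, initial_medoids=[0, 1])
print(result)  # [0, 3]
```

Every round evaluates k(n-k) candidate swaps, each against all n objects, which is why this approach becomes expensive on large data sets.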

37
K-medoids Example
Assume k=2. Select X5 and X9 as medoids. Distances are Manhattan distances.

Point   Coordinates   Distance to X5   Distance to X9
X1      (2, 6)        8                7
X2      (3, 4)        5                6
X3      (3, 8)        9                8
X4      (5, 7)        6                5
X5      (6, 2)        0                5
X6      (6, 4)        2                3
X7      (7, 3)        2                3
X8      (7, 4)        3                2
X9      (8, 5)        5                0
X10     (7, 6)        5                2

Current clustering: {X2,X5,X6,X7},{X1,X3,X4,X8,X9,X10}

38
K-medoids Example
• So, now let us choose some other point to be a medoid instead of X5 (6, 2). Let us
randomly choose X1 (2, 6).
• Now the new medoid set is (2, 6) and (8, 5). Repeating the same task as earlier, where
"Before" is the distance to the closest medoid before the swap and "Change" is the new
closest distance minus "Before":

Replace X5 by X1
Point   Before   To X1   To X9   Change
X1      7        0       7       -7
X2      5        3       6       -2
X3      8        3       8       -5
X4      5        4       5       -1
X5      0        8       5       5
X6      2        6       3       1
X7      2        8       3       1
X8      2        7       2       0
X9      0        7       0       0
X10     2        5       2       0
Total                            -8

The total change is negative, so the swap is accepted.
Current clustering: {X1,X2,X3,X4},{X5,X6,X7,X8,X9,X10}
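The swap in this example can be checked by recomputing the clustering cost directly from the ten coordinates. The sketch assumes Manhattan distance, which is what the listed distances correspond to, and the helper names `manhattan` and `cost` are invented here.

```python
points = {
    "X1": (2, 6), "X2": (3, 4), "X3": (3, 8), "X4": (5, 7), "X5": (6, 2),
    "X6": (6, 4), "X7": (7, 3), "X8": (7, 4), "X9": (8, 5), "X10": (7, 6),
}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def cost(medoids):
    """Total distance from every point to its closest medoid."""
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points.values())

def clusters(medoids):
    """Group point names by their closest medoid."""
    groups = {m: [] for m in medoids}
    for name, p in points.items():
        groups[min(medoids, key=lambda m: manhattan(p, points[m]))].append(name)
    return groups

before = cost(["X5", "X9"])
after = cost(["X1", "X9"])
print(before, after, after - before)  # 33 25 -8
print(clusters(["X1", "X9"]))
```

A negative difference means that replacing X5 by X1 lowers the total dissimilarity, so the swap is kept.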

39
K-medoids Properties

40
CLARA (Clustering Large Applications)
• CLARA (Clustering Large
Applications) uses a
sampling-based method to deal
with large data sets
• A random sample should closely
represent the original data
• The chosen medoids will likely be
similar to what would have been
chosen from the whole data set

41
CLARA (Clustering Large Applications)
• Draw multiple samples of the data set
• Apply PAM to each sample
• Return the best clustering

42
CLARA Properties

43
CLARA - Algorithm
• Set mincost to MAXIMUM;
• Repeat q times // draws q samples
• Create S by drawing s objects randomly from D;
• Generate the set of medoids K from S by applying the
PAM algorithm;
• Compute cost(K, D);
• If cost(K, D) < mincost
mincost = cost(K, D);
bestset = K;
• Endif;
• Endrepeat;
• Return bestset;
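The pseudocode above might translate to Python roughly as follows. The embedded `pam` helper is a simplified greedy-swap stand-in for the PAM algorithm, cost is taken to be total Euclidean distance to the closest medoid, and the default values of q and s as well as the test data are arbitrary assumptions.

```python
import random
from math import dist

def cost(points, medoids):
    """cost(K, D): total distance from every object to its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(sample, k, rng):
    """Simplified PAM on a sample: keep applying the best improving swap."""
    medoids = rng.sample(sample, k)
    improved = True
    while improved:
        improved = False
        best = cost(sample, medoids)
        for i in range(k):
            for h in sample:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                c = cost(sample, trial)
                if c < best:
                    best, best_trial, improved = c, trial, True
        if improved:
            medoids = best_trial
    return medoids

def clara(data, k, q=5, s=30, seed=0):
    rng = random.Random(seed)
    mincost, bestset = float("inf"), None
    for _ in range(q):                                # draw q samples
        sample = rng.sample(data, min(s, len(data)))  # s objects from D
        medoids = pam(sample, k, rng)                 # medoids K from S
        c = cost(data, medoids)                       # cost(K, D) on ALL data
        if c < mincost:
            mincost, bestset = c, medoids
    return bestset, mincost

# Hypothetical data: two 5x5 grids of points placed far apart.
data = [(x, y) for x in range(5) for y in range(5)]
data += [(x + 1000, y + 1000) for x in range(5) for y in range(5)]
bestset, mincost = clara(data, k=2)
print(bestset, round(mincost, 2))
```

Note that PAM only ever sees the sample, while the quality of its medoids is judged on the whole data set; this is what keeps the expensive part of the computation small.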

44
Complexity of CLARA
Step                                               Cost
• Set mincost to MAXIMUM;                          O(1)
• Repeat q times                                   O(t*k*(s-k)^2 + k*(n-k)) per repetition
• Create S by drawing s objects randomly from D;   O(1)
• Generate the set of medoids K from S
by applying the PAM algorithm;                     O(t*k*(s-k)^2)
• Compute cost(K, D)                               O(k*(n-k))
• If cost(K, D) < mincost                          O(1)
mincost = cost(K, D);
bestset = K;
Endif;
• Endrepeat;
• Return bestset;
45
CLARANS (“Randomized” CLARA)
• CLARANS (A Clustering Algorithm based on
Randomized Search)
• The clustering process can be presented as searching a
graph where every node is a potential solution, that
is, a set of k medoids
• Two nodes are neighbours if their sets differ by only
one medoid
• Each node can be assigned a cost that is defined to be
the total dissimilarity between every object and the
medoid of its cluster
• The problem corresponds to search for a minimum on
the graph
• At each step, all neighbours of the current node are
searched; the neighbour which corresponds to the
deepest descent in cost is chosen as the next solution

46
CLARANS (“Randomized” CLARA)
• CLARANS (A Clustering Algorithm
based on Randomized Search)
• The clustering process can be
presented as searching a graph
where every node is a potential
solution, that is, a set of k medoids
• Graph Abstraction
• Every node is a potential solution
(k-medoid)
• Two nodes are adjacent if they differ
by one medoid
• Every node has k(n−k) adjacent nodes
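The neighbour count can be confirmed by enumeration. In this small sketch a node is a set of k object indices out of n, and the function name `neighbours` is invented for illustration.

```python
def neighbours(node, n):
    """All medoid sets that differ from `node` in exactly one medoid."""
    node = frozenset(node)
    return {
        (node - {m}) | {h}
        for m in node                 # medoid to drop
        for h in range(n)             # non-medoid to add
        if h not in node
    }

n, k = 6, 2
adjacent = neighbours({0, 1}, n)
print(len(adjacent), k * (n - k))  # 8 8
```

Each of the k medoids can be exchanged for any of the n-k non-medoids, which gives the k(n-k) adjacent nodes.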

47
CLARANS (“Randomized” CLARA)
• For large values of n and k, examining all k(n-k) neighbours is time
consuming.
• At each step, CLARANS draws a sample of neighbours to examine.
• Note that CLARA draws a sample of nodes at the beginning of the
search; therefore, CLARANS has the benefit of not confining the
search to a restricted area.
• If a local optimum is found, CLARANS starts with a new randomly
selected node in search of a new local optimum. The number of
local optima to search for is a parameter.
• It is more efficient and scalable than both PAM and CLARA, and returns
higher quality clusters.
48
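CLARANS
[Figure: CLARANS search. The current node C is compared with no more than maxneighbor randomly selected neighbours N; the search moves to a neighbour whose cost is lower than that of C and otherwise ends in a local minimum. This is repeated numlocal times, and the best node among the local minima is returned.]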
49
CLARANS - Algorithm
• Set mincost to MAXIMUM;
• For i = 1 to h do // find h local optima
• Randomly select a node as the current node C in the graph;
• J = 1; // counter of neighbors
• Repeat
Randomly select a neighbor N of C;
If Cost(N, D) < Cost(C, D)
Assign N as the current node C;
J = 1;
Else J++;
Endif;
• Until J > m
• Update mincost (and bestnode) with Cost(C, D) if applicable;
• End For
• Return bestnode;
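A possible Python rendering of this pseudocode is given below. It assumes that cost is the total Euclidean distance to the closest medoid, uses the parameter names `numlocal` (h in the pseudocode) and `maxneighbor` (m in the pseudocode), and fixes the random seed so that a run is repeatable; the default values and the sample data are arbitrary.

```python
import random
from math import dist

def cost(data, node):
    """Cost(node, D): total distance from every object to its closest medoid."""
    return sum(min(dist(p, data[m]) for m in node) for p in data)

def clarans(data, k, numlocal=4, maxneighbor=50, seed=0):
    rng = random.Random(seed)
    n = len(data)
    mincost, bestnode = float("inf"), None
    for _ in range(numlocal):                  # find numlocal local optima
        current = rng.sample(range(n), k)      # random current node C
        current_cost = cost(data, current)
        j = 1                                  # counter of neighbours tried
        while j <= maxneighbor:                # repeat until J > m
            # Random neighbour N: swap one medoid for one non-medoid.
            neighbour = list(current)
            neighbour[rng.randrange(k)] = rng.choice(
                [h for h in range(n) if h not in current]
            )
            neighbour_cost = cost(data, neighbour)
            if neighbour_cost < current_cost:
                current, current_cost, j = neighbour, neighbour_cost, 1
            else:
                j += 1
        if current_cost < mincost:             # update mincost if applicable
            mincost, bestnode = current_cost, sorted(current)
    return bestnode, mincost

# Hypothetical data: two small groups placed far apart.
data = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
bestnode, mincost = clarans(data, k=2)
print(bestnode, round(mincost, 2))
```

Unlike PAM, no step looks at all k(n-k) neighbours; a node is declared a local optimum once `maxneighbor` randomly drawn neighbours in a row fail to improve on it.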
50
Hierarchical Clustering
• Hierarchical Clustering Approach
• A typical clustering analysis approach via partitioning data set
sequentially
• Construct nested partitions layer by layer via grouping objects into
a tree of clusters (without the need to know the number of
clusters in advance)
• Use (generalised) distance matrix as clustering criteria
• Agglomerative vs Divisive
• Two sequential clustering strategies for constructing a tree of clusters
• Agglomerative: a bottom-up strategy
• Initially each data object is in its own (atomic) cluster
• Then merge these atomic clusters into larger and larger clusters
• Divisive: a top-down strategy
• Initially all objects are in one single cluster
• Then the cluster is subdivided into smaller and smaller clusters

51
Hierarchical Clustering
• Agglomerative approach
Initialization:
Each object is a cluster
Iteration:
Merge two clusters which are most similar to each other;
Until all objects are merged into a single cluster

[Figure: bottom-up merging over Step 0 to Step 4; objects a, b, c, d, e are merged into ab, de, cde, and finally abcde]

52
Hierarchical Clustering
• Divisive approach
Initialization:
All objects stay in one cluster
Iteration:
Select a cluster and split it into two sub-clusters
Until each leaf cluster contains only one object

[Figure: top-down splitting over Step 0 to Step 4; abcde is split into ab and cde, cde into c and de, and so on down to the single objects a, b, c, d, e]

53
Dendrogram
• A binary tree that shows how clusters are
merged/split hierarchically
• Each node on the tree is a cluster; each leaf
node is a singleton cluster

54
Dendrogram
• A clustering of the data objects is obtained by
cutting the dendrogram at the desired level,
then each connected component forms a cluster

55
How to Merge Clusters?
• How to measure the distance between clusters?
• Single-link
• Complete-link
• Average-link
• Centroid distance

Hint: Distance between clusters is usually defined on the basis of distance between objects.

57
How to Define Inter-Cluster Distance

• Single-link: The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.

58
How to Define Inter-Cluster Distance

• Complete-link: The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters.

59
How to Define Inter-Cluster Distance

• Average-link: The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters.

60
How to Define Inter-Cluster Distance

• Centroid distance: The distance between two clusters is represented by the distance between the means of the clusters (mi, mj are the means of Ci, Cj).
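The four inter-cluster distances can be summarized in formulas. The notation below is an assumption chosen for illustration: d(p, q) is the distance between two objects, n_i and n_j are the numbers of objects in C_i and C_j, and m_i and m_j are the cluster means.

```latex
\begin{aligned}
d_{\min}(C_i, C_j)  &= \min_{p \in C_i,\; q \in C_j} d(p, q)
  && \text{(single-link)} \\
d_{\max}(C_i, C_j)  &= \max_{p \in C_i,\; q \in C_j} d(p, q)
  && \text{(complete-link)} \\
d_{avg}(C_i, C_j)   &= \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)
  && \text{(average-link)} \\
d_{mean}(C_i, C_j)  &= d(m_i, m_j)
  && \text{(centroid distance)}
\end{aligned}
```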

61
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature, assume that there
are two clusters: C1: {a, b} and C2: {c, d, e}.

Object    a  b  c  d  e
Feature   1  2  4  5  6

1. Calculate the distance matrix.

     a  b  c  d  e
a    0  1  3  4  5
b    1  0  2  3  4
c    3  2  0  1  2
d    4  3  1  0  1
e    5  4  2  1  0

2. Calculate three cluster distances between C1 and C2: single link, complete link, and average.
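The three cluster distances for this example can be worked out with a few lines of Python. The feature values and the two clusters are taken from the example; the variable names are arbitrary.

```python
values = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
C1, C2 = ["a", "b"], ["c", "d", "e"]

# Distances of all pairs with one object in C1 and the other in C2.
pair_distances = [abs(values[p] - values[q]) for p in C1 for q in C2]

single = min(pair_distances)                         # closest pair (b, c)
complete = max(pair_distances)                       # farthest pair (a, e)
average = sum(pair_distances) / len(pair_distances)  # mean over all 6 pairs

print(single, complete, average)  # 2 5 3.5
```

The six cross-cluster distances are 3, 4, 5, 2, 3, and 4, so single link gives 2, complete link gives 5, and the average is 21/6 = 3.5.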

62
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in three steps:
1) Convert all object features into
a distance matrix
2) Set each object as a cluster
(thus if we have N objects, we
will have N clusters at the
beginning)
3) Repeat until the number of clusters
is one (or a known number of clusters)
▪ Merge two closest clusters
▪ Update “distance matrix”
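The three steps can be sketched as below. The sketch assumes Euclidean distance and single-link merging; instead of updating the distance matrix in place after each merge, it recomputes cluster distances from the object-level matrix, which gives the same result for single link but is slower. The function name `agglomerative` and the one-dimensional test points are illustrative.

```python
from math import dist

def agglomerative(points, target_clusters=1):
    """Single-link agglomerative clustering; returns (clusters, merges)."""
    # Step 1: convert object features into a distance matrix.
    D = [[dist(p, q) for q in points] for p in points]
    # Step 2: every object starts as its own cluster (lists of indices).
    clusters = [[i] for i in range(len(points))]

    def link(a, b):  # single-link distance between two clusters
        return min(D[i][j] for i in a for j in b)

    merges = []
    # Step 3: repeat until the requested number of clusters remains.
    while len(clusters) > target_clusters:
        d, x, y = min(
            (link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merges.append((clusters[x], clusters[y], d))
        clusters[x] = clusters[x] + clusters[y]  # merge the two closest clusters
        del clusters[y]
    return clusters, merges

# Five objects with a single feature: 1, 2, 4, 5, 6.
pts = [(1,), (2,), (4,), (5,), (6,)]
clusters, merges = agglomerative(pts, target_clusters=2)
print(clusters)  # [[0, 1], [2, 3, 4]]
```

The recorded merge distances are what a dendrogram plots on its vertical axis; passing `target_clusters=1` continues merging until a single cluster is left.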

63
Example
• Problem: clustering analysis with agglomerative algorithm

[Figure: the data matrix is converted into a distance matrix using Euclidean distance]
64
Example
• Merge two closest clusters (iteration 1)

65
Example
• Update distance matrix (iteration 1)

66
Example
• Merge two closest clusters (iteration 2)

67
Example
• Update distance matrix (iteration 2)

68
Example
• Merge two closest clusters/update distance matrix (iteration 3)

69
Example
• Merge two closest clusters/update distance matrix (iteration 4)

70
Example
• Final result (meeting termination condition)

71
Example
• Dendrogram tree representation
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B))
at distance 2.50
7. The last cluster contains all the objects, which concludes the computation

[Figure: dendrogram; horizontal axis: object, vertical axis: lifetime]
72
Example
• Dendrogram tree representation
• For a dendrogram tree, the horizontal axis
indexes all objects in a given data set, while
the vertical axis expresses the lifetime of all
possible cluster formations.
• The lifetime of an individual cluster
in the dendrogram is defined as the distance
interval from the moment that the cluster is
created to the moment that it disappears by
merging with other clusters.

[Figure: dendrogram; horizontal axis: object, vertical axis: lifetime]
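Using the merge distances listed in the six-object example (0.50, 0.71, 1.00, 1.41, 2.50), cluster lifetimes can be computed as in the following sketch. The short cluster labels and the two dictionaries are illustrative bookkeeping, not part of the original example.

```python
# Distance at which each intermediate cluster is created ...
created = {"DF": 0.50, "AB": 0.71, "DFE": 1.00, "DFEC": 1.41}
# ... and the distance at which it disappears by merging into a larger cluster.
absorbed_at = {"DF": 1.00, "AB": 2.50, "DFE": 1.41, "DFEC": 2.50}

lifetime = {c: round(absorbed_at[c] - created[c], 2) for c in created}
print(lifetime)  # {'DF': 0.5, 'AB': 1.79, 'DFE': 0.41, 'DFEC': 1.09}
```

Under this definition (A, B) is the longest-lived intermediate cluster: it forms at 0.71 and survives until the final merge at 2.50.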
73
Hierarchical Clustering: Comparison
[Figure: nested clusters of the same six points as produced by MIN, MAX, and Group Average linkage]

74
Which Distance Measure is Better?
• Each method has both advantages and disadvantages, and the best choice is
application-dependent; single-link and complete-link
are the most common methods
• Single-link
• Can find irregular-shaped clusters
• Sensitive to outliers; suffers from the so-called chaining effect
• To merge two groups, only one pair of points needs to be
close, irrespective of all others. Clusters can therefore be too spread
out and not compact enough
• Average-link and centroid distance
• Robust to outliers
• Tend to break large clusters

75
AGNES
• AGNES : Agglomerative Nesting
• Use single-link method
• Merge nodes that have the least dissimilarity
• Eventually all objects belong to the same cluster

76
UPGMA
• UPGMA: Un-weighted Pair-Group Method Average.
• Merge Strategy:
• Average-link approach
• The distance between two clusters is measured by the average distance
between two objects belonging to different clusters.

Average distance:

$$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$

n_i, n_j: the number of objects in clusters C_i, C_j.

77
DIANA
• DIANA: Divisive Analysis
• First, all of the objects form one cluster.
• The cluster is split according to some principle, such as the maximum
Euclidean distance between the closest neighboring objects in the
cluster.
• The cluster splitting process repeats until, eventually, each new
cluster contains a single object, or a termination condition is met.

78
Splitting Process of DIANA
1. Choose the object Oh which is most dissimilar to the other objects in C.
2. Let C1={Oh}, C2=C-C1.
3. For each object Oi in C2, determine whether it is closer to C1 or to the
other objects in C2 (its D score Di).
4. Choose the object Ok with the greatest D score.
5. If Dk>0, move Ok from C2 to C1, and repeat 3-5.
6. Otherwise, stop the splitting process.

[Figure: cluster C being split step by step into C1 and C2]
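One way to realize this splitting process in Python is sketched below. The definition of the D score is an assumption, since the formula is not given in the text: here D_i is the average distance from O_i to the other objects in C2 minus its average distance to the objects in C1, so a positive score means O_i is closer to C1. "Most dissimilar" in step 1 is likewise read as the largest average distance to the other objects, Euclidean distance is assumed, and the sample points are made up.

```python
from math import dist

def diana_split(points):
    """Split one cluster into (C1, C2), returned as sorted index lists."""
    def avg_dist(i, group):
        others = [j for j in group if j != i]
        return sum(dist(points[i], points[j]) for j in others) / len(others)

    c2 = list(range(len(points)))
    # Step 1: object most dissimilar to the rest (largest average distance).
    oh = max(c2, key=lambda i: avg_dist(i, c2))
    # Step 2: C1 = {Oh}, C2 = C - C1.
    c1 = [oh]
    c2.remove(oh)
    while len(c2) > 1:
        # Step 3: D score of every object still in C2 (assumed definition).
        scores = {i: avg_dist(i, c2) - avg_dist(i, c1) for i in c2}
        # Step 4: object with the greatest D score.
        ok = max(scores, key=scores.get)
        # Step 6: stop when no object is closer to C1 than to the rest of C2.
        if scores[ok] <= 0:
            break
        # Step 5: move Ok from C2 to C1 and repeat.
        c1.append(ok)
        c2.remove(ok)
    return sorted(c1), sorted(c2)

# Hypothetical cluster that contains two visible groups.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
c1, c2 = diana_split(pts)
print(c1, c2)  # [0, 1, 2] [3, 4, 5]
```

Applying the same split recursively to the resulting sub-clusters yields the top-down hierarchy.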

79
