DMDW Unit 5: Cluster Analysis
(Slides based on Tan, Steinbach, Kumar, Introduction to Data Mining)

UNIT-V

Cluster Analysis

Lecture No:1

Dr T V MADHUSUDHANA RAO



Contents

Introduction

Types of Clustering
Types of Clusters
Applications



Introduction

 Cluster: a division of the data objects into conceptually meaningful groups
 Clustering is an unsupervised learning task
 Clustering can be applied to unlabeled data



What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to
one another and different from (or unrelated to) the objects in other groups.

Intra-cluster distances are minimized; inter-cluster distances are maximized.



Notion of a Cluster can be Ambiguous
The figure shows twenty points and three different ways of dividing them into clusters.
[Figure: the same points grouped into, e.g., four and six clusters]



Types of Clusterings

 Clustering: an entire collection of clusters is commonly referred to as a clustering
 Types:
– Hierarchical(nested) versus Partitional(unnested)
– Exclusive versus Overlapping versus Fuzzy
– Complete versus Partial



Partitional Clustering
It is simply a division of the set of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

[Figure: original points and a partitional clustering of them]



Hierarchical Clustering
 Produces a set of nested clusters organized as a hierarchical tree
 Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or splits

[Figure: a nested clustering of six points and the corresponding dendrogram, with merge heights on the vertical axis]



Exclusive versus Overlapping versus Fuzzy

 Exclusive versus overlapping (or non-exclusive)
– In exclusive clustering, each object is assigned to a single cluster
– In non-exclusive clusterings, an object can simultaneously belong to more than one group (class)
 e.g., a person at a university can be both an enrolled student and an employee of the university, or 'border' points

 Fuzzy clustering
– In a fuzzy clustering, every object belongs to every cluster with a
membership weight that is between 0 (absolutely doesn't belong) and 1
(absolutely belongs).
– In fuzzy clustering, we often impose the additional constraint that the sum
of the weights for each object must equal 1



Complete versus Partial
– A complete clustering assigns every object to a cluster
– In a partial clustering, some objects in a data set may not belong to well-defined groups; such objects often represent noise or outliers



Types of Clusters

– Well-separated clusters

– Prototype-Based clusters

– Graph-Based clusters

– Density-based clusters

– Shared-Property (Conceptual Clusters)



Types of Clusters: Well-Separated
 Well-Separated Clusters:
– A cluster is a set of objects in which each object is closer (or more
similar) to every other object in the cluster than to any object not in the
cluster.
– Well-separated clusters do not need to be globular, but can have any
shape



Types of Clusters: Prototype-Based
 Prototype-Based
– Cluster is a set of objects in which each object is closer (more similar) to
the prototype that defines the cluster than to the prototype of any other
cluster.
– For data with continuous attributes, the prototype of a cluster is often a
centroid, i.e., the average (mean) of all the points in the cluster.



Types of Clusters: Graph-Based
 Graph-Based: If the data is represented as a graph, where the nodes are
objects and the links represent connections among objects then a cluster can be
defined as a connected component.
 An example of graph-based clusters are contiguity-based clusters, where two
objects are connected only if they are within a specified distance of each other.
 Contiguity-based clusters are useful when clusters are irregular, but the approach can have trouble when noise is present: a small bridge of points can merge two distinct clusters.



Types of Clusters: Density-Based
 Density-based
– A cluster is a dense region of objects that is separated from other regions of high density by regions of low density.
– Used when the clusters are irregular and when noise and
outliers are present.



Shared-Property (Conceptual Clusters)

 Shared-Property (Conceptual Clusters)


– Finds clusters that share some common property or represent a
particular concept.
– Example: a triangular area (cluster) adjacent to a rectangular one, and two intertwined (closely linked) circles (clusters).



Applications of cluster analysis
Information Retrieval
Group the search results into a small number of clusters, each of which captures a particular aspect of the query.
–For instance, a query of “movie” might return web pages grouped into categories such as reviews, trailers, stars, and
theatres.

Business
Businesses collect large amounts of information about current and potential customers.

Climate
Find patterns in atmospheric pressure and ocean temperature that have a significant impact on Earth’s climate.



Applications of cluster analysis
Psychology and Medicine

Clustering has been used to identify different types of depression, such as major and persistent depression.

Biology
–Creating a taxonomy (hierarchical classification) of all living things
–Finding groups of genes that have similar functions



Thank you

UNIT-V

Cluster Analysis

Lecture No:2

Dr T V MADHUSUDHANA RAO

Contents

Data Objects and Attribute Types


Proximity measures

Data Objects
Data sets are made up of data objects.
A data object represents an entity.

Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses

Also called samples, examples, instances, data points, objects, or tuples.


Data objects are described by attributes.

What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a
data object.

The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature.
The term dimension is commonly used in data warehousing.
Machine learning literature tends to use the term feature, while statisticians prefer the term variable.
Data mining and database professionals commonly use the term attribute.

Ex.: Attributes describing a customer object can include customer ID, name, and address.

Attribute Types
The type of an attribute is determined by the set of possible values the attribute can have:
Nominal Attributes

Binary Attributes

Ordinal Attributes

Numeric: quantitative

 Interval-scaled

 Ratio-scaled

Proximity measures
Measuring Data Similarity and Dissimilarity

A cluster is a collection of data objects such that the objects within a cluster are
similar to one another and dissimilar to the objects in other clusters.
– The higher the similarity value, the greater the similarity between objects
– The higher the dissimilarity value, the more dissimilar the two objects are.

– Distances are normally used to measure the similarity or dissimilarity between two data objects
– Proximity refers to a similarity or dissimilarity

Data Matrix versus Dissimilarity Matrix

Data matrix (or object-by-attribute structure):


This structure stores the n data objects in the form of a relational table,

or n-by-p matrix (n objects x p attributes):

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Data Matrix versus Dissimilarity Matrix

Dissimilarity matrix (or object-by-object structure):


This structure stores a collection of proximities that are available for all pairs of n objects.

It is often represented by an n-by-n table:

 0 
 d(2,1) 0 
 
 d(3,1) d ( 3,2) 0 
 
 : : : 
d ( n,1) d ( n,2) ... ... 0

where d(i, j) is the measured dissimilarity or “difference” between objects i and j.

Measures of similarity can often be expressed as a function of measures of dissimilarity
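As a concrete illustration, here is a minimal Python sketch (illustrative names; Euclidean distance stands in for whatever dissimilarity is appropriate) that derives the n-by-n dissimilarity matrix from an n-by-p data matrix:

import math

def euclidean(x, y):
    # Dissimilarity between two p-dimensional objects.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data):
    # data: a list of n objects, each a sequence of p numeric attribute values.
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # d(i, i) = 0 and d(i, j) = d(j, i), so only the lower
            # triangle really needs to be computed and stored.
            d[i][j] = d[j][i] = euclidean(data[i], data[j])
    return d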

Proximity Measure for Nominal Attributes

A nominal attribute can take on two or more states


Ex.: Map color is a nominal attribute that may have, say, five
states: red, yellow, green, pink, and blue.
 Simple matching method:
– The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
\( d(i, j) = \frac{p - m}{p} \)
– where m is the number of matches (i.e., the number of attributes for which i and j are in the same state), and
– p is the total number of attributes describing the objects.
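A small Python sketch of the simple matching method (the function name is illustrative):

def nominal_dissimilarity(obj_i, obj_j):
    # obj_i, obj_j: equal-length sequences of nominal attribute values.
    p = len(obj_i)                                 # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching states
    return (p - m) / p

# e.g., nominal_dissimilarity(('red', 'A'), ('red', 'B')) == 0.5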

Proximity Measure for Nominal Attributes

Simple matching method: \( d(i, j) = \frac{p - m}{p} \), where m is the number of matches and p is the total number of attributes.

[Example table of objects and the resulting dissimilarity matrix shown on the slide]
– Since here we have one nominal attribute, test-1, we set p = 1 in the equation,
– so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
From this, we see that all objects are dissimilar except objects 1 and 4 (i.e., d(4, 1) = 0).
Proximity Measures for Binary Attributes

 A contingency table for binary data counts, over the attributes of objects i and j:
q = number of attributes where i = 1 and j = 1
r = number of attributes where i = 1 and j = 0
s = number of attributes where i = 0 and j = 1
t = number of attributes where i = 0 and j = 0

 Distance measure for symmetric binary variables:
\( d(i, j) = \frac{r + s}{q + r + s + t} \)

 Distance measure for asymmetric binary variables (0-0 matches carry no information and are dropped):
\( d(i, j) = \frac{r + s}{q + r + s} \)

 Jaccard coefficient (similarity measure for asymmetric binary variables):
\( \mathrm{sim}(i, j) = \frac{q}{q + r + s} \)

Proximity Measures for Binary Attributes
Example: Relational Table Where Patients Are Described by Binary Attributes
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

Gender is a symmetric attribute


The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0

0 1
d ( jack , mary )   0.33
2  0 1
11
d ( jack , jim)   0.67
111
1 2
d ( jim, mary )   0.75
11 2

Of the three patients, Jack and Mary are the most likely to have a
similar disease.
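These numbers can be reproduced with a short Python sketch (an illustration, not code from the slides; Y and P are mapped to 1, N to 0, and only the six asymmetric attributes are used):

def asymmetric_binary_d(i, j):
    # i, j: 0/1 sequences over the asymmetric binary attributes.
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))  # 1-0 mismatches
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))  # 0-1 mismatches
    return (r + s) / (q + r + s)                      # 0-0 matches (t) are ignored

jack = (1, 0, 1, 0, 0, 0)   # Fever, Cough, Test-1 .. Test-4
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_d(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_d(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_d(jim, mary), 2))   # 0.75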

Dissimilarity of Numeric Data: Minkowski Distance

Minkowski distance: a popular distance measure
\( d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h} \)
where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order

Special Cases of Minkowski Distance

h = 1: Manhattan (city block, L1 norm) distance
\( d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \)

h = 2: Euclidean (L2 norm) distance
\( d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \)

h → ∞: "supremum" (Lmax norm, L∞ norm) distance
– This is the maximum difference between any component (attribute) of the vectors:
\( d(i, j) = \max_{f} |x_{if} - x_{jf}| \)
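The three cases can be illustrated with a brief Python sketch (illustrative values):

def minkowski(x, y, h):
    # Minkowski distance of order h between two p-dimensional points.
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    # Limit h -> infinity: the largest per-attribute difference.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 7), (4, 3)
print(minkowski(x, y, 1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5.0
print(supremum(x, y))      # Supremum: max(3, 4) = 4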

Cosine Similarity

Cosine similarity is a measure of similarity that can be used to compare documents or, say, to rank documents with respect to a given vector of query words.

Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d

Cosine Similarity

Example: find the similarity between documents 1 and 2.

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
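The worked example can be verified with a few lines of Python (a sketch for checking the arithmetic only):

import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))  # vector dot product d1 . d2
    len1 = math.sqrt(sum(a * a for a in d1))  # ||d1||
    len2 = math.sqrt(sum(b * b for b in d2))  # ||d2||
    return dot / (len1 * len2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))  # 0.94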
Thank you

UNIT-V

Cluster Analysis

Lecture No:3

Dr T V MADHUSUDHANA RAO

Contents

Clustering Algorithms
K-means and its variants
K-means: Additional Issues

Clustering Algorithms

 Partitioning clustering
– K-means and its variants
– K-medoid
 Hierarchical clustering
– Agglomerative (Bottom-Up)
– Divisive (Top-Down)
 Density-based clustering
– DBSCAN

K-means
– Prototype-based clustering techniques create a one-level partitioning of the data objects.
– The most prominent techniques are K-means and K-medoid.
– K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points.
– K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points; it requires only a proximity measure for a pair of objects.
– A centroid almost never corresponds to an actual data point, whereas a medoid, by definition, must be an actual data point.
– K-means is one of the oldest and most widely used clustering algorithms.

The Basic K-means Algorithm:

Step 1: Choose the number of clusters k.
Step 2: Select k random points from the data as centroids.
Step 3: Assign all the points to the closest cluster centroid.
Step 4: Recompute the centroids of the newly formed clusters.
Step 5: Repeat steps 3 and 4.

[Figure: the five steps illustrated on given data]
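A minimal pure-Python sketch of these five steps (an illustration only, stopping when the centroids no longer change; math.dist requires Python 3.8+, and production implementations add smarter initialization and vectorization):

import math
import random

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)          # Step 2: k random points as centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):                    # Step 5: repeat steps 3 and 4
        clusters = [[] for _ in range(k)]
        for p in points:                          # Step 3: assign to closest centroid
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = [                         # Step 4: recompute centroids
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:            # converged: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters

Here points is a list of numeric tuples, e.g. kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], 2).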

K-means Clustering
[Figure: K-means iterations on sample data]
K-means Clustering – Details
 Initial centroids are often chosen randomly.
 The centroid is (typically) the mean of the points in the cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
 Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

Evaluating K-means Clusters
 Most common measure is Sum of Squared Error (SSE)-Objective Function

– For each point, the error is the distance to the nearest cluster.
– To get SSE, we square these errors and sum them:

\( SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(c_i, x) \)

– x is a data point in cluster Ci and ci is the representative point for cluster Ci.
– It can be shown that ci corresponds to the center (mean) of the cluster.
– Given two clusterings, we can choose the one with the smaller error.
– One easy way to reduce SSE is to increase K, i.e., the number of clusters.
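Assuming the kmeans sketch given earlier, SSE can be computed with a small helper (hypothetical, matching the formula above):

import math

def sse(clusters, centroids):
    # Sum of squared distances from each point to its cluster's centroid.
    return sum(
        math.dist(p, c) ** 2
        for cl, c in zip(clusters, centroids)
        for p in cl
    )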

Choosing Initial Centroids
[Figure: two different K-means clusterings of the same points, produced by different choices of initial centroids]

Importance of Choosing Initial Centroids

Randomly selected initial centroids may be poor.

In the figure, even though the initial centroids seem to be better distributed, we obtain a suboptimal clustering, with higher squared error.

K-means: Additional Issues

 Handling Empty Clusters


 Outliers
 Reducing the SSE with Post-processing
 Updating Centroids Incrementally

Handling Empty Clusters and Outliers

 Handling Empty Clusters


– Basic K-means algorithm can yield empty clusters
– Empty clusters can be obtained if no points are allocated to a
cluster during the assignment step
– If this happens, then a strategy is needed to choose a replacement
centroid
 Outliers
– When outliers are present, the resulting SSE will be higher.
– Because of this, it is often useful to discover outliers and eliminate
them beforehand

Updating Centers Incrementally
 In the basic K-means algorithm, centroids are updated after all points are
assigned to a centroid

 An alternative is to update the centroids after each assignment (incremental approach)
– More expensive
– Never get an empty cluster

Pre-processing and Post-processing

 Pre-processing
– Normalize the data
– Eliminate outliers

 Post-processing
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE
– Eliminate small clusters that may represent outliers

Thank you

UNIT-V

Cluster Analysis

Lecture No:4

Dr T V MADHUSUDHANA RAO

Contents

Bisecting K-means
Limitations of K-means
Strengths and Weaknesses of K-means

Bisecting K-means
 It is a straightforward extension of the basic K-means algorithm.
 To obtain K clusters: split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.

Bisecting K-means Examples
[Figures: two examples of bisecting K-means, showing the sequence of splits on sample data]
Bisecting K-means Algorithm
1. Initialize the list of clusters to contain the cluster consisting of all points.
2. repeat
3.   Remove a cluster from the list of clusters.
4.   {Perform several "trial" bisections of the chosen cluster.}
5.   for i = 1 to number of trials do
6.     Bisect the selected cluster using basic K-means.
7.   end for
8.   Select the two clusters from the bisection with the lowest total SSE.
9.   Add these two clusters to the list of clusters.
10. until the list of clusters contains K clusters.
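A compact Python sketch of the algorithm, reusing the kmeans and sse sketches from the earlier lecture (illustrative; splitting the largest cluster and using five trials are assumptions, and choosing the cluster with the highest SSE instead is equally valid):

def bisecting_kmeans(points, k, trials=5):
    clusters = [list(points)]                 # start with one all-inclusive cluster
    while len(clusters) < k:
        target = max(clusters, key=len)       # choose a cluster to split (largest here)
        clusters.remove(target)
        best = None
        for _ in range(trials):               # several trial bisections
            cents, split = kmeans(target, 2)
            total = sse(split, cents)
            if best is None or total < best[0]:
                best = (total, split)         # keep the bisection with the lowest SSE
        clusters.extend(best[1])
    return clusters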
K-means and Different Types of Clusters

 Limitations of K-means
 K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

 K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters: K-means then finds pieces of the natural clusters, which must be put back together in a post-processing step.

Overcoming K-means Limitations

Original Points K-means Clusters

Overcoming K-means Limitations

Original Points K-means Clusters

Strengths and Weaknesses of K-means

Strengths of K-means:

 K-means is simple and can be used for a wide variety of data types.
 It is also quite efficient, even though multiple runs are often performed.
 Some variants, including bisecting K-means, are even more efficient,
and are less susceptible to initialization problems.

Strengths and Weaknesses of K-means

Weaknesses of K-means:

 It cannot handle non-globular clusters or clusters of different sizes and densities, although it can typically find pure subclusters if a large enough number of clusters is specified.
 K-means also has trouble clustering data that contains outliers. Outlier
detection and removal can help significantly in such situations.
 Finally, K-means is restricted to data for which there is a notion of a center
(centroid).

Thank you

UNIT-V

Cluster Analysis

Lecture No:5

Dr T V MADHUSUDHANA RAO

Contents

Hierarchical Clustering – Introduction and Types


Inter-Cluster Similarity
Agglomerative Hierarchical Clustering Algorithm
Divisive Hierarchical Clustering Algorithm
Strengths and Weaknesses of Hierarchical Clustering

Hierarchical Clustering
 Produces a set of nested clusters organized as a hierarchical tree
 Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or splits
– Displays both the cluster-subcluster relationships and the order in which the clusters were merged

[Figure: nested clusters of six points and the corresponding dendrogram]

Types of Hierarchical Clustering
 Two main types of hierarchical clustering
– Agglomerative:
Starting with individual points as clusters
Successively merge the two closest clusters
until only one cluster remains
– Divisive:
 Start with one, all-inclusive cluster
 At each step, split a cluster until only singleton
clusters of individual points remain

 Traditional hierarchical algorithms use a similarity or distance matrix

How to Define Inter-Cluster Similarity
[Figure: two groups of points and their proximity matrix; how should inter-cluster similarity be defined?]

 MIN or Single Link
 MAX or Complete Linkage
 Group Average
 Distance Between Centroids
 Other methods driven by an objective function
– Ward's Method uses squared error

How to Define Inter-Cluster Similarity
 MIN or Single Link:
– The proximity of two clusters is defined as the minimum of the distance between any two points in the two different clusters.

How to Define Inter-Cluster Similarity
 MAX or Complete Linkage:
– The proximity of two clusters is defined as the maximum of the distance between any two points in the two different clusters.

How to Define Inter-Cluster Similarity
 Group Average:
– The proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters.

How to Define Inter-Cluster Similarity
 Distance Between Centroids:
– Calculate the proximity between two clusters by calculating the distance between the centroids of the clusters.

Hierarchical Clustering: MIN or Single Link

The proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters.
Determined by one pair of points, i.e., by one link in the proximity graph.

[Figures: step-by-step single-link merging of sample points and the resulting dendrogram]
Cluster Similarity: MAX or Complete Linkage or CLIQUE

The proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters.
(The name CLIQUE reflects that, in graph terms, every pair of points in the merged cluster must be linked: the cluster forms a clique.)
Determined by all pairs of points in the two clusters.

[Figures: step-by-step complete-link merging of sample points and the resulting dendrogram]
Cluster Similarity: Group Average
 The proximity of two clusters is defined as the average pairwise proximity among
all pairs of points in the different clusters.
 This is an intermediate approach between the single and complete link
approaches.

Cluster Similarity: Centroid and Ward's Method
 Centroid methods : calculate the proximity between two clusters by
calculating the distance between the centroids of clusters.
– These techniques may seem similar to K-means

 Ward’s Method : The proximity between two clusters is defined as


the increase in the squared error that results when two clusters are
merged
– Similar to group average if distance between points is distance squared

Cluster Similarity: Ward’s Method

Department ofKumar
© Tan,Steinbach, Electronics and Computer
Introduction Engineering
to Data Mining 4/18/2004 101
Hierarchical Clustering: Comparison

[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method, showing the different groupings each scheme produces]

Basic Agglomerative Hierarchical Clustering Algorithm
 The more popular hierarchical clustering technique:
1. Compute the proximity matrix, if necessary.
2. repeat
3.   Merge the closest two clusters.
4.   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5. until only one cluster remains.
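A naive Python sketch of this loop using MIN (single link) as the cluster proximity (illustrative; a real implementation maintains and updates the proximity matrix instead of rescanning all pairs on every merge):

import math
from itertools import combinations

def agglomerative_single_link(points, k):
    clusters = [[p] for p in points]          # start: every point is its own cluster
    while len(clusters) > k:                  # merge until k clusters remain
        # find the pair of clusters with the smallest single-link distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                math.dist(a, b)
                for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters.pop(j))   # merge the two closest clusters
    return clusters

MAX or group-average versions differ only in how the between-cluster distance inside the key is computed (max or mean instead of min).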

MST: Divisive Hierarchical Clustering
 Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that one point
(p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
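This construction is essentially Prim's algorithm; a small Python sketch (illustrative, with a brute-force search for the closest pair):

import math

def build_mst(points):
    # Returns the MST as a list of ((p, q), distance) edges, grown one point at a time.
    in_tree = [points[0]]                     # start with a tree containing any one point
    out = set(range(1, len(points)))
    edges = []
    while out:
        # closest pair (p, q): p already in the tree, q outside it
        p, qi = min(
            ((tp, i) for tp in in_tree for i in out),
            key=lambda pair: math.dist(pair[0], points[pair[1]]),
        )
        q = points[qi]
        edges.append(((p, q), math.dist(p, q)))  # add q and the edge (p, q)
        in_tree.append(q)
        out.remove(qi)
    return edges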

MST: Divisive Hierarchical Clustering
 Use the MST for constructing a hierarchy of clusters: repeatedly break the longest remaining edge of the MST, each break splitting one cluster in two.

Strengths and Weaknesses of Hierarchical Clustering
Strengths of Hierarchical Clustering

 Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

 Obtain meaningful taxonomies
– Example in biological sciences: animals can be classified into two main groups, vertebrates and invertebrates.

Strengths and Weaknesses of Hierarchical Clustering
Weaknesses of Hierarchical Clustering

 No objective function is directly minimized

 Different schemes have problems with one or more of the following:


– Sensitivity to noise and outliers
– Difficulty handling different sized clusters
– Breaking large clusters

Thank you

UNIT-V

Cluster Analysis

Lecture No:6

Dr T V MADHUSUDHANA RAO

Contents

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN Algorithm
Strengths and Weaknesses of DBSCAN

DBSCAN (Density-based spatial clustering of
applications with noise)

Density = number of points within a specified radius (Eps)

Density-based clustering locates regions of high density that are separated from
one another by regions of low density.

DBSCAN (Density-based spatial clustering of
applications with noise)
Traditional Density: Center-Based Approach
In the center-based approach, density is estimated for a particular point in the data set by counting the number of points within a specified radius, Eps, of that point.
For example, the number of points within a radius of Eps of point A is 7, including A itself.

Classification of Points According to Center-Based Density


(1) Core point
(2) Border point
(3) Noise point

DBSCAN (Density-based spatial clustering of
applications with noise)
Core points: A point is a core point if it has more than a specified
number of points (MinPts) within Eps
 These are points that are at the interior of a cluster
 In Figure, point A is a core point, for the indicated radius
(Eps) if MinPts <= 7.

Input parameters:
MinPts
Eps

DBSCAN (Density-based spatial clustering of
applications with noise)
Border points : A border point is not a core point, but falls within the
neighborhood of a core point.
In Figure, point B is a border point.

Noise points: A noise point is any point that is neither a core point nor a
border point. In Figure, point C is a noise point.

DBSCAN: Core, Border, and Noise Points
[Figure: sample points labeled as core, border, and noise points for a given Eps and MinPts]
DBSCAN Algorithm
1. Label all points as core, border, or noise points.
2. Eliminate the noise points.
3. Put an edge between all core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points.

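A compact Python sketch of these steps (illustrative; it uses a brute-force O(n^2) neighborhood search, whereas practical implementations rely on spatial indexes):

import math

def dbscan(points, eps, min_pts):
    n = len(points)
    # Eps-neighborhood of each point (includes the point itself)
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}   # step 1: core points
    labels = [None] * n                                       # None = noise (for now)
    cluster = 0
    for c in core:
        if labels[c] is not None:
            continue
        cluster += 1                          # grow a new cluster from this core point
        stack = [c]
        while stack:
            i = stack.pop()
            if labels[i] is None:
                labels[i] = cluster           # border points get labeled here too
                if i in core:                 # only core points expand the cluster
                    stack.extend(j for j in nbrs[i] if labels[j] is None)
    return labels                             # points still labeled None are noise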
DBSCAN: Core, Border and Noise Points

Original Points Eps = 10, MinPts = 4

Strengths and Weaknesses of DBSCAN
Strengths of DBSCAN

Original Points Clusters

• Resistant to Noise
• Can handle clusters of different shapes and sizes

Strengths and Weaknesses of DBSCAN
Weaknesses of DBSCAN

[Figure: original points and the DBSCAN clusterings found with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]

• Has trouble when clusters have widely varying densities
• With high-dimensional data, density is more difficult to define
• DBSCAN can be expensive in the case of high-dimensional data
Thank you

