DMDW Unit 5 PPT Cluster Analysis 06.01.2021
Cluster Analysis
Lecture No:1
Dr T V MADHUSUDHANA RAO
Introduction
Types of Clustering
Types of clusters
Applications
Intra-cluster distances are minimized; inter-cluster distances are maximized.
[Figure: original points, a partitional clustering of those points, and the corresponding dendrogram with merge heights]
Fuzzy clustering
– In fuzzy clustering, every object belongs to every cluster with a membership weight between 0 (absolutely doesn't belong) and 1 (absolutely belongs).
– We often impose the additional constraint that the weights for each object must sum to 1.
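As a toy illustration of weights that sum to 1 (inverse-distance normalization is just one simple scheme, not necessarily the one the slides have in mind; all values below are made up):

import numpy as np

# Illustrative cluster centers and one data point (made-up values).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
point = np.array([1.0, 1.0])

# Inverse-distance weights: nearer centers receive larger membership;
# normalization makes the weights sum to 1 (assumes point != any center).
dists = np.linalg.norm(centers - point, axis=1)
weights = (1.0 / dists) / np.sum(1.0 / dists)
print(weights, weights.sum())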
Types of clusters:
– Well-separated clusters
– Prototype-based clusters
– Graph-based clusters
– Density-based clusters
(Note: a small bridge of points can merge two distinct clusters.)
Business
Businesses collect large amounts of information about current and potential customers.
Climate
Find patterns in atmospheric pressure and ocean temperature that have a significant impact on Earth’s climate.
Biology
–Creating a taxonomy (hierarchical classification) of all living things
–Finding groups of genes that have similar functions
UNIT-V
Cluster Analysis
Lecture No:2
Dr T V MADHUSUDHANA RAO
Contents
Data Objects
Attribute Types
Proximity Measures
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a
data object.
The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
The term dimension is commonly used in data warehousing.
Machine learning literature tends to use the term feature, while statisticians prefer the term variable.
Data mining and database professionals commonly use the term attribute.
Example: attributes describing a customer object can include customer ID, name, and address.
Attribute Types
The type of an attribute is determined by the set of possible values it can take:
– Nominal attributes
– Binary attributes
– Ordinal attributes
– Numeric (quantitative) attributes:
  Interval-scaled
  Ratio-scaled
Proximity measures
Measuring Data Similarity and Dissimilarity
A cluster is a collection of data objects such that the objects within a cluster are
similar to one another and dissimilar to the objects in other clusters.
– The higher the similarity value, the greater the similarity between objects
– The higher the dissimilarity value, the more dissimilar the two objects are.
Data Matrix versus Dissimilarity Matrix
The dissimilarity matrix stores the pairwise dissimilarities d(i, j) for all n objects as an n-by-n lower-triangular matrix:

$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
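A hedged sketch of building such a matrix in Python with SciPy (the three objects are made-up values):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],      # data matrix: n = 3 objects, p = 2 attributes
              [3.0, 4.0],
              [0.0, 1.0]])
D = squareform(pdist(X, metric='euclidean'))  # n-by-n dissimilarity matrix, zero diagonal
print(D)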
Proximity Measure for Nominal Attributes
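A common measure for nominal attributes is simple matching:

$d(i, j) = \frac{p - m}{p}$

where p is the total number of attributes and m is the number of attributes on which objects i and j have the same state; similarity can be taken as $sim(i, j) = 1 - d(i, j)$.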
Proximity Measures for Binary Attributes
Example: Relational Table Where Patients Are Described by Binary Attributes
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
Of the three patients, Jack and Mary are the most likely to have a
similar disease.
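A minimal Python sketch reproducing these numbers, assuming the usual asymmetric-binary convention (Y/P coded as 1, N as 0; the symmetric gender attribute is excluded):

def asym_binary_dissim(a, b):
    # a, b: tuples of 0/1 values for asymmetric binary attributes
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1-1 matches
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1-0 mismatches
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 0-1 mismatches
    return (r + s) / (q + r + s)  # 0-0 matches are considered unimportant

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P -> 1, N -> 0)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75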
Dissimilarity of Numeric Data: Minkowski Distance

$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h ≥ 1 is the order.
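A quick numeric check with SciPy (the vectors are illustrative):

from scipy.spatial.distance import minkowski

i = [1.0, 2.0, 3.0]   # illustrative vectors
j = [4.0, 6.0, 3.0]
print(minkowski(i, j, p=1))   # h = 1, Manhattan distance: 7.0
print(minkowski(i, j, p=2))   # h = 2, Euclidean distance: 5.0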
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance

$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

h = 2: Euclidean (L2 norm) distance

$d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }$
Cosine Similarity
Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

$\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\|d_1\|\,\|d_2\|}$
Cosine Similarity
Example:
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
‖d1‖ = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^{1/2} = √42 ≈ 6.481
‖d2‖ = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^{1/2} = √17 ≈ 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
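The same computation in Python with NumPy:

import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)
cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94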
Thank you
UNIT-V
Cluster Analysis
Lecture No:3
Dr T V MADHUSUDHANA RAO
Contents
Clustering Algorithms
K-means and its variants
K-means: Additional Issues
Clustering Algorithms
Partitioning clustering
– K-means and its variants
– K-medoid
Hierarchical clustering
– Agglomerative (Bottom-Up)
– Divisive (Top-Down)
Density-based clustering
– DBSCAN
K-means
– Prototype-based clustering techniques create a one-level partitioning of the data objects.
– The most prominent techniques are K-means and K-medoid.
– K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points.
– K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points; it requires only a proximity measure for a pair of objects.
– A centroid almost never corresponds to an actual data point, while a medoid, by definition, must be an actual data point.
– K-means is one of the oldest and most widely used clustering algorithms.
The Basic K-means Algorithm:
Step 1: Choose the number of clusters k.
Step 2: Select k random points from the data as the initial centroids.
Step 3: Assign all points to the closest cluster centroid.
Step 4: Recompute the centroids of the newly formed clusters.
Step 5: Repeat steps 3 and 4.
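A minimal NumPy sketch of these five steps (the toy data and the choice k = 2 are illustrative assumptions):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: select k random points from the data as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):                        # Step 5: repeat steps 3 and 4.
        # Step 3: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster happens to become empty).
        new_centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                  else centroids[c] for c in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # toy data
labels, centroids = kmeans(X, k=2)                 # Step 1: choose k up front.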
K-means Clustering – Details
Initial centroids are often chosen randomly.
The centroid is (typically) the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE), used as the objective function:

$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$

– For each point, the error is the distance to the nearest centroid.
– To get SSE, we square these errors and sum them.
– Given two clusterings, we prefer the one with the smaller error.
– One easy way to reduce SSE is to increase K, i.e., the number of clusters.
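A small helper computing SSE for a given assignment (names follow the K-means sketch above):

import numpy as np

def sse(X, labels, centroids):
    # Sum, over all clusters, of squared distances from points to their centroid.
    return sum(float(np.sum((X[labels == c] - centroids[c]) ** 2))
               for c in range(len(centroids)))

# e.g. sse(X, labels, centroids) with the arrays from the K-means sketch above.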
Choosing Initial Centroids
[Figure: two different K-means clusterings of the same data, produced by different initial centroids]
Importance of Choosing Initial Centroids
In the figure, even though the initial centroids seem to be better distributed, we obtain a suboptimal clustering with higher squared error.
K-means: Additional Issues
Handling Empty Clusters and Outliers
If a cluster becomes empty during assignment, a common remedy is to choose a replacement centroid, e.g., the point farthest from any current centroid, or a point from the cluster with the highest SSE. Outliers can unduly influence the centroids, so they are often identified and removed in a pre-processing step.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated only after all points have been assigned to centroids. An incremental alternative updates the affected centroid immediately after each individual point assignment.
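A hedged sketch of the incremental alternative, using the standard online-mean update (the update rule is a well-known identity, not taken from the slide):

import numpy as np

def incremental_update(centroid, count, x):
    # Online mean: move the centroid toward x after one more assignment.
    count += 1
    return centroid + (x - centroid) / count, count

c, n = np.array([0.0, 0.0]), 0
for point in (np.array([2.0, 0.0]), np.array([0.0, 2.0])):
    c, n = incremental_update(c, n, point)
print(c)   # [1. 1.], the mean of the two points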
Pre-processing and Post-processing
Pre-processing
– Normalize the data
– Eliminate outliers
Post-processing
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE
– Eliminate small clusters that may represent outliers
Thank you
UNIT-V
Cluster Analysis
Lecture No:4
Dr T V MADHUSUDHANA RAO
Contents
Bisecting K-means
Limitations of K-means
Strengths and Weaknesses of K-means
Bisecting K-means
Bisecting K-means is a straightforward extension of the basic K-means algorithm: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced (see the sketch below).
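A compact sketch of this loop; it reuses the kmeans function from the earlier slide, and splitting the cluster with the largest SSE is one common selection rule (an assumption, since the slide does not specify which cluster to split):

import numpy as np

def bisecting_kmeans(X, K):
    clusters = [X]
    while len(clusters) < K:
        # Pick the cluster with the largest SSE to split (one common criterion).
        sses = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        labels, _ = kmeans(target, 2)   # bisect with 2-means (function from the earlier slide)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters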
Bisecting K-means: Examples and Algorithm
[Figures: Example 1, Example 2, and the algorithm illustration]
K-means and Different Types of Clusters
Limitations of K-means
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes
Limitations of K-means: Differing Sizes
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Overcoming K-means Limitations
One solution is to use many clusters: find parts of the natural clusters first, then put them together in a post-processing step.
[Figures: overcoming the size, density, and shape limitations by using more clusters]
Strengths and Weaknesses of K-means
Strengths of K-means:
K-means is simple and can be used for a wide variety of data types.
It is also quite efficient, even though multiple runs are often performed.
Some variants, including bisecting K-means, are even more efficient,
and are less susceptible to initialization problems.
Strengths and Weaknesses of K-means
Weaknesses of K-means:
K-means has difficulty with clusters of differing sizes, densities, and non-globular shapes, and its results can be distorted by outliers. It is also restricted to data for which a notion of a center (centroid) makes sense.
Thank you
UNIT-V
Cluster Analysis
Lecture No:5
Dr T V MADHUSUDHANA RAO
Contents
Hierarchical Clustering
Agglomerative and Divisive Clustering
Strengths and Weaknesses of Hierarchical Clustering
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits
– Displays the cluster-subcluster relationships and the order in which the clusters were merged or split
[Figure: nested clusters and the corresponding dendrogram]
Types of Hierarchical Clustering
Two main types of hierarchical clustering:
– Agglomerative: start with the individual points as clusters and successively merge the two closest clusters until only one cluster remains.
– Divisive: start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain.
(An agglomerative example in SciPy follows.)
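For illustration, SciPy implements the agglomerative side directly (toy data; cutting the dendrogram yields any desired number of clusters):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(10, 2)                        # toy points
Z = linkage(X, method='single')                  # merge history; try 'complete', 'average', 'ward'
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
# dendrogram(Z) draws the tree (requires matplotlib).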
How to Define Inter-Cluster Similarity
[Figure: two clusters and their proximity matrix]
– MIN or Single Link
– MAX or Complete Linkage
– Group Average
– Distance Between Centroids
– Other methods driven by an objective function (e.g., Ward's Method uses squared error)
How to Define Inter-Cluster Similarity
MAX or Complete Linkage: the proximity of two clusters is defined as the maximum of the distance between any two points in the two different clusters.
[Figure: two clusters and their proximity matrix]
How to Define Inter-Cluster Similarity
Group Average: the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters.
[Figure: two clusters and their proximity matrix]
How to Define Inter-Cluster Similarity
Distance Between Centroids: calculate the proximity between two clusters as the distance between the centroids of the clusters.
[Figure: two clusters and their proximity matrix]
Hierarchical Clustering: MIN or Single Link
The proximity of two clusters is defined as the minimum of the distance (maximum
of the similarity) between any two points in the two different clusters.
Determined by one pair of points, i.e., by one link in the proximity graph.
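In symbols: $d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$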
Cluster Similarity: MAX or Complete Linkage or CLIQUE
The proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. (The name CLIQUE here comes from graph theory: under complete link, a merged cluster corresponds to a clique in the proximity graph.)
Determined by all pairs of points in the two clusters.
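In symbols: $d_{\max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$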
Cluster Similarity: Group Average
The proximity of two clusters is defined as the average pairwise proximity among
all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link
approaches.
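In symbols: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$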
Cluster Similarity: Centroid and Ward's Method
Centroid methods : calculate the proximity between two clusters by
calculating the distance between the centroids of clusters.
– These techniques may seem similar to K-means
Cluster Similarity: Ward's Method
Ward's Method measures the proximity between two clusters as the increase in the squared error (SSE) that results when the two clusters are merged; like K-means, it attempts to minimize the sum of squared distances of points from their cluster centroids.
Hierarchical Clustering: Comparison
[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method]
Basic Agglomerative Hierarchical Clustering Algorithm
Agglomerative clustering is the more popular hierarchical clustering technique. The basic algorithm:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix to reflect the new cluster.
4. Until only a single cluster remains.
MST: Divisive Hierarchical Clustering
Build an MST (Minimum Spanning Tree):
– Start with a tree that consists of any point.
– In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not.
– Add q to the tree and put an edge between p and q.
MST: Divisive Hierarchical Clustering
Use the MST to construct a hierarchy of clusters: repeatedly split by removing the remaining MST edge with the largest distance, as in the sketch below.
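A hedged SciPy sketch of this scheme: build the MST over pairwise distances, then delete the k − 1 heaviest MST edges so the remaining connected pieces form k clusters (names and the exact splitting rule are illustrative):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(X, k):
    D = squareform(pdist(X))                    # pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()    # MST edge weights (0 = no edge)
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    if k > 1:
        # Remove the k-1 heaviest MST edges; the connected pieces are the clusters.
        for idx in np.argsort(weights)[-(k - 1):]:
            i, j = edges[idx]
            mst[i, j] = 0
    _, labels = connected_components(mst, directed=False)
    return labels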
Strengths and Weaknesses of Hierarchical Clustering
Strengths of Hierarchical Clustering:
– No need to assume a particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level.
– The hierarchies may correspond to meaningful taxonomies (e.g., in the biological sciences).
Strengths and Weaknesses of Hierarchical Clustering
Weaknesses of Hierarchical Clustering:
– Once a decision is made to combine two clusters, it cannot be undone.
– No global objective function is directly minimized.
– Sensitive to noise and outliers, and expensive in time and space for large data sets.
Thank you
UNIT-V
Cluster Analysis
Lecture No:6
Dr T V MADHUSUDHANA RAO
Contents
DBSCAN
Strengths and Weaknesses of DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Density-based clustering locates regions of high density that are separated from
one another by regions of low density.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Traditional Density: Center-Based Approach
In the center-based approach, density is estimated for a particular point in the data set by counting the number of points within a specified radius, Eps, of that point.
The number of points within a radius of Eps of point A is 7, including A itself.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Core points: a point is a core point if it has more than a specified number of points (MinPts) within Eps; these points are in the interior of a cluster.
In the figure, point A is a core point for the indicated radius (Eps) if MinPts <= 7.
Input parameters: MinPts and Eps.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Border points: a border point is not a core point, but falls within the neighborhood of a core point. In the figure, point B is a border point.
Noise points: a noise point is any point that is neither a core point nor a border point. In the figure, point C is a noise point.
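A minimal sketch classifying points by these definitions with brute-force neighbor counting (eps and min_pts values are up to the caller; "at least MinPts" is assumed for the core test, matching the example above):

import numpy as np

def classify_points(X, eps, min_pts):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = D <= eps                              # Eps-neighborhoods (each point counts itself)
    core = neighbors.sum(axis=1) >= min_pts           # core: at least MinPts points within Eps
    border = ~core & (neighbors & core).any(axis=1)   # border: non-core but near some core point
    noise = ~core & ~border                           # noise: neither core nor border
    return core, border, noise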
DBSCAN: Core, Border, and Noise Points
[Figure: example data set showing core, border, and noise points]
DBSCAN Algorithm
1. Label all points as core, border, or noise points.
2. Eliminate the noise points.
3. Put an edge between all core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points.
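In practice, scikit-learn's DBSCAN can be used directly (toy data; the eps and min_samples values are illustrative, not the slide's):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 10])  # toy data
db = DBSCAN(eps=0.9, min_samples=4).fit(X)
print(set(db.labels_))   # cluster ids 0, 1, ...; noise points are labeled -1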
Strengths and Weaknesses of DBSCAN
Strengths of DBSCAN
• Resistant to Noise
• Can handle clusters of different shapes and sizes
Strengths and Weaknesses of DBSCAN
Weaknesses of DBSCAN
• Has trouble when clusters have widely varying densities
• High-dimensional data: density is more difficult to define
• Can be expensive for high-dimensional data
[Figure: original points and DBSCAN results with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]
Thank you