KEMBAR78
7 - Chapter 7-Chapter 7 - Density-Based Clustering Methods | PDF | Cluster Analysis | Theoretical Computer Science
0% found this document useful (0 votes)
77 views30 pages

7 - Chapter 7-Chapter 7 - Density-Based Clustering Methods

The document discusses the DBSCAN clustering algorithm. It defines important concepts like density, epsilon, and minimum points. It explains how DBSCAN classifies points as core, border, or noise based on these parameters and the density of surrounding points. The steps of the DBSCAN algorithm are also outlined.

Uploaded by

mazeen naser
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
77 views30 pages

7 - Chapter 7-Chapter 7 - Density-Based Clustering Methods

The document discusses the DBSCAN clustering algorithm. It defines important concepts like density, epsilon, and minimum points. It explains how DBSCAN classifies points as core, border, or noise based on these parameters and the density of surrounding points. The steps of the DBSCAN algorithm are also outlined.

Uploaded by

mazeen naser
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

Chapter 7

Density-Based Spatial Clustering of Applications with


Noise
DBSCAN algorithm

1
• Outline

• What is density?
• Important parameters of the DBSCAN algorithm
• Classification of data points
• Density edge and density connected points
• Steps in the DBSCAN algorithm
• How to determine epsilon and z?
• Noise elimination
• Practical

2
• Concept of Density
• Density-Based Clustering refers to unsupervised learning
methods that identify distinctive groups/clusters in the data, based
on the idea that a cluster in data space is a contiguous region of
high point density, separated from other such clusters by
contiguous regions of low point density.

3
• Why do we need a Density-Based clustering algorithm like
DBSCAN when we already have K-means clustering and
Hierarchical?
• Why do we need DBSCAN Clustering?

• To answer this question we need know the drawback/


Weaknesses of each K- means cluster and Hierarchical

4
• Disadvantages-

• K-Means Clustering Algorithm has the following disadvantages-
• It requires to specify the number of clusters (k) in advance.
• It can not handle noisy data and outliers.
• It is not suitable to identify clusters with non-convex shapes.

5
• Drawback / weaknesses of Hierarchical
• High time complexity, if you use the Single-link method in order to determine
the inter-cluster distance, the results may suffer from the chain-effect.
• initial seeds have a strong impact on the final results
• The order of the data has an impact on the final results
• Very sensitive to outliers

6
• What is density?
• First of all, let’s understand what is density.
• Well from Physics we know that density is just the amount of matter
present in a unit volume. We can easily extend this idea of volume into
higher dimensions or even in a lower dimension.
• For example, we have this region.
• We have some data points in this region. And we have another region of
the same area we have got these many data points here.

7
• the density of the first region is greater than the second region.
Because, there are more data points, more matter in the first region.
DBSCAN uses this concept of density to cluster the dataset. Now to
understand the DBSCAN algorithm clearly, we need to know some
important parameters.

8
• Important parameters of the DBSCAN algorithm

9
• Parameters:
• The DBSCAN algorithm basically requires 2 parameters:
1- Epsilon-eps (ε):
eps (ε): specifies how close points should be to each other to be considered a
part of a cluster.
Epsilon (eps): It is defined as the maximum distance between two points which are
considered as neighboring points as well as can be viewed as the radius around each
point
- It means that if the distance between two points is lower or equal to this value
(eps), these points are considered neighbors.

10
• Neighborhood
• Suppose, this is the point we are considering right now, and let we draw a circle
around this point making this as a center and add a distance Epsilon.
• So, we are gonna say this circle as this point’s neighborhood. So, epsilon is just a
number that represents the radius of the circle around a particular point that we are
going to consider the neighborhood of that point.
• Example : Draw circle with Eps

11
2- minPoints (min_pts) : the minimum number of points to form a dense
region.
For example, if we set the minPoints parameter as 5, then we need at least 5
points to form a dense region.
- The minimum number of points (a threshold) clustered together for a
region to be considered dense.

12
• Density at point p: number of points within a circle of radius Eps

• Dense Region: A circle of radius Eps that contains at least MinPts points

◼ Density = number of points within a specified radius r (Eps)

13
Example : Pick another point and we can do the same thing, this time with a
different set of Epsilon-eps (ε)neighbors (one of them even being the first
point we picked out).

14
• Note :Re do that with big value of Epsilon-eps (ε)

• Please record your notes

15
• Parameter estimation:
• The parameter estimation is a problem for every data mining task.
• To choose good parameters we need to understand how they are used and
have at least a basic previous knowledge about the data set that will be used.

• eps (ε):: if the eps value chosen is too small, a large part of the data will not
be clustered.
• It will be considered outliers because don’t satisfy the number of points to
create a dense region. On the other hand, if the value that was chosen is too
high, clusters will combine / merge and the majority of objects will be in the
same cluster.
• The eps should be chosen based on the distance of the dataset (we can use a
k-distance graph to find it), but in general small eps values are preferable.
– Eps: Maximum radius of the neighbourhood

16
• minPoints: As a general rule, a minimum minPoints can be derived from a
number of dimensions (D) in the data set, as minPoints ≥ D + 1.
• Larger values are usually better for data sets with noise and will form more
significant clusters.
• The minimum value for the minPoints must be 3, but the larger the data set, the
larger the minPoints value that should be chosen.

– MinPts: Minimum number of points in an Eps-neighbourhood of that point

17
• Classification of data points
• Now based on these two parameters i.e., epsilon and min _ samples, we are
first going to classify every point in our dataset into three categories. They are
• Core points
• Boundary points or border points
• Noise points

18
• Core points
• must be greater than or equal to our threshold min_samples
– A point is a core point if it has more than a specified number of points (MinPts) within
Eps
These are points that are at the interior of a cluster

Core Point: The data point x is the core point since it has at least min_pts (n) within
epsilon (eps) distance.

Example : Suppose that we have some epsilon and set the minimum number of points
to (MinPts)= 3

• For figure , we considered the red point a core point

19
• A point is a core point if it has more than a specified number of points
(MinPts) within Eps
• These points belong in a dense region and are at the interior of a cluster

• Core Point (P): The point P is said to be the core point if P has greater than MinPts in an Eps radius
around it. These points always belong to the dense region and are at the interior of a cluster.

• If I say a point as a core point then it must satisfy one condition. The condition is the number of
neighbors must be greater than or equal to our threshold min_samples

20
• Exercises:
Enhance your understand (MinPts) within Eps
• Suppose that we have some epsilon and set the minimum number of points to MinPts =3.
• We will now look at two points of the dataset. On the left, we look at the above point, while
on the right, we look at one of the middle points.

21
• Example 2 Suppose that we have some epsilon and set the minimum number of points to
MinPts =3.

22
• Border point
• If I say one point as a boundary point, then it has to satisfy the following two
conditions.
• The number of neighbors must be less than (MinPts)
• The point should be in the neighborhood of a core point.
• Consider the same figure mentioned above.
• Border Point: The data point y is the border point since it has at least one core
point within epsilon (eps) distance and lower than min_pts (n) within epsilon (eps)
distance from it.

•.

23
• Example 2 Suppose that we have some epsilon and set the minimum number of points
to MinPts =3.

• conditions.
• The number of neighbors must be less than (MinPts)
• The point should be in the neighborhood of a core point
24
• Noise points
• The definition of noise point is very simple. If a point is neither a
core point nor a Border point, then it is called a noise point. In the
above-mentioned figure, point C is neither a core point nor a
boundary point. So, we can say that as a noise point.
• A noise point is any point that is not a core point or a border point
•.

25
• DBSCAN: Core, Border, and Noise Points

26
•,

27
• Explain the DBSCAN Algorithm step by step.
• The major steps followed during the DBSCAN algorithm are as
follows:
• Step-1: Decide the value of the parameters eps and min_pts.
• Step-2: For each data point(x) present in the dataset:

• Compute its distance from all the other data points. If the distance
is less than or equal to the value of epsilon(eps), then consider that
point as a neighbour of x.
• If that data point(x) gets the count of its neighbour greater than or
equal to min_pts, then mark it as a core point or as visited.

28
• Step-3: For each core point, if it is not already assigned to a
cluster then create a new cluster. Further, all the neighbouring
points are recursively determined and are assigned the same
cluster as that of the core point
• Step-4: Repeat the above steps until all the points are visited.

29
• End

30

You might also like