Density Based Clustering:
DBSCAN
Why DBSCAN over K-Means or Hierarchical
clustering?
• In the figure first two columns, it is clear that DBSCAN correctly
identifies clusters in circular and spiral patterns. Whereas, K-Means fails
because it forces clusters into spherical shapes.
• In third column, DBSCAN correctly identifies dense clusters while
marking noise points separately. Other side, K-Means incorrectly assigns
noise points to clusters.
• In forth column, well-separated dense clusters. Both DBSCAN and K-
Means perform well here.
• In fifth column, data are uniformly distributed. DBSCAN classifies
everything as one cluster (no clear separation). K-Means still assigns
multiple clusters based on distance.
When to use DBSCAN Clustering
• Use DBSCAN when:
• Clusters have irregular shapes (e.g., spirals, rings).
• Noise or outliers are present.
• Varying density is expected.
• Use K-Means when:
• Clusters are well-separated and spherical.
• The dataset is large and high-dimensional.
• Noise is minimal.
Density-based spatial clustering of applications
with noise (DBSCAN)
• DBSCAN is a density-based clustering algorithm that groups data points
based on density rather than distance.
• DBSCAN can effectively discover clusters of any arbitrary shape.
• Noise points, which do not meet the density criteria, are labeled as outlier.
• Clusters are usually in high-density regions and outliers tend to be in low-
density regions.
• It requires minimum domain knowledge, can discover clusters of arbitrary
shape.
• In the DBSCAN algorithm, only two key hyperparameters are required to
identify clusters in a dataset: the first parameter, epsilon (ε), defines the
radius of the neighborhood within which points are considered neighbors,
and the second parameter, MinPts (minimum points), specifies the minimum
number of data points required within this radius for a point to be classified
as a core point.
Terminologies
• 1. Epsilon (ε)
• The radius within which DBSCAN searches for neighboring points.
• A point is considered part of a cluster if it has enough neighboring
points within ε.
• Choosing the right ε value is crucial:
• Too small → Many small clusters
• Too large → Merging of distinct clusters.
• 2. Minimum Points (MinPts)
➢The minimum number of points required within the ε-radius for a
point to be considered a core point.
➢Typically set as MinPts ≥ D+1, where D is the dataset's dimensionality.
• 3. Core Point
➢A point that has at least MinPts points (including itself) within its ε-
radius.
➢Forms the center of a cluster.
• 4. Border Point
➢A point that falls within the ε-radius of a core point but does not have
enough neighbors to be a core point itself.
➢Unlike Core Points, Non-Core (border) Points can only join a cluster
and can not be used to extend it further.
• DBSCAN operates on the idea of density-reachable and density-connected.
• 4. Density-reachable
➢A point q is density-reachable from a point p using Eps and MinPts if
there is a chain of points 𝑃1 , . . . , 𝑃𝑛 , 𝑃1 = 𝑝, 𝑃𝑛 = 𝑞 such that 𝑃𝑖+1 is
directly density-reachable from 𝑃𝑖 .
➢Two border points of the same cluster Cluster are possibly not density
reachable from each other because the core point condition might not
hold for both of them.
➢However, there must be a core point in a Cluster from which both border
points of C are density-reachable.
• 5. Density-connected
➢A point ‘p’ is density-connected to a point ‘q’, if there is a point ‘o’ such
that both ‘p’ and ‘q’ are density-reachable from ‘o’ using Eps and MinPts.
➢Intuitively, a cluster is defined to be a set of density-connected points
which is maximal with respected to density-reachable.
Finding Clusters in a 2D Dataset
• We have the following dataset representing points in a 2D space:
• (1, 2), (2, 2), (2, 3), (8, 8), (8, 9), (25, 80), (9, 8)
• We set: ε (radius) = 2 and MinPts = 3
• Now, DBSCAN follows these steps:
1. Identify Core Points
• A point is a core point if at least MinPts = 3 points (including itself) are
within the ε-radius.
• Consider (2,2): Within ε = 2, we find: (1,2), (2,2), (2,3) → Total: 3
points that means point (2, 2) is a (Core point).
• Consider (8,8): Within ε = 2, we find: (8,8), (8,9), (9,8) → Total: 3
points that means point (8, 8) is a (Core point).
2. Identify Border Points
• A border point is within the ε-radius of a core point but does not have
enough neighbors to be a core point itself.
• Points (1,2) and (2,3) are within the ε-radius of core point (2,2), so
they are border points.
3. Identify Noise Points
• Points that do not belong to any cluster (not a core or border point)
are outliers (noise).
• Point (25,80) has no neighbors within ε = 2, so it is classified as noise
or outlier.
How DBSCAN Works
• Set Parameters (ε and minPts): Choose an epsilon (ε) distance and a
minimum points (minPts) count for density.
• Find Neighbor Points: For each point, identify neighboring points
within distance ε.
• Form New Cluster: If a point has at least minPts neighbors, it forms
the core of a new cluster.
• Expand Cluster: Add all density-reachable points to the cluster.
• Label Noise: Points not meeting density requirements are labeled as
noise.
• Q. Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7, 3), F(6, 2),
G(7, 2) and H(8, 4), Find the core points and outliers using DBSCAN.
Take Eps = 2.5 and MinPts = 3.
• Solution:
• Given, Epsilon (Eps) = 2.5
• Minimum Points (MinPts) = 3
• Let’s represent the given data points in tabular form:
• Step 1: To find the core points, outliers and clusters by using DBSCAN,
we need to first calculate the distance among all pairs of given data
point. Let us use Euclidean distance measure for distance calculation.
• The distance matrix as follows:
• In the above table, Distance ≤ Epsilon (i.e. 2.5) is marked red.
• Step 2: Now, finding all the data points that lie in the Eps-neighborhood
of each data points. That is, put all the points in the neighborhood set of
each data point whose distance is <=2.5.
N(A) = {B}; — — — — — — -→ because distance of B is <= 2.5 with A
N(B) = {A, C}; — — — — — → because distance of A and C is <= 2.5 with B
N(C) = {B, D}; — — — — —→ because distance of B and D is <=2.5 with C
N(D) = {C, E, F, G, H}; — → because distance of C, E, F,G and H is <=2.5 with D
N(E) = {D, F, G, H}; — — → because distance of D, F, G and H is <=2.5 with E
N(F) = {D, E, G}; — — — — → because distance of D, E and G is <=2.5 with F
N(G) = {D, E, F, H}; — — -→ because distance of D, E, F and H is <=2.5 with G
N(H) = {D, E, G}; — — — — → because distance of D, E and G is <=2.5 with H
• Data points B, C, D, E, F, G and H have neighbors >= MinPts (i.e. 3) and
hence are the core data points.
• Data point A is a border point.
• There exist no outliers in the given set of data points.
Advantages of DBSCAN
• Arbitrary Cluster Shapes: Handles clusters of varying shapes and
densities.
• No Pre-Specified K: Automatically determines the number of clusters.
• Robust to Outliers: Noise points are left out of clusters, reducing
skew.
Disadvantages of DBSCAN
• Sensitive to Parameters: Results depend on careful tuning of ε and
minPts.
• Challenges with High Dimensions: Suffers from the curse of
dimensionality.
• Difficulty with Density Variation: Clusters with different densities can
be hard to capture.
Comparison of K-Means, Hierarchical, & DBSCAN