Why do we need a Density-Based clustering algorithm like DBSCAN when we already have
K-means clustering?
K-Means clustering may cluster loosely related observations together. Every observation
eventually becomes part of some cluster, even if the observations are scattered far apart in the
vector space. Since the clusters depend on the mean value of their elements, each data point plays
a role in forming them, and a slight change in the data points can affect the clustering outcome.
This is usually not a big problem unless we come across oddly shaped data, but DBSCAN greatly
reduces the problem because of the way it forms clusters.
Another challenge with k-means is that you need to specify the number of clusters (“k”) in order
to use it. Much of the time, we won’t know what a reasonable k value is a priori.
What’s nice about DBSCAN is that you don’t have to specify the number of clusters to use it. All
you need is a function to calculate the distance between values and some guidance for what
amount of distance is considered “close”. DBSCAN also produces more reasonable results than
k-means across a variety of different distributions.
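To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and its two-moons toy dataset (neither is prescribed by the text above), in which k-means must be told k while DBSCAN is given only a distance scale and a density threshold:

```python
# Minimal sketch (assumes scikit-learn): k-means needs k up front,
# DBSCAN needs only eps and a minimum density (min_samples).
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means must be told k; on two interleaved half-moons it typically splits
# the data down the middle rather than following the curved shapes.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN only needs eps ("how close is close") and min_samples (the density
# threshold); it can follow the curved shapes and mark stray points as noise (-1).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("k-means clusters:", set(kmeans_labels))
print("DBSCAN clusters (-1 = noise):", set(dbscan_labels))
```

On the two interleaved half-moons, k-means typically cuts straight across both shapes, whereas DBSCAN with a suitable eps follows each moon and marks stray points as noise (label -1).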
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of
high point density, separated from other such clusters by contiguous regions of low point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the base algorithm for
density-based clustering. It can discover clusters of different shapes and sizes in large amounts
of data that contain noise and outliers.
The DBSCAN algorithm uses two parameters:
minPts: The minimum number of points (a threshold) clustered together for a region
to be considered dense.
eps (ε): A distance measure that will be used to locate the points in the neighborhood
of any point.
These parameters can be understood if we explore two concepts called Density Reachability and
Density Connectivity.
Reachability, in terms of density, establishes that a point is reachable from another if it lies
within a particular distance (eps) of it.
Connectivity, on the other hand, involves a transitivity-based chaining approach to determine
whether points belong to a particular cluster. For example, points p and q could be connected
if p->r->s->t->q, where a->b means b is in the neighborhood of a.
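As a rough illustration of the chaining idea, here is a small sketch in plain Python with hypothetical points named p, r, s, t and q, mirroring the chain above. It checks direct reachability through the eps-neighborhood and connectivity by hopping from neighborhood to neighborhood; note that full DBSCAN additionally requires the intermediate points of such a chain to be core points:

```python
from math import dist

# Hypothetical points laid out in a line so that q is far from p but reachable
# through the chain p -> r -> s -> t -> q.
points = {"p": (0.0, 0.0), "r": (1.0, 0.0), "s": (2.0, 0.0),
          "t": (3.0, 0.0), "q": (4.0, 0.0)}
eps = 1.5

def neighbors(name):
    """Points within eps of the named point: its eps-neighborhood."""
    return {other for other in points
            if other != name and dist(points[name], points[other]) <= eps}

def connected(a, b):
    """Follow chains of eps-neighborhood hops outward from a, looking for b."""
    seen, frontier = {a}, [a]
    while frontier:
        current = frontier.pop()
        for nb in neighbors(current):
            if nb == b:
                return True
            if nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return False

print(neighbors("p"))       # {'r'}: only r is directly reachable from p
print(connected("p", "q"))  # True: p -> r -> s -> t -> q chains through
```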
There are three types of points after the DBSCAN clustering is complete (a small classification
sketch follows this list):
Core — This is a point that has at least minPts points within distance eps from itself.
Border — This is a point that is not a core point but has at least one core point within
distance eps.
Noise — This is a point that is neither a core nor a border point: it has fewer than minPts
points within distance eps and no core point in its neighborhood.
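Here is a small sketch, using plain Python and made-up one-dimensional data with illustrative eps and minPts values, that applies these three definitions directly:

```python
from math import dist

# Made-up 1-D data: a dense run around 1-3, a dense run around 8, one stray point.
data = [(1.0,), (1.5,), (2.0,), (3.0,), (8.0,), (8.2,), (8.4,), (25.0,)]
eps, min_pts = 1.0, 2

def neighborhood(i):
    """Indices of the points within eps of point i (excluding i itself)."""
    return [j for j in range(len(data)) if j != i and dist(data[i], data[j]) <= eps]

core = {i for i in range(len(data)) if len(neighborhood(i)) >= min_pts}
border = {i for i in range(len(data))
          if i not in core and any(j in core for j in neighborhood(i))}
noise = set(range(len(data))) - core - border

print("core:", sorted(core))      # the densely surrounded points
print("border:", sorted(border))  # the point at 3.0, within eps of a core point
print("noise:", sorted(noise))    # the isolated point at 25.0
```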
Algorithmic steps for DBSCAN clustering
The algorithm proceeds by arbitrarily picking a point in the dataset (and repeating until all
points have been visited).
If there are at least 'minPts' points within a radius of 'ε' around the point, then we
consider all of these points to be part of the same cluster.
The clusters are then expanded by recursively repeating the neighborhood calculation
for each neighboring point; a compact sketch of this procedure follows.
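A compact sketch of these steps in plain Python might look like the following; the function and variable names are illustrative rather than taken from any library, and Euclidean distance is assumed:

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)            # None = not yet visited
    cluster_id = -1

    def region_query(i):
        # Indices of all points within eps of point i (point i itself included).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue                          # already visited
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = -1                    # provisionally noise
            continue
        cluster_id += 1                       # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                          # expand the cluster outwards
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id        # former noise becomes a border point
            if labels[j] is not None:
                continue                      # already assigned
            labels[j] = cluster_id
            if len(region_query(j)) >= min_pts:
                seeds.extend(region_query(j))  # j is also core: keep expanding
    return labels

# Hypothetical 2-D data: two dense groups and one stray point.
print(dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)],
             eps=1.5, min_pts=3))             # -> [0, 0, 0, 1, 1, 1, -1]
```

One detail that varies between presentations is whether a point counts itself in its own ε-neighborhood. This sketch includes it, whereas the worked example below compares MinPts against the neighbors of each point only.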
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are also severely affected by the presence of noise
and outliers in the data.
Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7, 3), F(6, 2), G(7, 2) and H(8, 4), Find the
core points and outliers using DBSCAN. Take Eps = 2.5 and MinPts = 3.
Given, Epsilon(Eps) = 2.5
Minimum Points(MinPts) = 3
Step 1: Let's compute the Euclidean distance between every pair of points and represent the
distances in tabular form (rounded to two decimals):

      A     B     C     D     E     F     G     H
A   0.00  1.41  2.83  4.24  5.66  5.83  6.40  5.83
B   1.41  0.00  1.41  2.83  4.24  4.47  5.00  4.47
C   2.83  1.41  0.00  1.41  2.83  3.16  3.61  3.16
D   4.24  2.83  1.41  0.00  1.41  2.00  2.24  2.00
E   5.66  4.24  2.83  1.41  0.00  1.41  1.00  1.41
F   5.83  4.47  3.16  2.00  1.41  0.00  1.00  2.83
G   6.40  5.00  3.61  2.24  1.00  1.00  0.00  2.24
H   5.83  4.47  3.16  2.00  1.41  2.83  2.24  0.00

The diagonal elements of this matrix are always 0, as the distance of a point from itself is
always 0. In the table above, every distance ≤ Epsilon (i.e. 2.5) marks a pair of neighboring points.
Step 2: Now, find all the data points that lie in the Eps-neighborhood of each data point.
That is, put every point whose distance is ≤ 2.5 from a given point into that point's neighborhood set.
N(A) = {B}                → because the distance of B from A is ≤ 2.5
N(B) = {A, C}             → because the distances of A and C from B are ≤ 2.5
N(C) = {B, D}             → because the distances of B and D from C are ≤ 2.5
N(D) = {C, E, F, G, H}    → because the distances of C, E, F, G and H from D are ≤ 2.5
N(E) = {D, F, G, H}       → because the distances of D, F, G and H from E are ≤ 2.5
N(F) = {D, E, G}          → because the distances of D, E and G from F are ≤ 2.5
N(G) = {D, E, F, H}       → because the distances of D, E, F and H from G are ≤ 2.5
N(H) = {D, E, G}          → because the distances of D, E and G from H are ≤ 2.5
Here, data points A, B and C have fewer than MinPts (i.e. 3) points in their neighborhoods, so
they cannot be considered core points. Since each of them belongs to the neighborhood of other
data points, there are no outliers in the given set of data points.
Data points D, E, F, G and H each have at least MinPts (i.e. 3) points in their neighborhoods and
are therefore the core data points.
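The hand calculation above can be reproduced with a short sketch in plain Python, using Euclidean distance and neighborhoods that exclude the point itself, as in Step 2:

```python
from math import dist

pts = {"A": (3, 7), "B": (4, 6), "C": (5, 5), "D": (6, 4),
       "E": (7, 3), "F": (6, 2), "G": (7, 2), "H": (8, 4)}
eps, min_pts = 2.5, 3

# Eps-neighborhood of every point (the point itself excluded, as in Step 2).
neighborhoods = {
    name: sorted(other for other in pts
                 if other != name and dist(pts[name], pts[other]) <= eps)
    for name in pts
}
core = [name for name, nbrs in neighborhoods.items() if len(nbrs) >= min_pts]

for name, nbrs in neighborhoods.items():
    print(f"N({name}) = {nbrs}")
print("Core points:", core)  # ['D', 'E', 'F', 'G', 'H']
```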