Density-Based Spatial Clustering Of Applications With Noise
(DBSCAN)
• Clusters are dense regions in the data space, separated by regions of the
lower density of points. The DBSCAN algorithm is based on this intuitive
notion of “clusters” and “noise”.
• The key idea is that for each point of a cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
Why DBSCAN?
• Partitioning methods (K-means, PAM clustering) and hierarchical clustering
work for finding spherical-shaped clusters or convex clusters. In other
words, they are suitable only for compact and well-separated clusters.
Moreover, they are also severely affected by the presence of noise and
outliers in the data. Figure 1
• Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.
The figure 1 shows a data set containing non-convex shape clusters and
outliers. Given such data, the k-means algorithm has difficulties in identifying
these clusters with arbitrary shapes.
Parameters Required For DBSCAN Algorithm
• eps: It defines the neighborhood around a data point i.e. if the distance between two points
is lower or equal to ‘eps’ then they are considered neighbors.
If the eps value is chosen too small then a large part of the data will be considered as an
outlier.
If it is chosen very large then the clusters will merge and the majority of the data points
will be in the same clusters.
• MinPts: Minimum number of neighbors (data points) within eps radius. The larger the
dataset, the larger value of MinPts must be chosen.
As a general rule, the minimum MinPts can be derived from the number of dimensions D
in the dataset as, MinPts >= D+1. The minimum value of MinPts must be chosen at least 3.
Steps used in DBSCAN Algorithm
• Find all the neighbor points within eps and identify the core points or visited
with more than MinPts neighbors.
• For each core point if it is not already assigned to a cluster, create a new cluster.
• Find recursively all its density-connected points and assign them to the same
cluster as the core point.
• A point a and b are said to be density connected if there exists a point c which
has a sufficient number of points in its neighbors and both points a and b are
within the eps distance. This is a chaining process. So, if b is a neighbor of c, c
is a neighbor of d, and d is a neighbor of e, which in turn is neighbor of a
implying that b is a neighbor of a.
• Iterate through the remaining unvisited points in the dataset. Those points that
do not belong to any cluster are noise.