DBSCAN
Introduction
• Clustering analysis or simply Clustering is basically an
Unsupervised learning method that divides the data
points into a number of specific batches or groups, such
that the data points in the same groups have similar
properties and data points in different groups have
different properties in some sense.
• Partitioning methods (K-means, PAM (portioning
around medoids) clustering) and hierarchical clustering
work for finding spherical-shaped clusters or convex
clusters. In other words, they are suitable only for
compact and well-separated clusters.
DBSCAN Clustering
• DBSCAN stands for "Density-Based Spatial
Clustering of Applications with Noise."
• It's a popular clustering algorithm used to
identify clusters of data points in a dataset,
particularly in cases where the clusters might
have complex shapes and varying densities.
• DBSCAN is effective at finding clusters in data
with noise and outliers as well.
• Real-life data may contain irregularities, like:
• Clusters can be of arbitrary shape such as those shown in the figure
below.
• Data may contain noise.
• The figure shows a data set containing non-convex shape clusters and
outliers. Given such data, the k-means algorithm has difficulties in
identifying these clusters with arbitrary shapes.
Parameters Required For DBSCAN Algorithm
• Epsilon (ε): The maximum distance between two
points for them to be considered as neighbors.
• MinPts: The minimum number of points required to
form a dense region (core points).
• Minimum number of neighbors (data points) within
eps radius. The larger the dataset, the larger value
of MinPts must be chosen. As a general rule, the
minimum MinPts can be derived from the number
of dimensions D in the dataset as, MinPts >= D+1.
The minimum value of MinPts must be chosen at
least 3.
• Let's set ε = 1.5 and MinPts = 3.
• We start by randomly selecting a point from
the dataset and check its ε-neighborhood. If
the ε-neighborhood contains at least MinPts
points, we consider it a core point and expand
its cluster. We repeat this process until all the
points are assigned to a cluster or identified as
noise.
Steps Used In DBSCAN Algorithm
• Find all the neighbor points within eps and identify the core points or
visited with more than MinPts neighbors.
• For each core point if it is not already assigned to a cluster, create a new
cluster.
• Find recursively all its density-connected points and assign them to the
same cluster as the core point.
A point a and b are said to be density connected if there exists a
point c which has a sufficient number of points in its neighbors and both
points a and b are within the eps distance. This is a chaining process. So,
if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which
in turn is neighbor of a implying that b is a neighbor of a.
• Iterate through the remaining unvisited points in the dataset. Those
points that do not belong to any cluster are noise.
P9 is noise here
Here's a step-by-step breakdown of the DBSCAN algorithm for this
example:
• Suppose we have the following dataset of 2D points:
•
• Points: [(2, 3), (2, 5), (3, 4), (5, 3), (5, 5), (6, 4), (8, 2), (8,
4), (9, 5)]
• Start with the first point (2, 3). Its ε-neighborhood
contains the points (2, 5), (3, 4), and (5, 3). Since the ε-
neighborhood has MinPts (3) or more points, we create
a new cluster and assign these points to the cluster.
• Cluster 1: [(2, 3), (2, 5), (3, 4), (5, 3)]
• Next, we move to the next unvisited point (5,
5). Its ε-neighborhood contains the point (6,
4). Since the ε-neighborhood has fewer than
MinPts (3) points, (5, 5) is marked as noise.
•
• Noise: [(5, 5)]
• We move to the next unvisited point (6, 4). Its ε-
neighborhood contains the points (5, 5), (8, 2), (8,
4), and (9, 5). The ε-neighborhood has MinPts (3) or
more points, so we create a new cluster and assign
these points to the cluster.
•
• Cluster 2: [(6, 4), (5, 5), (8, 2), (8, 4), (9, 5)]
•
• All the points have been visited, and we have two
clusters and one noise point.
• Cluster 1: [(2, 3), (2, 5), (3, 4), (5, 3)]
• Cluster 2: [(6, 4), (5, 5), (8, 2), (8, 4), (9, 5)]
• Noise: [(5, 5)]
•
• In this example, the DBSCAN algorithm
identified two clusters and one noise point in
the dataset based on the specified parameters
ε = 1.5 and MinPts = 3.
• importnumpy as np
• fromsklearn.cluster import DBSCAN
• fromsklearn.metrics import silhouette_samples, silhouette_score
•
• # Generate sample data
• X = np.array([[2, 3], [2, 5], [3, 4], [5, 3], [5, 5], [6, 4], [8, 2], [8, 4], [9, 5]])
•
• # Perform DBSCAN clustering
• dbscan = DBSCAN(eps=1.5, min_samples=3)
• cluster_labels = dbscan.fit_predict(X)
•
• # Exclude noise points (-1) from silhouette analysis
• valid_labels = cluster_labels != -1
• valid_X = X[valid_labels]
• valid_cluster_labels = cluster_labels[valid_labels]
•
• # Compute silhouette scores
• silhouette_avg = silhouette_score(valid_X, valid_cluster_labels)
• sample_silhouette_values = silhouette_samples(valid_X,
valid_cluster_labels)
•
• # Print the silhouette score and silhouette values for each
sample
• print("Silhouette Score:", silhouette_avg)
• for i, label in enumerate(valid_cluster_labels):
• print("Sample", i+1, " - Cluster:", label, " - Silhouette Value:",
sample_silhouette_values[i])