KEMBAR78
DBSCAN | PDF | Cluster Analysis | Theoretical Computer Science
0% found this document useful (0 votes)
34 views29 pages

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that identifies clusters of varying shapes and densities, effectively handling noise and outliers. It requires two parameters: epsilon (ε), the maximum distance for points to be considered neighbors, and MinPts, the minimum number of points to form a dense region. The algorithm iteratively identifies core points and expands clusters based on density connectivity, distinguishing between cluster points and noise.

Uploaded by

ksheoran1213
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views29 pages

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that identifies clusters of varying shapes and densities, effectively handling noise and outliers. It requires two parameters: epsilon (ε), the maximum distance for points to be considered neighbors, and MinPts, the minimum number of points to form a dense region. The algorithm iteratively identifies core points and expands clusters based on density connectivity, distinguishing between cluster points and noise.

Uploaded by

ksheoran1213
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 29

DBSCAN

Introduction
• Clustering analysis or simply Clustering is basically an
Unsupervised learning method that divides the data
points into a number of specific batches or groups, such
that the data points in the same groups have similar
properties and data points in different groups have
different properties in some sense.
• Partitioning methods (K-means, PAM (portioning
around medoids) clustering) and hierarchical clustering
work for finding spherical-shaped clusters or convex
clusters. In other words, they are suitable only for
compact and well-separated clusters.
DBSCAN Clustering

• DBSCAN stands for "Density-Based Spatial


Clustering of Applications with Noise."
• It's a popular clustering algorithm used to
identify clusters of data points in a dataset,
particularly in cases where the clusters might
have complex shapes and varying densities.
• DBSCAN is effective at finding clusters in data
with noise and outliers as well.
• Real-life data may contain irregularities, like:
• Clusters can be of arbitrary shape such as those shown in the figure
below.
• Data may contain noise.
• The figure shows a data set containing non-convex shape clusters and
outliers. Given such data, the k-means algorithm has difficulties in
identifying these clusters with arbitrary shapes.
Parameters Required For DBSCAN Algorithm

• Epsilon (ε): The maximum distance between two


points for them to be considered as neighbors.
• MinPts: The minimum number of points required to
form a dense region (core points).
• Minimum number of neighbors (data points) within
eps radius. The larger the dataset, the larger value
of MinPts must be chosen. As a general rule, the
minimum MinPts can be derived from the number
of dimensions D in the dataset as, MinPts >= D+1.
The minimum value of MinPts must be chosen at
least 3.
• Let's set ε = 1.5 and MinPts = 3.
• We start by randomly selecting a point from
the dataset and check its ε-neighborhood. If
the ε-neighborhood contains at least MinPts
points, we consider it a core point and expand
its cluster. We repeat this process until all the
points are assigned to a cluster or identified as
noise.
Steps Used In DBSCAN Algorithm
• Find all the neighbor points within eps and identify the core points or
visited with more than MinPts neighbors.
• For each core point if it is not already assigned to a cluster, create a new
cluster.
• Find recursively all its density-connected points and assign them to the
same cluster as the core point.
A point a and b are said to be density connected if there exists a
point c which has a sufficient number of points in its neighbors and both
points a and b are within the eps distance. This is a chaining process. So,
if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which
in turn is neighbor of a implying that b is a neighbor of a.
• Iterate through the remaining unvisited points in the dataset. Those
points that do not belong to any cluster are noise.
P9 is noise here
Here's a step-by-step breakdown of the DBSCAN algorithm for this
example:

• Suppose we have the following dataset of 2D points:



• Points: [(2, 3), (2, 5), (3, 4), (5, 3), (5, 5), (6, 4), (8, 2), (8,
4), (9, 5)]

• Start with the first point (2, 3). Its ε-neighborhood


contains the points (2, 5), (3, 4), and (5, 3). Since the ε-
neighborhood has MinPts (3) or more points, we create
a new cluster and assign these points to the cluster.
• Cluster 1: [(2, 3), (2, 5), (3, 4), (5, 3)]
• Next, we move to the next unvisited point (5,
5). Its ε-neighborhood contains the point (6,
4). Since the ε-neighborhood has fewer than
MinPts (3) points, (5, 5) is marked as noise.

• Noise: [(5, 5)]
• We move to the next unvisited point (6, 4). Its ε-
neighborhood contains the points (5, 5), (8, 2), (8,
4), and (9, 5). The ε-neighborhood has MinPts (3) or
more points, so we create a new cluster and assign
these points to the cluster.

• Cluster 2: [(6, 4), (5, 5), (8, 2), (8, 4), (9, 5)]

• All the points have been visited, and we have two
clusters and one noise point.
• Cluster 1: [(2, 3), (2, 5), (3, 4), (5, 3)]
• Cluster 2: [(6, 4), (5, 5), (8, 2), (8, 4), (9, 5)]
• Noise: [(5, 5)]

• In this example, the DBSCAN algorithm
identified two clusters and one noise point in
the dataset based on the specified parameters
ε = 1.5 and MinPts = 3.
• importnumpy as np
• fromsklearn.cluster import DBSCAN
• fromsklearn.metrics import silhouette_samples, silhouette_score

• # Generate sample data
• X = np.array([[2, 3], [2, 5], [3, 4], [5, 3], [5, 5], [6, 4], [8, 2], [8, 4], [9, 5]])

• # Perform DBSCAN clustering
• dbscan = DBSCAN(eps=1.5, min_samples=3)
• cluster_labels = dbscan.fit_predict(X)

• # Exclude noise points (-1) from silhouette analysis
• valid_labels = cluster_labels != -1
• valid_X = X[valid_labels]
• valid_cluster_labels = cluster_labels[valid_labels]

• # Compute silhouette scores
• silhouette_avg = silhouette_score(valid_X, valid_cluster_labels)
• sample_silhouette_values = silhouette_samples(valid_X,
valid_cluster_labels)

• # Print the silhouette score and silhouette values for each
sample
• print("Silhouette Score:", silhouette_avg)
• for i, label in enumerate(valid_cluster_labels):
• print("Sample", i+1, " - Cluster:", label, " - Silhouette Value:",
sample_silhouette_values[i])

You might also like