DBSCAN - Introduction in Machine Learning

DBSCAN is a clustering algorithm that classifies points into core, border, and noise categories based on two parameters: ε (epsilon) and MinPts. It can identify clusters of arbitrary shape and is robust to noise, but struggles with varying densities and high-dimensional data. The document also provides guidance on parameter selection, advantages, limitations, and a Python implementation example.


DBSCAN works with two main parameters:

1. ε (epsilon): The maximum distance between two points for them to be considered as
part of the same neighborhood.
2. MinPts: The minimum number of points required to form a dense region.

Based on these, each point is classified as one of three types (a minimal sketch follows the list):

• Core Point: Has at least MinPts points within ε (including itself).
• Border Point: Has fewer than MinPts points within ε, but lies in the neighborhood of a core point.
• Noise (Outlier): Neither a core point nor a border point.
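As a minimal sketch of these definitions (not scikit-learn's implementation; the data, function name, and parameter values are illustrative), the snippet below labels each point in a small NumPy array as core, border, or noise:

import numpy as np

def classify_points(X, eps=0.5, min_pts=4):
    """Label each point 'core', 'border', or 'noise' (illustrative sketch)."""
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts include the point itself (distance 0 <= eps)
    neighbor_counts = (dists <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_pts

    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif is_core[dists[i] <= eps].any():
            # Not core itself, but within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = np.array([[0, 0], [0.3, 0.1], [0.1, 0.4], [0.2, 0.2], [5, 5]])
print(classify_points(X, eps=0.5, min_pts=4))
# -> ['core', 'core', 'core', 'core', 'noise']  (the first four points are mutually within eps)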

DBSCAN Algorithm Steps

1. Start with an arbitrary unvisited point.
2. Retrieve the ε-neighborhood of the point.
3. If the neighborhood contains at least MinPts points, start a new cluster.
   o Mark the point as core.
   o Expand the cluster recursively by including all density-reachable points (those reachable from core points through a chain of neighboring core points).
4. If not, mark the point as noise (it might later be included in a cluster as a border point).
5. Repeat until all points have been visited.
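To make these steps concrete, here is a compact from-scratch sketch (pure NumPy, iterative stack-based expansion instead of literal recursion; the function name is illustrative, not scikit-learn's API):

import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Return a cluster label per point; -1 marks noise (illustrative sketch)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    for i in range(n):                 # step 1: pick an unvisited point
        if visited[i]:
            continue
        visited[i] = True
        if len(neighborhoods[i]) < min_pts:
            continue                   # step 4: provisionally noise; may become border later
        labels[i] = cluster_id         # step 3: i is core -> start a new cluster
        stack = list(neighborhoods[i])
        while stack:                   # expand to all density-reachable points
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster_id     # claim unassigned/noise points as members
            if not visited[j]:
                visited[j] = True
                if len(neighborhoods[j]) >= min_pts:
                    stack.extend(neighborhoods[j])  # j is core: keep expanding
        cluster_id += 1
    return labels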

Advantages of DBSCAN
• Can find clusters of arbitrary shape (unlike K-Means which assumes spherical
clusters).
• Robust to noise and outliers.
• No need to specify the number of clusters beforehand (unlike K-Means or Gaussian
Mixture Models).

Limitations
• DBSCAN struggles when the dataset has varying densities, as a global ε and MinPts
may not work well for all clusters.
• Performance can degrade with high-dimensional data due to the "curse of
dimensionality".
• Choosing optimal ε and MinPts can be non-trivial and often requires domain
knowledge or empirical testing (e.g., using a k-distance graph to find a good ε).
Comparison with Other Clustering Algorithms

Feature                 | DBSCAN                     | K-Means    | Hierarchical
Cluster shape           | Arbitrary                  | Spherical  | Arbitrary
Handles noise           | Yes                        | No         | No
Need to specify K?      | No                         | Yes        | No (but needs a linkage criterion)
Handles varying density | Not well                   | No         | No
Time complexity         | O(n log n) (with indexing) | O(n·k·d·i) | O(n²)

DBSCAN in Python with scikit-learn and matplotlib.

🧪 Step-by-Step Code Example


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Generate sample data (two moons shape – non-convex clusters)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Plotting results
plt.figure(figsize=(8, 6))
unique_labels = set(labels)

# Assign colors to each cluster, black for noise
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black for noise (DBSCAN labels noise points as -1)
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)
    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title("DBSCAN Clustering on Two Moons Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
🧠 Tips on Choosing Parameters

• ε (eps): Plot a k-distance graph (the distance from each point to its k-th nearest neighbor) and look for the "elbow" point.
• min_samples (MinPts): A common heuristic is min_samples = 2 * n_features (i.e., 2×D). A lower bound often suggested is

MinPts = D + 1

where D is the number of dimensions in your data. This ensures that core points are surrounded by at least as many points as dimensions plus one (to avoid counting noise or surface points as core).

• For 2D or 3D data, MinPts = 4 or 5 is often a good start.
• In high-dimensional datasets, set MinPts between 2×D and 4×D.
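As a quick illustration of these rules of thumb (the placeholder data and variable names are assumptions; substitute your own dataset):

import numpy as np

X = np.random.rand(300, 2)     # placeholder data; replace with your dataset
D = X.shape[1]                 # number of dimensions/features

min_pts_floor = D + 1          # lower bound: dimensions plus one
min_pts_common = 2 * D         # common heuristic (2 * n_features)
min_pts_noisy = 4 * D          # upper end for high-dimensional/noisy data
print(min_pts_floor, min_pts_common, min_pts_noisy)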

✅ When to increase MinPts:

• If your dataset has noise, increase MinPts to reduce false positives.
• A larger MinPts results in more conservative (denser) clusters.

📌 Choosing ε (Epsilon Radius)

This is more sensitive and often the parameter that requires tuning. A small change in ε can
drastically change the clustering outcome.

✅ Use a k-distance graph:

This is the most widely used technique. Here’s how it works:

1. For each point, compute the distance to its k-th nearest neighbor, with k = MinPts.
2. Sort these distances in ascending order and plot them.
3. Look for the "elbow" where the curve bends sharply upward; the distance at that elbow is a good candidate for ε.
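A short sketch of this procedure using scikit-learn's NearestNeighbors (the two-moons data and k value are illustrative; the eps you read off depends on your data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
k = 5  # use k = MinPts (min_samples)

# +1 because each point is returned as its own nearest neighbor (distance 0)
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)        # shape: (n_samples, k + 1)
k_distances = np.sort(distances[:, -1])  # distance to the k-th true neighbor, ascending

plt.plot(k_distances)
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.title("k-distance graph: pick eps at the elbow")
plt.show()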

Summary Cheat Sheet

Parameter   | Strategy
MinPts      | Set to D + 1 or 2×D, where D = number of dimensions
ε           | Use a k-distance graph; pick the value at the "elbow"
Fine-tuning | Try a grid search on a small dataset subset (see the sketch below)
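One possible sketch of the fine-tuning step, scoring a small grid of (eps, min_samples) pairs with silhouette score on the non-noise points (the grid values and the choice of scoring metric are assumptions, not a prescribed method):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

best = None
for eps in [0.1, 0.15, 0.2, 0.25, 0.3]:
    for min_samples in [3, 5, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1                  # score only non-noise points
        if len(set(labels[mask])) < 2:
            continue                         # silhouette needs at least 2 clusters
        score = silhouette_score(X[mask], labels[mask])
        if best is None or score > best[0]:
            best = (score, eps, min_samples)

if best is not None:
    print(f"best silhouette={best[0]:.3f} at eps={best[1]}, min_samples={best[2]}")

Note that silhouette score favors convex clusters, so on non-convex data like the two moons, treat this as a rough filter rather than the final word.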

Gotchas to Avoid
• Don't set ε too small → All points become noise.
• Don't set ε too large → Merges all points into one big cluster.
• No single perfect choice → Varies by data distribution and density.
