Density-Based Clustering
Izabela Moise, Evangelos Pournaras, Dirk Helbing
Izabela Moise, Evangelos Pournaras, Dirk Helbing 1
Reminder
Unsupervised data mining
X Clustering
→ k -Means
Izabela Moise, Evangelos Pournaras, Dirk Helbing 2
Main Clustering Approaches
• Partitioning method
→ constructs partitions of data points
→ evaluates the partitions by some criterion
→ k -means, k -medoids
• Density-based method:
→ based on connectivity and density functions
→ DBSCAN, DJCluster
Izabela Moise, Evangelos Pournaras, Dirk Helbing 3
Density-Based Clustering
Izabela Moise, Evangelos Pournaras, Dirk Helbing 4
Density-Based Clustering
Density-Based Clustering
locates regions of high density that are separated from one another
by regions of low density.
Izabela Moise, Evangelos Pournaras, Dirk Helbing 4
Main principles
• Two parameters:
1. maximum radius of the neighbourhood → Eps
2. minimum number of points in an Eps neighbourhood of a point
→ MinPts
• NEps (p) : {q ∈ D s.t. dist(p, q) ≤ Eps }
• Key idea: the density of the neighbourhood has to exceed
some threshold.
• The shape of a neighbourhood depends on the dist function
Izabela Moise, Evangelos Pournaras, Dirk Helbing 5
Main principles
• Two parameters:
1. maximum radius of the neighbourhood → Eps
2. minimum number of points in an Eps neighbourhood of a point
→ MinPts
• NEps (p) : {q ∈ D s.t. dist(p, q) ≤ Eps }
• Key idea: the density of the neighbourhood has to exceed
some threshold.
• The shape of a neighbourhood depends on the dist function
Izabela Moise, Evangelos Pournaras, Dirk Helbing 5
Main principles
• Two parameters:
1. maximum radius of the neighbourhood → Eps
2. minimum number of points in an Eps neighbourhood of a point
→ MinPts
• NEps (p) : {q ∈ D s.t. dist(p, q) ≤ Eps }
• Key idea: the density of the neighbourhood has to exceed
some threshold.
• The shape of a neighbourhood depends on the dist function
Izabela Moise, Evangelos Pournaras, Dirk Helbing 5
Core, Border and Noise/Outlier
Izabela Moise, Evangelos Pournaras, Dirk Helbing 6
1
Jing Gao, SUNY Buffalo
Directly Density-Reachable
Directly density-reachable:
→ A point p is directly density-reachable from a point q wrt. Eps,
MinPts if:
1. p ∈ NEps (q) and
2. |NEps (q)| ≥ MinPts
Izabela Moise, Evangelos Pournaras, Dirk Helbing 7
Directly Density-Reachable
Directly density-reachable:
→ A point p is directly density-reachable from a point q wrt. Eps,
MinPts if:
1. p ∈ NEps (q) and
2. |NEps (q)| ≥ MinPts
Izabela Moise, Evangelos Pournaras, Dirk Helbing 7
Density-Reachable
• Density-reachable:
→ A point p is density-reachable from a point q wrt. Eps,
MinPts if there is a chain of points p1 , ..., pn , with
p1 = q, pn = p, s.t.pi+1 is directly density reachable from pi
• transitive but not symmetric
Izabela Moise, Evangelos Pournaras, Dirk Helbing 8
Density-Connected
Density-connected:
→ A point p is density-connected from a point q wrt. Eps, MinPts if
there is a point o s.t. p and q are density-reachable from o wrt. Eps
and MinPts
Izabela Moise, Evangelos Pournaras, Dirk Helbing 9
Density-Connected
Density-connected:
→ A point p is density-connected from a point q wrt. Eps, MinPts if
there is a point o s.t. p and q are density-reachable from o wrt. Eps
and MinPts
→ symmetric
Izabela Moise, Evangelos Pournaras, Dirk Helbing 9
Density-Connected
Density-connected:
→ A point p is density-connected from a point q wrt. Eps, MinPts if
there is a point o s.t. p and q are density-reachable from o wrt. Eps
and MinPts
→ symmetric
Izabela Moise, Evangelos Pournaras, Dirk Helbing 9
DBSCAN - Density-Based Spatial Clustering of Applications
with Noise
Izabela Moise, Evangelos Pournaras, Dirk Helbing 10
Main Principles
One of the most cited clustering algorithms
Main principle:
a cluster is defined as a maximal set of density-connected points.
• Discovers clusters of arbitrary shapes (spherical, elongated,
linear), and noise
• Works with spatial datasets:
→ geomarketing, tomography, satellite images
• Requires only two parameters (no prior knowledge of the
number of clusters)
Izabela Moise, Evangelos Pournaras, Dirk Helbing 11
Definition: Cluster
2
Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 12
Definition: Noise
3
Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 13
The Algorithm
1. Randomly select a point p
2. Retrieve all points density-reachable from p wrt. Eps and
MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from p →
visit the next data point
5. Continue the process until all points have been processed
Izabela Moise, Evangelos Pournaras, Dirk Helbing 14
Selecting Eps and MinPts
The two parameters can be determined by a heuristic
Observation:
• For points in a cluster their k -th nearest neighbours are at
roughly the same distance.
• Noise points have the k -th nearest neighbour at farther
distance.
Izabela Moise, Evangelos Pournaras, Dirk Helbing 15
4
4
Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 16
5
5
Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 17
6
6
Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 18
Pros and Cons
Pros:
X discovers clusters of arbitrary shapes
X handles noise
X needs density parameters as termination condition
Izabela Moise, Evangelos Pournaras, Dirk Helbing 19
Pros and Cons
Cons:
X cannot handle varying densities
X sensitive to parameters → hard to determine the correct set of
parameters
X sampling affects density measures
Izabela Moise, Evangelos Pournaras, Dirk Helbing 20