Density-Based Clustering
Izabela Moise, Evangelos Pournaras, Dirk Helbing
                                         Izabela Moise, Evangelos Pournaras, Dirk Helbing   1
Reminder
Unsupervised data mining
X Clustering
  → k -Means
                  Izabela Moise, Evangelos Pournaras, Dirk Helbing   2
Main Clustering Approaches
 • Partitioning method
    → constructs partitions of data points
    → evaluates the partitions by some criterion
    → k -means, k -medoids
 • Density-based method:
    → based on connectivity and density functions
    → DBSCAN, DJCluster
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   3
Density-Based Clustering
                           Izabela Moise, Evangelos Pournaras, Dirk Helbing   4
Density-Based Clustering
Density-Based Clustering
locates regions of high density that are separated from one another
by regions of low density.
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   4
Main principles
 • Two parameters:
     1. maximum radius of the neighbourhood → Eps
     2. minimum number of points in an Eps neighbourhood of a point
        → MinPts
 • NEps (p) : {q ∈ D s.t. dist(p, q) ≤ Eps }
 • Key idea: the density of the neighbourhood has to exceed
   some threshold.
 • The shape of a neighbourhood depends on the dist function
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   5
Main principles
 • Two parameters:
     1. maximum radius of the neighbourhood → Eps
     2. minimum number of points in an Eps neighbourhood of a point
        → MinPts
 • NEps (p) : {q ∈ D s.t. dist(p, q) ≤ Eps }
 • Key idea: the density of the neighbourhood has to exceed
   some threshold.
 • The shape of a neighbourhood depends on the dist function
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   5
Main principles
 • Two parameters:
     1. maximum radius of the neighbourhood → Eps
     2. minimum number of points in an Eps neighbourhood of a point
        → MinPts
 • NEps (p) : {q ∈ D s.t. dist(p, q) ≤ Eps }
 • Key idea: the density of the neighbourhood has to exceed
   some threshold.
 • The shape of a neighbourhood depends on the dist function
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   5
Core, Border and Noise/Outlier
                              Izabela Moise, Evangelos Pournaras, Dirk Helbing   6
 1
     Jing Gao, SUNY Buffalo
Directly Density-Reachable
Directly density-reachable:
→ A point p is directly density-reachable from a point q wrt. Eps,
MinPts if:
 1. p ∈ NEps (q) and
 2. |NEps (q)| ≥ MinPts
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   7
Directly Density-Reachable
Directly density-reachable:
→ A point p is directly density-reachable from a point q wrt. Eps,
MinPts if:
 1. p ∈ NEps (q) and
 2. |NEps (q)| ≥ MinPts
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   7
Density-Reachable
 • Density-reachable:
   → A point p is density-reachable from a point q wrt. Eps,
   MinPts if there is a chain of points p1 , ..., pn , with
   p1 = q, pn = p, s.t.pi+1 is directly density reachable from pi
 • transitive but not symmetric
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   8
Density-Connected
Density-connected:
→ A point p is density-connected from a point q wrt. Eps, MinPts if
there is a point o s.t. p and q are density-reachable from o wrt. Eps
and MinPts
                                       Izabela Moise, Evangelos Pournaras, Dirk Helbing   9
Density-Connected
Density-connected:
→ A point p is density-connected from a point q wrt. Eps, MinPts if
there is a point o s.t. p and q are density-reachable from o wrt. Eps
and MinPts
→ symmetric
                                       Izabela Moise, Evangelos Pournaras, Dirk Helbing   9
Density-Connected
Density-connected:
→ A point p is density-connected from a point q wrt. Eps, MinPts if
there is a point o s.t. p and q are density-reachable from o wrt. Eps
and MinPts
→ symmetric
                                       Izabela Moise, Evangelos Pournaras, Dirk Helbing   9
DBSCAN - Density-Based Spatial Clustering of Applications
with Noise
                                 Izabela Moise, Evangelos Pournaras, Dirk Helbing   10
Main Principles
             One of the most cited clustering algorithms
Main principle:
a cluster is defined as a maximal set of density-connected points.
  • Discovers clusters of arbitrary shapes (spherical, elongated,
    linear), and noise
  • Works with spatial datasets:
      → geomarketing, tomography, satellite images
  • Requires only two parameters (no prior knowledge of the
    number of clusters)
                                     Izabela Moise, Evangelos Pournaras, Dirk Helbing   11
Definition: Cluster
 2
     Erik Kropat, University of the Bundeswehr Munich
                                                        Izabela Moise, Evangelos Pournaras, Dirk Helbing   12
Definition: Noise
 3
     Erik Kropat, University of the Bundeswehr Munich
                                                        Izabela Moise, Evangelos Pournaras, Dirk Helbing   13
The Algorithm
 1. Randomly select a point p
 2. Retrieve all points density-reachable from p wrt. Eps and
    MinPts
 3. If p is a core point, a cluster is formed
 4. If p is a border point, no points are density-reachable from p →
    visit the next data point
 5. Continue the process until all points have been processed
                                       Izabela Moise, Evangelos Pournaras, Dirk Helbing   14
Selecting Eps and MinPts
              The two parameters can be determined by a heuristic
Observation:
  • For points in a cluster their k -th nearest neighbours are at
    roughly the same distance.
  • Noise points have the k -th nearest neighbour at farther
    distance.
                                      Izabela Moise, Evangelos Pournaras, Dirk Helbing   15
                                                                                                               4
4
    Erik Kropat, University of the Bundeswehr Munich
                                                       Izabela Moise, Evangelos Pournaras, Dirk Helbing   16
                                                                                                               5
5
    Erik Kropat, University of the Bundeswehr Munich
                                                       Izabela Moise, Evangelos Pournaras, Dirk Helbing   17
                                                                                                               6
6
    Erik Kropat, University of the Bundeswehr Munich
                                                       Izabela Moise, Evangelos Pournaras, Dirk Helbing   18
Pros and Cons
Pros:
 X discovers clusters of arbitrary shapes
 X handles noise
 X needs density parameters as termination condition
                                    Izabela Moise, Evangelos Pournaras, Dirk Helbing   19
Pros and Cons
Cons:
 X cannot handle varying densities
 X sensitive to parameters → hard to determine the correct set of
   parameters
 X sampling affects density measures
                                     Izabela Moise, Evangelos Pournaras, Dirk Helbing   20