10.
4 Density-Based Methods 471
10.4 Density-Based Methods
Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
They have difficulty finding clusters of arbitrary shape such as the “S” shape and oval
clusters in Figure 10.13. Given such data, they would likely inaccurately identify convex
regions, where noise or outliers are included in the clusters.
To find clusters of arbitrary shape, alternatively, we can model clusters as dense
regions in the data space, separated by sparse regions. This is the main strategy behind
density-based clustering methods, which can discover clusters of nonspherical shape.
In this section, you will learn the basic techniques of density-based clustering by
studying three representative methods, namely, DBSCAN (Section 10.4.1), OPTICS
(Section 10.4.2), and DENCLUE (Section 10.4.3).
10.4.1 DBSCAN: Density-Based Clustering Based on Connected
Regions with High Density
“How can we find dense regions in density-based clustering?” The density of an object o
can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial
Clustering of Applications with Noise) finds core objects, that is, objects that have dense
neighborhoods. It connects core objects and their neighborhoods to form dense regions
as clusters.
“How does DBSCAN quantify the neighborhood of an object?” A user-specified para-
meter > 0 is used to specify the radius of a neighborhood we consider for every object.
The -neighborhood of an object o is the space within a radius centered at o.
Due to the fixed neighborhood size parameterized by , the density of a neighbor-
hood can be measured simply by the number of objects in the neighborhood. To deter-
mine whether a neighborhood is dense or not, DBSCAN uses another user-specified
Figure 10.13 Clusters of arbitrary shape.
472 Chapter 10 Cluster Analysis: Basic Concepts and Methods
parameter, MinPts, which specifies the density threshold of dense regions. An object is
a core object if the -neighborhood of the object contains at least MinPts objects. Core
objects are the pillars of dense regions.
Given a set, D, of objects, we can identify all core objects with respect to the given
parameters, and MinPts. The clustering task is therein reduced to using core objects
and their neighborhoods to form dense regions, where the dense regions are clusters.
For a core object q and an object p, we say that p is directly density-reachable from q
(with respect to and MinPts) if p is within the -neighborhood of q. Clearly, an object
p is directly density-reachable from another object q if and only if q is a core object and
p is in the -neighborhood of q. Using the directly density-reachable relation, a core
object can “bring” all objects from its -neighborhood into a dense region.
“How can we assemble a large dense region using small dense regions centered by core
objects?” In DBSCAN, p is density-reachable from q (with respect to and MinPts in
D) if there is a chain of objects p1 , . . . , pn , such that p1 = q, pn = p, and pi+1 is directly
density-reachable from pi with respect to and MinPts, for 1 ≤ i ≤ n, pi ∈ D. Note that
density-reachability is not an equivalence relation because it is not symmetric. If both o1
and o2 are core objects and o1 is density-reachable from o2 , then o2 is density-reachable
from o1 . However, if o2 is a core object but o1 is not, then o1 may be density-reachable
from o2 , but not vice versa.
To connect core objects as well as their neighbors in a dense region, DBSCAN uses
the notion of density-connectedness. Two objects p1 , p2 ∈ D are density-connected with
respect to and MinPts if there is an object q ∈ D such that both p1 and p2 are density-
reachable from q with respect to and MinPts. Unlike density-reachability, density-
connectedness is an equivalence relation. It is easy to show that, for objects o1 , o2 , and
o3 , if o1 and o2 are density-connected, and o2 and o3 are density-connected, then so are
o1 and o3 .
Example 10.7 Density-reachability and density-connectivity. Consider Figure 10.14 for a given
represented by the radius of the circles, and, say, let MinPts = 3.
Of the labeled points, m, p, o, r are core objects because each is in an -neighborhood
containing at least three points. Object q is directly density-reachable from m. Object m
is directly density-reachable from p and vice versa.
Object q is (indirectly) density-reachable from p because q is directly density-
reachable from m and m is directly density-reachable from p. However, p is not density-
reachable from q because q is not a core object. Similarly, r and s are density-reachable
from o and o is density-reachable from r. Thus, o, r, and s are all density-connected.
We can use the closure of density-connectedness to find connected dense regions as
clusters. Each closed set is a density-based cluster. A subset C ⊆ D is a cluster if (1)
for any two objects o1 , o2 ∈ C, o1 and o2 are density-connected; and (2) there does not
exist an object o ∈ C and another object o0 ∈ (D − C) such that o and o0 are density-
connected.
10.4 Density-Based Methods 473
m
r
p
s
Figure 10.14 Density-reachability and density-connectivity in density-based clustering. Source: Based on
Ester, Kriegel, Sander, and Xu [EKSX96].
“How does DBSCAN find clusters?” Initially, all objects in a given data set D are
marked as “unvisited.” DBSCAN randomly selects an unvisited object p, marks p as
“visited,” and checks whether the -neighborhood of p contains at least MinPts objects.
If not, p is marked as a noise point. Otherwise, a new cluster C is created for p, and all
the objects in the -neighborhood of p are added to a candidate set, N . DBSCAN iter-
atively adds to C those objects in N that do not belong to any cluster. In this process,
for an object p0 in N that carries the label “unvisited,” DBSCAN marks it as “visited” and
checks its -neighborhood. If the -neighborhood of p0 has at least MinPts objects, those
objects in the -neighborhood of p0 are added to N . DBSCAN continues adding objects
to C until C can no longer be expanded, that is, N is empty. At this time, cluster C is
completed, and thus is output.
To find the next cluster, DBSCAN randomly selects an unvisited object from the
remaining ones. The clustering process continues until all objects are visited. The
pseudocode of the DBSCAN algorithm is given in Figure 10.15.
If a spatial index is used, the computational complexity of DBSCAN is O(n log n),
where n is the number of database objects. Otherwise, the complexity is O(n2 ). With
appropriate settings of the user-defined parameters, and MinPts, the algorithm is
effective in finding arbitrary-shaped clusters.
10.4.2 OPTICS: Ordering Points to Identify
the Clustering Structure
Although DBSCAN can cluster objects given input parameters such as (the maxi-
mum radius of a neighborhood) and MinPts (the minimum number of points required
in the neighborhood of a core object), it encumbers users with the responsibility of
selecting parameter values that will lead to the discovery of acceptable clusters. This is
a problem associated with many other clustering algorithms. Such parameter settings