Clustering
Data Mining: Data Mining Methods
with Dr. Qin Lv
Learning objective: Apply techniques for clustering and explain how they work; evaluate and compare methods.
Clustering
Ø No predefined classes
Ø Intra-cluster similarity
Ø Inter-cluster dissimilarity
Cluster Analysis
Ø Unsupervised learning
• Group similar objects into clusters
Ø Similarity measure
• Types of objects, similarity/dissimilarity
Ø Clustering method
• Quality, efficiency, incremental processing
Cluster Evaluation
Ø Clustering tendency
Ø Cluster cohesion & separation
Ø Choosing #clusters (e.g., silhouette coefficient; see the sketch after this list)
Ø Comparison with external knowledge
Ø Comparison of two sets of clusters
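A minimal sketch of using the silhouette coefficient to pick #clusters, assuming scikit-learn is available; the data reuses the 10 objects from the k-means example below, and the range of k values tried is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[9], [9], [15], [23], [35], [48], [57], [69], [78], [81]])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette is in [-1, 1]; higher means tighter, better-separated clusters
    print(k, round(silhouette_score(X, labels), 3))
```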
Types of Clustering Methods
Ø Partitioning methods
Ø Hierarchical methods
Ø Grid-based methods
Ø Density-based methods
Ø Probabilistic methods
Partitioning Methods
Ø Given n objects and #clusters k
• Partition the n objects into k clusters
Ø Brute force approach
• Enumerate all possible partitions (exponentially many, so infeasible)
Ø Heuristic methods
• k-means: cluster centroid (mean of objects)
• k-medoids: cluster medoid (“central” object)
k-means Clustering: Method
Ø 1. Pick k initial centroids (e.g., randomly)
Ø 2. Assign each object to nearest centroid
Ø 3. Update each centroid based on objects assigned to its cluster
Ø Repeat 2. & 3. until centroids are stable
Ø O(nkt): n objects, k clusters, t iterations
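A minimal sketch of steps 1-3 for 1-D data in plain Python; the function name kmeans_1d is mine, and the code is illustration only (empty clusters and ties are not handled).

```python
# Initial centroids are passed in, matching step 1's "pick k initial centroids"
def kmeans_1d(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # Step 3: update each centroid to the mean of its cluster
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:  # stable centroids -> converged
            return centroids, clusters
        centroids = new_centroids
    return centroids, clusters
```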
k-means Clustering: Example
Ø 10 objects: {35, 69, 9, 78, 9, 23, 81, 57, 15, 48}.
Ø 2 initial centroids: 30, 60
Ø Sort: {9, 9, 15, 23, 35, 48, 57, 69, 78, 81}
Ø R1: 30 {9, 9, 15, 23, 35}, 60 {48, 57, 69, 78, 81}
Ø C_1 = (9 + 9 + 15 + 23 + 35) / 5 = 18.2
Ø C_2 = (48 + 57 + 69 + 78 + 81) / 5 = 66.6
Ø R2: 18.2 {9, 9, 15, 23, 35}, 66.6 {48, 57, 69, 78, 81} (assignments unchanged, so the centroids are stable and the algorithm stops)
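Running the kmeans_1d sketch above on this data reproduces the result: one centroid update (30, 60 to 18.2, 66.6), then a second pass confirms the assignments are stable.

```python
points = [35, 69, 9, 78, 9, 23, 81, 57, 15, 48]
centroids, clusters = kmeans_1d(points, [30, 60])
print(centroids)  # [18.2, 66.6]
print(sorted(clusters[0]), sorted(clusters[1]))
# [9, 9, 15, 23, 35] [48, 57, 69, 78, 81]
```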
k-means Clustering: Features
Ø Widely used, efficient, and gives good results
Ø Need to specify k & define centroid
Ø Choice of initial centroids
Ø Not suitable for non-convex shapes
Ø Sensitive to noise & outliers
k-medoids Clustering
Ø Similar process to k-means
Ø Cluster medoid: “central” object
Ø Less sensitive to noise & outliers
Ø Medoid update: computationally expensive
Ø Speedup using randomized samples
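A sketch of why the medoid update is costly: each member of a cluster is tried as a candidate, keeping the one that minimizes total distance to the others, so a cluster of size m takes O(m^2) distance computations (the randomized-sampling speedup avoids scanning every member).

```python
# Try every member as candidate medoid; keep the minimizer of total
# distance to the rest (O(m^2) for a cluster of size m)
def update_medoid(cluster):
    return min(cluster, key=lambda cand: sum(abs(cand - p) for p in cluster))

print(update_medoid([9, 9, 15, 23, 35]))  # -> 15
```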
Hierarchical Clustering: Method
Ø Dendrogram
• Tree of clusters
Ø Agglomerative
• Bottom-up merging
Ø Divisive
• Top-down splitting
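A minimal sketch of agglomerative (bottom-up) clustering, assuming SciPy; average linkage is one illustrative definition of cluster distance, and the data reuses the 10-object example above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[9], [9], [15], [23], [35], [48], [57], [69], [78], [81]])
Z = linkage(X, method='average')  # bottom-up merges; cluster distance = average linkage
labels = fcluster(Z, t=2, criterion='maxclust')  # cut dendrogram into 2 flat clusters
print(labels)
```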
Hierarchical Clustering: Features
Ø Useful in many real-world applications
Ø No need to specify #clusters
Ø Need to define cluster distance
Ø Multi-level clustering
Ø Cannot undo cluster merge/split
Grid-based Clustering
Ø Multi-resolution grid structure
• Clusters of different resolutions
• Horizontal & vertical cluster boundaries
Ø Object space => grid cells
• Processing cost depends on #cells; easy to parallelize
Ø Statistical information of grid cells
• Incremental processing
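A small sketch of the object-space-to-grid-cells mapping: each object lands in a cell and only per-cell counts are kept, so clustering can work on cell statistics and new objects update the counts incrementally. The cell size here is an arbitrary illustrative choice.

```python
from collections import Counter

# Map an object to its grid cell by integer division of each coordinate
def to_cell(point, cell_size):
    return tuple(int(v // cell_size) for v in point)

points = [(1.2, 0.8), (1.5, 1.1), (7.9, 8.2), (7.5, 8.4)]
counts = Counter(to_cell(p, 2.0) for p in points)
print(counts)  # Counter({(0, 0): 2, (3, 4): 2})
```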
Density-based Clustering
Ø Local clusters with high density
• DBSCAN: connected dense neighborhood
• DENCLUE: sum of local influence functions
Ø Key features
• Arbitrary cluster shape, noise tolerant
• Single scan, adjustable density parameters
DBSCAN
Ø Two key parameters
• ε-neighborhood: the points within radius ε of p
• MinPts: min #points in p’s ε-neighborhood for p to be considered a core object
Ø Clustering
• Core objects, border objects
• Density-connected, density-reachable
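A minimal sketch of running DBSCAN, assuming scikit-learn; its eps and min_samples parameters correspond to ε and MinPts above, and the values (and the reuse of the 1-D example data) are illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[9], [9], [15], [23], [35], [48], [57], [69], [78], [81]])
# eps plays the role of the radius, min_samples the role of MinPts
labels = DBSCAN(eps=10, min_samples=2).fit_predict(X)
print(labels)  # -1 marks noise; non-negative labels are clusters
```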
DENCLUE
Ø Influence function
• Object’s impact in its neighborhood
Ø Overall density
• Sum of all objects’ influence functions
Ø Density attractors
• Clusters correspond to local maxima
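A sketch of DENCLUE's density model using a Gaussian influence function (one common choice; the slides do not fix a particular function): the overall density at a point is the sum of every object's influence there, and clusters form around the local maxima (density attractors). The sigma smoothing parameter is an illustrative assumption.

```python
import math

# Gaussian influence of one object at location x (sigma is an
# illustrative smoothing parameter, not from the slides)
def influence(x, obj, sigma=5.0):
    return math.exp(-((x - obj) ** 2) / (2 * sigma ** 2))

# Overall density at x = sum of all objects' influence at x;
# density attractors are the local maxima of this landscape
def density(x, objects, sigma=5.0):
    return sum(influence(x, o, sigma) for o in objects)

objects = [9, 9, 15, 23, 35, 48, 57, 69, 78, 81]
print(density(12.0, objects) > density(40.0, objects))  # True: 12 sits in a denser region
```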