Probabilistic Model-Based Clustering
Probabilistic clustering assigns data points to clusters based on
probability distributions. Instead of making hard assignments (as in k-means),
it lets each point belong to multiple clusters with different probabilities.
Steps:
1. Assume Data Follows a Distribution
○ Typically, Gaussian (Normal) distributions are used.
2. Estimate Parameters for Each Cluster
○ Each cluster is defined by mean (μ) and covariance (Σ).
3. Assign Probabilities to Each Point
○ Each point is assigned to multiple clusters with different probabilities.
4. Optimize Using Expectation-Maximization (EM)
○ EM refines cluster parameters to maximize likelihood.
Example:
Step 4: Update Parameters (Maximization Step)
Using the computed responsibilities,
● New means will shift based on probabilities.
● New variances will be recomputed.
● New weights will be adjusted.
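The maximization step can be written directly in numpy. A minimal sketch follows, assuming one-dimensional data for P1–P6 and a responsibility matrix R (one column per cluster) already computed in the expectation step; all numeric values are hypothetical:

```python
import numpy as np

# Hypothetical 1-D points P1..P6 and E-step responsibilities (assumed values).
X = np.array([1.0, 1.5, 2.0, 8.0, 8.5, 9.0])
R = np.array([[0.99, 0.01],
              [0.98, 0.02],
              [0.97, 0.03],
              [0.02, 0.98],
              [0.01, 0.99],
              [0.01, 0.99]])

Nk = R.sum(axis=0)                                             # effective cluster sizes
means = (R * X[:, None]).sum(axis=0) / Nk                      # new means (responsibility-weighted averages)
variances = (R * (X[:, None] - means) ** 2).sum(axis=0) / Nk   # new variances
weights = Nk / len(X)                                          # new mixing weights

print(means, variances, weights)
```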
Step 5: Repeat Until Convergence
We repeat Steps 2–4 until the clusters stabilize.
Final Cluster Assignment:
● Cluster 1: { P1, P2, P3 }
● Cluster 2: { P4, P5, P6 }
Note:
❖Probabilistic assignment: Each point belongs partially to both clusters.
❖Iterative EM updates: New means and variances shift towards data points.
❖Soft clustering: Unlike K-means, GMM handles overlapping clusters better.
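For comparison, a minimal sketch of soft clustering with scikit-learn's GaussianMixture, which runs the full EM loop internally; the two-component setup and the data values are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D points P1..P6 (assumed values).
X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)            # estimated cluster means
print(gmm.predict_proba(X))  # soft (probabilistic) assignment of each point
print(gmm.predict(X))        # hard labels taken from the highest probability
```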
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
- Hierarchical Clustering for large datasets
BIRCH constructs a Clustering Feature Tree (CF Tree) to store summary statistics
instead of keeping all data points.
Construction of the CF Tree:
The Clustering Feature (CF) Tree is a hierarchical structure that stores summary
statistics of the data points in each sub-cluster. Each CF entry is a triple (N, LS, SS):
the number of points, their linear sum, and their sum of squares.
Instead of a raw distance, the cluster radius R (computed directly from the CF entry)
can be used to decide whether a new point is absorbed into an existing sub-cluster.
(Exercise: work through the remaining construction steps.)
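A minimal sketch of one CF entry and its radius R, assuming 1-D points and the standard definitions N (count), LS = Σx, SS = Σx²; the sample values are hypothetical:

```python
import numpy as np

# Hypothetical 1-D sub-cluster (assumed values).
points = np.array([3.0, 4.0, 5.0])

# Clustering Feature: (N, linear sum, sum of squares).
N, LS, SS = len(points), points.sum(), (points ** 2).sum()

centroid = LS / N
# Radius R: root-mean-square distance of the points from the centroid,
# computable from the CF entry alone.
R = np.sqrt(SS / N - centroid ** 2)

print(N, LS, SS, centroid, R)
```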
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a clustering algorithm that groups together points that are closely
packed while marking outliers as noise. It is useful for discovering clusters of
varying shapes and handling noise in data.
DBSCAN Parameters
● Epsilon (ε): The radius around a point to consider neighbors.
● MinPts: The minimum number of points required in an ε-neighborhood to form a dense region.
Steps in DBSCAN
1. Label all points as unvisited.
2. Select an unvisited point:
○ If it has at least MinPts neighbors within ε, it starts a new cluster.
○ If not, mark it as noise (this might later be changed if it belongs to another cluster).
3. Expand the cluster:
○ Add all density-reachable points to the cluster.
○ Recursively check their neighbors.
4. Repeat until all points are visited.
5. Return the clusters and noise points.
Example
Let’s set:
● Epsilon (ε) = 2
● MinPts = 3 (a core point needs at least 3 other points within its ε-neighborhood)
Step 2: Identify Core, Border, and Noise Points
● Core Points (at least 3 neighbors within ε = 2):
○ B (Neighbors: {A, C, D})
○ C (Neighbors: {A, B, D})
● Border Points (Density-reachable from a core point but not core itself):
○ A (Neighbors: {B, C} → fewer than 3) ➝ Border
○ D (Neighbors: {B, C} → fewer than 3) ➝ Border
● Noise Points (Not density-reachable):
○ E (Neighbors: {F} → Less than 3)
○ F (Neighbors: {E} → Less than 3)
Step 3: Form Clusters
● Cluster 1: {A, B, C, D} (Core points B & C, border points A & D)
● No second cluster forms: E and F have too few neighbors to meet MinPts, so they
remain noise.
Output: Clusters Formed: {A, B, C, D} as one cluster, E and F remain noise.
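A minimal sketch of this example with scikit-learn's DBSCAN. The coordinates for A–F are assumptions chosen so that the ε = 2 neighborhoods match the lists above, and min_samples is set to 4 because scikit-learn counts the point itself (MinPts = 3 neighbors + the point):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical coordinates for A, B, C, D, E, F (assumed for illustration).
X = np.array([[1, 2], [2, 2], [2, 3], [3, 3], [8, 8], [9, 9]])

# eps = 2; min_samples = 4 = the point itself + 3 neighbors.
labels = DBSCAN(eps=2, min_samples=4).fit_predict(X)

print(labels)  # expected: [0, 0, 0, 0, -1, -1] -> {A, B, C, D} in one cluster, E and F noise
```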
Density Reachable in DBSCAN
In DBSCAN, a point P is directly density-reachable from a point Q if:
1. Q is a core point (it has at least MinPts neighbors within ε), and
2. P is within the ε-neighborhood of Q.
More generally, P is density-reachable from Q if there exists a chain of points from Q
to P in which each step is directly density-reachable (so every intermediate point is a
core point).
Important:
● Core points can reach other core or border points.
● Border points cannot reach other points.
● Noise points are not reachable from any other points.
STING (Statistical Information Grid) Clustering
STING (Statistical Information Grid) is a grid-based clustering technique that
divides the data space into hierarchical rectangular cells. It is useful for large spatial
datasets because it processes data efficiently without scanning all data points
directly.
Step 1: Hierarchical Grid Structure
STING divides the data space into multiple levels of grid cells.
Level 1 (Top Level) → One Large Grid Covering the Entire Data Space
Level 2 (Middle) → Divides into 4 Large Cells
Level 3 (Bottom) → Further Divides Each into Sub-Cells
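A minimal sketch of the grid idea, assuming 2-D data in the unit square and using numpy's histogram2d as a simplified stand-in for STING's per-cell statistics (real STING cells also store mean, variance, min, max, and a distribution type):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((500, 2))  # hypothetical 2-D data in the unit square

# Level 2: 2x2 grid (4 large cells); Level 3: 4x4 grid (16 sub-cells).
coarse, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=2, range=[[0, 1], [0, 1]])
fine, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=4, range=[[0, 1], [0, 1]])

# Only the per-cell counts are kept here for brevity.
print(coarse)
print(fine)
```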
Assessing Clustering Tendency with Hopkins Statistic
The Hopkins Statistic is a method used to test whether a dataset has a
non-random (clustered) structure or is uniformly distributed. It helps to
determine whether clustering algorithms are likely to be meaningful on a
dataset.
Steps:
1. Given a dataset D, sample n random data points from it: p1, p2, ..., pn.
2. For each pi, compute the distance to its nearest neighbor in D (excluding pi itself) → gives values xi.
3. Sample another n artificial (random) points uniformly in the data space: q1, q2, ..., qn.
4. For each qi, compute the distance to its nearest neighbor in D → gives values yi.
5. Compute the Hopkins Statistic: H = Σ yi / (Σ xi + Σ yi).
If D is uniformly distributed, Σ xi and Σ yi will be close to each other and H will be close to 0.5.
If D has a strong clustering tendency, Σ xi will be much smaller than Σ yi, so H will be close to 1.
Interpretation: Since H ≈ 0.72, this indicates the data has a strong clustering
tendency.
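A minimal sketch of this computation, assuming 2-D data and using scikit-learn's NearestNeighbors; the sample size m and uniform sampling over the data's bounding box are choices made for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, m=50, seed=0):
    """Hopkins statistic H = sum(y) / (sum(x) + sum(y)); H near 0.5 -> uniform,
    H near 1 -> clustered (under the convention used above)."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    m = min(m, n - 1)

    # x_i: distance from sampled real points to their nearest OTHER point in D.
    idx = rng.choice(n, size=m, replace=False)
    nn = NearestNeighbors(n_neighbors=2).fit(D)
    x = nn.kneighbors(D[idx], return_distance=True)[0][:, 1]

    # y_i: distance from artificial uniform points (in D's bounding box) to D.
    Q = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    y = nn.kneighbors(Q, n_neighbors=1, return_distance=True)[0][:, 0]

    return y.sum() / (x.sum() + y.sum())

# Hypothetical clustered data: two Gaussian blobs.
rng = np.random.default_rng(1)
D = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
print(hopkins(D))  # expected to be well above 0.5 for clustered data
```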
Empirical Method:
A simple empirical rule of thumb is to set the number of clusters to about √(n/2) for a dataset of n points.
Elbow Method
Let's assume a simple 2D dataset with 300 points (e.g., customer data with Age vs. Spending
Score).
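A minimal sketch of the elbow computation on such a dataset, using KMeans and its inertia_ attribute (the sum of squared distances, SSD); the synthetic blob data is a stand-in for the customer dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 300-point 2-D dataset (stand-in for Age vs. Spending Score).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# SSD (inertia) for k = 1..9; look for the "elbow" where the drop flattens.
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))
```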
Cross Validation Method
Step 3: Interpret
Even though SSD continues to decrease, the drop after k=4 is minimal.
Optimal k = 4
Example: