Unit 4 Cluster Analysis 4

The document discusses various clustering techniques, including probabilistic model-based clustering, BIRCH, DBSCAN, STING, and methods for assessing clustering tendency like the Hopkins Statistic. It outlines the steps involved in each clustering method, emphasizing the probabilistic nature of assignments, hierarchical structures, density-based clustering, and grid-based approaches. Additionally, it includes practical examples and methods for determining optimal cluster numbers.


Probabilistic Model based Clustering

Probabilistic clustering assigns data points to clusters based on probability distributions. Instead of hard assignments (like k-means), each point belongs to multiple clusters with certain probabilities.

Steps:
1. Assume Data Follows a Distribution
   ○ Typically, Gaussian (Normal) distributions are used.
2. Estimate Parameters for Each Cluster
   ○ Each cluster is defined by a mean (μ) and covariance (Σ).
3. Assign Probabilities to Each Point
   ○ Each point is assigned to multiple clusters with different probabilities (its "responsibilities").
4. Optimize Using Expectation-Maximization (EM)
   ○ EM refines the cluster parameters to maximize the likelihood of the data.
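As a rough illustration of steps 3 and 4, here is a minimal numpy sketch of a single EM iteration for a two-component 1-D Gaussian mixture. The data values and the initial parameters are illustrative assumptions, not values taken from the notes.

```python
import numpy as np

# Illustrative 1-D data (not the P1..P6 of the worked example below).
x = np.array([1.0, 1.5, 2.0, 8.0, 8.5, 9.0])

# Initial guesses for the parameters (step 2).
mu = np.array([2.0, 7.0])          # means
var = np.array([1.0, 1.0])         # variances
w = np.array([0.5, 0.5])           # mixing weights

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# E-step (step 3): responsibility of each cluster for each point.
num = w * gauss(x[:, None], mu, var)           # shape (n_points, 2)
resp = num / num.sum(axis=1, keepdims=True)

# M-step (step 4): re-estimate weights, means and variances from the responsibilities.
Nk = resp.sum(axis=0)
w_new = Nk / len(x)
mu_new = (resp * x[:, None]).sum(axis=0) / Nk
var_new = (resp * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk

print("responsibilities:\n", resp.round(3))
print("new weights:", w_new.round(3))
print("new means:", mu_new.round(3))
print("new variances:", var_new.round(3))
```

Repeating the E-step and M-step until the parameters stop changing is exactly the iteration described in step 5 of the example below.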

Example:

(Steps 1–3 of the example initialize the cluster parameters and compute each point's responsibility, i.e. its membership probability, for each cluster.)

Step 4: Update Parameters (Maximization Step)

Using the computed responsibilities:

● The new means shift based on the probabilities.
● The new variances are recomputed.
● The new weights are adjusted.

Step 5: Repeat Until Convergence

We repeat Steps 2–4 until the clusters stabilize.

Final Cluster Assignment:

● Cluster 1: {P1, P2, P3}
● Cluster 2: {P4, P5, P6}

Note:

❖ Probabilistic assignment: each point belongs partially to both clusters.
❖ Iterative EM updates: the new means and variances shift towards the data points.
❖ Soft clustering: unlike k-means, a GMM handles overlapping clusters better.
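For a practical version of the same idea, here is a minimal sketch using scikit-learn's GaussianMixture, which runs the EM procedure internally and exposes the soft (probabilistic) assignments. The six 1-D values are illustrative stand-ins for P1–P6, not the actual data from the notes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative 1-D points standing in for P1..P6.
X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])

gmm = GaussianMixture(n_components=2, random_state=0)   # EM runs inside fit()
gmm.fit(X)

print("Means:", gmm.means_.ravel())            # cluster means (μ) after EM
print("Variances:", gmm.covariances_.ravel())  # cluster variances (Σ)
print("Weights:", gmm.weights_)                # mixing weights

# Soft (probabilistic) assignments: each row gives P(cluster | point).
print("Responsibilities:\n", gmm.predict_proba(X).round(3))
print("Hard labels:", gmm.predict(X))          # argmax of the responsibilities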


BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)

- Hierarchical clustering for large datasets


BIRCH constructs a Clustering Feature Tree (CF Tree) to store summary statistics
instead of keeping all data points.

Construction of the CF Tree:

The Clustering Feature (CF) Tree is a hierarchical structure that stores summary
statistics of data points in clusters.

Instead of the distance to the centroid, the cluster radius R can be used to decide whether a new point is absorbed into an existing cluster (see the sketch below).


Exercise: solve the remaining steps.
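As a sketch of the summary statistics a CF entry holds, the snippet below computes CF = (N, LS, SS) for a small set of illustrative 2-D points (not the ones from the notes) and derives the centroid and the radius R from it. The absorption threshold T and the actual tree insertion logic are assumptions left out here.

```python
import numpy as np

def cf(points):
    """Clustering Feature of a set of points: CF = (N, LS, SS)."""
    points = np.asarray(points, dtype=float)
    N = len(points)
    LS = points.sum(axis=0)          # linear sum (a vector)
    SS = (points ** 2).sum()         # sum of squared norms (a scalar)
    return N, LS, SS

def centroid_and_radius(N, LS, SS):
    """Centroid and radius R derived from a CF entry alone."""
    c = LS / N
    R = np.sqrt(max(SS / N - c @ c, 0.0))   # R = sqrt(SS/N - ||LS/N||^2)
    return c, R

def merge(cf1, cf2):
    """CF entries are additive, so two sub-clusters merge by adding their entries."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

# Illustrative 2-D points.
N, LS, SS = cf([(1, 1), (2, 1), (1, 2)])
c, R = centroid_and_radius(N, LS, SS)
print("CF =", (N, LS, SS), " centroid =", c, " radius R =", round(R, 3))
# A new point is absorbed into a leaf entry only if the updated R stays below the threshold T.
```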
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a clustering algorithm that groups together points that are closely
packed while marking outliers as noise. It is useful for discovering clusters of
varying shapes and handling noise in data.
DBSCAN Parameters
● Epsilon (ε): the radius around a point within which to look for neighbors.
● MinPts: the minimum number of points required in an ε-neighborhood to form a dense region.

Steps in DBSCAN

1. Label all points as unvisited.
2. Select an unvisited point:
   ○ If it has at least MinPts neighbors within ε, it starts a new cluster.
   ○ If not, mark it as noise (this might later be changed if it belongs to another cluster).
3. Expand the cluster:
   ○ Add all density-reachable points to the cluster.
   ○ Recursively check their neighbors.
4. Repeat until all points are visited.
5. Return the clusters and noise points.

Example

Let's set:

● Epsilon (ε) = 2
● MinPts = 3 (a core point needs at least 3 neighboring points within ε, not counting the point itself)
Step 2: Identify Core, Border, and Noise Points

● Core points (at least 3 neighbors within ε = 2):
   ○ B (neighbors: {A, C, D})
   ○ C (neighbors: {A, B, D})
● Border points (density-reachable from a core point but not core themselves):
   ○ A (neighbors: {B, C} → fewer than 3) ➝ border
   ○ D (neighbors: {B, C} → fewer than 3) ➝ border
● Noise points (not density-reachable from any core point):
   ○ E (neighbors: {F} → fewer than 3)
   ○ F (neighbors: {E} → fewer than 3)

Step 3: Form Clusters

● Cluster 1: {A, B, C, D} (core points B and C, border points A and D)
● Noise: {E, F} (they do not meet MinPts and are not reachable from any core point, so they remain noise)

Output: Clusters formed: {A, B, C, D} as one cluster; E and F remain noise.
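The same result can be reproduced with scikit-learn's DBSCAN, assuming hypothetical coordinates for A–F chosen so that the neighborhoods match the example (the notes do not give the actual values). Note that sklearn's min_samples counts the point itself, so "3 neighbors besides the point" corresponds to min_samples = 4 here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical coordinates for A-F, picked to reproduce the neighborhoods above.
points = {"A": (1, 1), "B": (2, 1), "C": (2, 2), "D": (3, 2), "E": (8, 8), "F": (9, 9)}
X = np.array(list(points.values()), dtype=float)

# min_samples = 4 because sklearn counts the point itself in the neighborhood.
labels = DBSCAN(eps=2.0, min_samples=4).fit_predict(X)

for name, label in zip(points, labels):
    print(name, "noise" if label == -1 else f"cluster {label}")
# Expected: A, B, C, D share cluster 0; E and F are labeled noise (-1).
```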

Density Reachable in DBSCAN

In DBSCAN, a point P is density-reachable from another point Q if:

1. Q is a core point (it has at least MinPts neighbors within ε), and
2. P is within the ε-neighborhood of Q (direct density-reachability), or
3. There exists a chain of core points leading from Q to P, where each point in the chain lies within ε of the previous one.

Important:

● Core points can reach other core or border points.
● Border points cannot reach other points (reachability only starts from core points).
● Noise points are not density-reachable from any other point.

STING (Statistical Information Grid) Clustering


STING (Statistical Information Grid) is a grid-based clustering technique that
divides the data space into hierarchical rectangular cells. It is useful for large spatial
datasets because it processes data efficiently without scanning all data points
directly.
Step 1:

Hierarchical Grid Structure

STING divides the data space into multiple levels of grid cells.

Level 1 (Top Level) → One Large Grid Covering the Entire Data Space

Level 2 (Middle) → Divides into 4 Large Cells

Level 3 (Bottom) → Further Divides Each into Sub-Cells
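As a rough sketch of the statistics STING precomputes, the snippet below bins illustrative 2-D points into a coarse 2 × 2 grid (Level 2) and a finer 4 × 4 grid (Level 3) and stores the count, mean, and standard deviation of each cell. The data, the grid bounds, and the number of levels are assumptions for illustration only.

```python
import numpy as np

# Illustrative 2-D points with two dense regions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((2, 2), 0.3, (50, 2)),
               rng.normal((7, 7), 0.3, (50, 2))])

def grid_stats(X, cells_per_side, lo=0.0, hi=10.0):
    """Count, mean and std of the points falling in each cell of a uniform grid."""
    edges = np.linspace(lo, hi, cells_per_side + 1)
    stats = {}
    for i in range(cells_per_side):
        for j in range(cells_per_side):
            in_cell = ((X[:, 0] >= edges[i]) & (X[:, 0] < edges[i + 1]) &
                       (X[:, 1] >= edges[j]) & (X[:, 1] < edges[j + 1]))
            pts = X[in_cell]
            if len(pts):
                stats[(i, j)] = (len(pts),
                                 pts.mean(axis=0).round(2),
                                 pts.std(axis=0).round(2))
    return stats

# Level 2: 4 large cells; Level 3: each further divided into sub-cells.
print("Level 2 cells:", grid_stats(X, 2))
print("Level 3 cells:", grid_stats(X, 4))
# A query drills down from the top level and visits only the cells whose
# statistics are relevant, instead of scanning every data point.
```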


Assessing Clustering Tendency with Hopkins Statistic

The Hopkins Statistic is a method used to test whether a dataset has a non-random (clustered) structure or is uniformly distributed. It helps determine whether clustering algorithms are likely to be meaningful on a dataset.
Steps:
1. Given a dataset D, sample n data points from it: p1, p2, ..., pn.
2. For each pi, compute the distance to its nearest neighbor in D (excluding pi itself) → gives values xi.
3. Sample another n artificial points q1, q2, ..., qn uniformly at random in the data space.
4. For each qi, compute the distance to its nearest neighbor in D → gives values yi.
5. Compute the Hopkins Statistic H:

H = Σ yi / (Σ xi + Σ yi)

If D is uniformly distributed, Σ xi and Σ yi will be close to each other and H will be close to 0.5.

If D is highly clustered, the nearest-neighbor distances xi within the real data will be much smaller than the yi, so H will be close to 1.

Interpretation: Since H ≈ 0.72, this indicates the data has a strong clustering
tendency.
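A minimal sketch of this computation, assuming Euclidean distance and sampling the artificial points uniformly inside the bounding box of the data (the function name and the datasets below are illustrative, not from the notes):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(D, n=None, random_state=0):
    """Estimate H = sum(y) / (sum(x) + sum(y)) for dataset D."""
    rng = np.random.default_rng(random_state)
    D = np.asarray(D, dtype=float)
    if n is None:
        n = max(1, int(0.1 * len(D)))        # sample roughly 10% of the points

    nn = NearestNeighbors(n_neighbors=2).fit(D)

    # x_i: distance from sampled real points to their nearest *other* point in D.
    sample_idx = rng.choice(len(D), size=n, replace=False)
    x_dist, _ = nn.kneighbors(D[sample_idx], n_neighbors=2)
    x = x_dist[:, 1]                          # column 0 is the point itself (distance 0)

    # y_i: distance from uniform random points in the bounding box to the nearest real point.
    lo, hi = D.min(axis=0), D.max(axis=0)
    Q = rng.uniform(lo, hi, size=(n, D.shape[1]))
    y_dist, _ = nn.kneighbors(Q, n_neighbors=1)
    y = y_dist[:, 0]

    return y.sum() / (x.sum() + y.sum())

rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
uniform = rng.uniform(-5, 15, (200, 2))
print("H (clustered):", round(hopkins_statistic(clustered), 2))   # well above 0.5
print("H (uniform):  ", round(hopkins_statistic(uniform), 2))     # close to 0.5
```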
Empirical Method:

Elbow Method

Let's assume a simple 2D dataset with 300 points (e.g., customer data with Age vs. Spending Score).

Step 1: Run k-means for several values of k (e.g., k = 1 to 10) and record the total within-cluster sum of squared distances (SSD) for each.

Step 2: Plot SSD against k and look for the point where the curve flattens (a sketch of Steps 1–2 is shown below).
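A minimal sketch of Steps 1–2, assuming a synthetic 300-point dataset generated with make_blobs as a stand-in for the Age vs. Spending Score data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D dataset standing in for "Age vs. Spending Score" (300 points).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

ssd = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    ssd[k] = km.inertia_        # total within-cluster sum of squared distances (SSD)

for k, v in ssd.items():
    print(f"k={k:2d}  SSD={v:10.1f}")
# The "elbow" is where the SSD curve flattens; for this data it appears around k = 4.
```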
Step 3: Interpret

Even though the SSD continues to decrease, the drop after k = 4 is minimal.

Optimal k = 4

Cross Validation Method

Another empirical way to choose k is cross-validation: divide the data into m parts, build the clustering model on m - 1 parts, and measure how well the held-out part fits the resulting clusters (for example, the sum of squared distances from the held-out points to their closest centroids); the k with the best average quality over the m folds is chosen.
Example:
