Probabilistic Model-Based Clustering
Probabilistic clustering assigns data points to clusters based on
probability distributions. Instead of making hard assignments (as in k-means),
it lets each point belong to multiple clusters with different probabilities.
Steps:
1. Assume Data Follows a Distribution
○ Typically, Gaussian (Normal) distributions are used.
2. Estimate Parameters for Each Cluster
○ Each cluster is defined by mean (μ) and covariance (Σ).
3. Assign Probabilities to Each Point
○ Each point is assigned to multiple clusters with different probabilities.
4. Optimize Using Expectation-Maximization (EM)
○ EM refines cluster parameters to maximize likelihood.
Example:
Step 4: Update Parameters (Maximization Step)
Using the computed responsibilities,
● New means will shift based on probabilities.
● New variances will be recomputed.
● New weights will be adjusted.
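The maximization step can be written directly in numpy. A minimal sketch follows, assuming one-dimensional data for P1–P6 and a responsibility matrix R (one column per cluster) already computed in the expectation step; all numeric values are hypothetical:

```python
import numpy as np

# Hypothetical 1-D points P1..P6 and E-step responsibilities (assumed values).
X = np.array([1.0, 1.5, 2.0, 8.0, 8.5, 9.0])
R = np.array([[0.99, 0.01],
              [0.98, 0.02],
              [0.97, 0.03],
              [0.02, 0.98],
              [0.01, 0.99],
              [0.01, 0.99]])

Nk = R.sum(axis=0)                                             # effective cluster sizes
means = (R * X[:, None]).sum(axis=0) / Nk                      # new means (responsibility-weighted averages)
variances = (R * (X[:, None] - means) ** 2).sum(axis=0) / Nk   # new variances
weights = Nk / len(X)                                          # new mixing weights

print(means, variances, weights)
```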
Step 5: Repeat Until Convergence
We repeat Steps 2–4 until the clusters stabilize.
Final Cluster Assignment:
● Cluster 1: { P1, P2, P3 }
● Cluster 2: { P4, P5, P6 }
Note:
❖Probabilistic assignment: Each point belongs partially to both clusters.
❖Iterative EM updates: New means and variances shift towards data points.
❖Soft clustering: Unlike K-means, GMM handles overlapping clusters better.
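For comparison, a minimal sketch of soft clustering with scikit-learn's GaussianMixture, which runs the full EM loop internally; the two-component setup and the data values are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D points P1..P6 (assumed values).
X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)            # estimated cluster means
print(gmm.predict_proba(X))  # soft (probabilistic) assignment of each point
print(gmm.predict(X))        # hard labels taken from the highest probability
```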
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
- Hierarchical Clustering for large datasets
BIRCH constructs a Clustering Feature Tree (CF Tree) to store summary statistics
instead of keeping all data points.
Construction of the CF Tree:
The Clustering Feature (CF) Tree is a hierarchical structure that stores summary
statistics of the data points in each sub-cluster. Each CF entry is a triple (N, LS, SS):
the number of points, their linear sum, and their sum of squares.
Instead of a raw distance, the cluster radius R (computed directly from the CF entry)
can be used to decide whether a new point is absorbed into an existing sub-cluster.
(Exercise: work through the remaining construction steps.)
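A minimal sketch of one CF entry and its radius R, assuming 1-D points and the standard definitions N (count), LS = Σx, SS = Σx²; the sample values are hypothetical:

```python
import numpy as np

# Hypothetical 1-D sub-cluster (assumed values).
points = np.array([3.0, 4.0, 5.0])

# Clustering Feature: (N, linear sum, sum of squares).
N, LS, SS = len(points), points.sum(), (points ** 2).sum()

centroid = LS / N
# Radius R: root-mean-square distance of the points from the centroid,
# computable from the CF entry alone.
R = np.sqrt(SS / N - centroid ** 2)

print(N, LS, SS, centroid, R)
```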
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a clustering algorithm that groups together points that are closely
packed while marking outliers as noise. It is useful for discovering clusters of
varying shapes and handling noise in data.
DBSCAN Parameters
● Epsilon (ε): The radius around a point to consider neighbors.
● MinPts: The minimum number of points required in an ε-neighborhood to form a dense region.
Steps in DBSCAN
1. Label all points as unvisited.
2. Select an unvisited point:
○ If it has at least MinPts neighbors within ε, it starts a new cluster.
○ If not, mark it as noise (this might later be changed if it belongs to another cluster).
3. Expand the cluster:
○ Add all density-reachable points to the cluster.
○ Recursively check their neighbors.
4. Repeat until all points are visited.
5. Return the clusters and noise points.
Example
Let’s set:
● Epsilon (ε) = 2
● MinPts = 3 (a core point needs at least 3 other points within its ε-neighborhood)
Step 2: Identify Core, Border, and Noise Points
● Core Points (at least 3 neighbors within ε = 2):
○ B (Neighbors: {A, C, D})
○ C (Neighbors: {A, B, D})
● Border Points (Density-reachable from a core point but not core itself):
○ A (Neighbors: {B, C} → fewer than 3) ➝ Border
○ D (Neighbors: {B, C} → fewer than 3) ➝ Border
● Noise Points (Not density-reachable):
○ E (Neighbors: {F} → Less than 3)
○ F (Neighbors: {E} → Less than 3)
Step 3: Form Clusters
● Cluster 1: {A, B, C, D} (Core points B & C, border points A & D)
● No second cluster forms: E and F have too few neighbors to meet MinPts, so they
remain noise.
Output: Clusters Formed: {A, B, C, D} as one cluster, E and F remain noise.
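A minimal sketch of this example with scikit-learn's DBSCAN. The coordinates for A–F are assumptions chosen so that the ε = 2 neighborhoods match the lists above, and min_samples is set to 4 because scikit-learn counts the point itself (MinPts = 3 neighbors + the point):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical coordinates for A, B, C, D, E, F (assumed for illustration).
X = np.array([[1, 2], [2, 2], [2, 3], [3, 3], [8, 8], [9, 9]])

# eps = 2; min_samples = 4 = the point itself + 3 neighbors.
labels = DBSCAN(eps=2, min_samples=4).fit_predict(X)

print(labels)  # expected: [0, 0, 0, 0, -1, -1] -> {A, B, C, D} in one cluster, E and F noise
```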
Density Reachable in DBSCAN
In DBSCAN, a point P is directly density-reachable from a point Q if:
1. Q is a core point (it has at least MinPts neighbors within ε), and
2. P is within the ε-neighborhood of Q.
More generally, P is density-reachable from Q if there exists a chain of points from Q
to P in which each step is directly density-reachable (so every intermediate point is a
core point).
Important:
● Core points can reach other core or border points.
● Border points cannot reach other points.
● Noise points are not reachable from any other points.
STING (Statistical Information Grid) Clustering
STING (Statistical Information Grid) is a grid-based clustering technique that
divides the data space into hierarchical rectangular cells. It is useful for large spatial
datasets because it processes data efficiently without scanning all data points
directly.
Step 1: Hierarchical Grid Structure
STING divides the data space into multiple levels of grid cells.
Level 1 (Top Level) → One Large Grid Covering the Entire Data Space
Level 2 (Middle) → Divides into 4 Large Cells
Level 3 (Bottom) → Further Divides Each into Sub-Cells
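A minimal sketch of the grid idea, assuming 2-D data in the unit square and using numpy's histogram2d as a simplified stand-in for STING's per-cell statistics (real STING cells also store mean, variance, min, max, and a distribution type):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((500, 2))  # hypothetical 2-D data in the unit square

# Level 2: 2x2 grid (4 large cells); Level 3: 4x4 grid (16 sub-cells).
coarse, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=2, range=[[0, 1], [0, 1]])
fine, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=4, range=[[0, 1], [0, 1]])

# Only the per-cell counts are kept here for brevity.
print(coarse)
print(fine)
```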
Assessing Clustering Tendency with Hopkins Statistic
The Hopkins Statistic is a method used to test whether a dataset has a
non-random (clustered) structure or is uniformly distributed. It helps to
determine whether clustering algorithms are likely to be meaningful on a
dataset.
Steps:
1. Given a dataset D, sample n random data points from it: p1, p2, ..., pn.
2. For each pi, compute the distance to its nearest neighbor in D (excluding pi itself) → gives values xi.
3. Sample another n artificial (random) points uniformly in the data space: q1, q2, ..., qn.
4. For each qi, compute the distance to its nearest neighbor in D → gives values yi.
5. Compute the Hopkins Statistic: H = Σ yi / (Σ xi + Σ yi).
If D is uniformly distributed, Σ xi and Σ yi will be close to each other and H will be close to 0.5.
If D has a strong clustering tendency, Σ xi will be much smaller than Σ yi, so H will be close to 1.
Interpretation: Since H ≈ 0.72, this indicates the data has a strong clustering
tendency.
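A minimal sketch of this computation, assuming 2-D data and using scikit-learn's NearestNeighbors; the sample size m and uniform sampling over the data's bounding box are choices made for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, m=50, seed=0):
    """Hopkins statistic H = sum(y) / (sum(x) + sum(y)); H near 0.5 -> uniform,
    H near 1 -> clustered (under the convention used above)."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    m = min(m, n - 1)

    # x_i: distance from sampled real points to their nearest OTHER point in D.
    idx = rng.choice(n, size=m, replace=False)
    nn = NearestNeighbors(n_neighbors=2).fit(D)
    x = nn.kneighbors(D[idx], return_distance=True)[0][:, 1]

    # y_i: distance from artificial uniform points (in D's bounding box) to D.
    Q = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    y = nn.kneighbors(Q, n_neighbors=1, return_distance=True)[0][:, 0]

    return y.sum() / (x.sum() + y.sum())

# Hypothetical clustered data: two Gaussian blobs.
rng = np.random.default_rng(1)
D = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
print(hopkins(D))  # expected to be well above 0.5 for clustered data
```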
Empirical Method:
A simple empirical rule of thumb is to set the number of clusters to about √(n/2) for a dataset of n points.
Elbow Method
Let's assume a simple 2D dataset with 300 points (e.g., customer data with Age vs. Spending
Score).
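A minimal sketch of the elbow computation on such a dataset, using KMeans and its inertia_ attribute (the sum of squared distances, SSD); the synthetic blob data is a stand-in for the customer dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 300-point 2-D dataset (stand-in for Age vs. Spending Score).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# SSD (inertia) for k = 1..9; look for the "elbow" where the drop flattens.
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))
```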
Cross Validation Method
Step 3: Interpret
Even though SSD continues to decrease, the drop after k=4 is minimal.
Optimal k = 4
Example: