Clustering
Introduction to Clustering
Clustering is an essential unsupervised learning technique used in data
analysis to group similar data points into clusters based on certain
characteristics or features. The goal of clustering is to identify patterns or
structures in data without predefined labels. These methods are widely used in
fields such as marketing (customer segmentation), biology (gene expression
analysis), and social network analysis.
Core Concepts of Clustering
1. Clusters: Collections of data points grouped together based on similarity.
2. Similarity/Dissimilarity: Measured using distance metrics such as
Euclidean distance, Manhattan distance, or cosine similarity (see the
sketch after this list).
3. Applications: Data compression, anomaly detection, and exploratory data
analysis.
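As a quick illustration, here is a minimal NumPy sketch of the three measures named above; the vectors a and b are made up purely for this example:

```python
import numpy as np

# Two made-up feature vectors, purely for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line distance between the points.
print(np.linalg.norm(a - b))

# Manhattan distance: sum of absolute coordinate differences.
print(np.sum(np.abs(a - b)))

# Cosine similarity: compares direction, ignoring magnitude.
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```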
How to cut the dendrogram to identify the number of clusters:
a. Specify the Number of Clusters
Decide beforehand how many clusters you want (e.g., k).
Cut the tree at the height where there are exactly k branches (clusters)
below the cut.
This is a straightforward method if you have a target number of clusters in
mind (see the sketch below).
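A minimal sketch of this cut using SciPy's fcluster; Ward linkage and the toy random data are just example choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data; replace with your own feature matrix.
X = np.random.rand(20, 2)

# Build the dendrogram (Ward linkage, one common choice).
Z = linkage(X, method="ward")

# Cut so that exactly k = 3 clusters remain below the cut.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```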
b. Use a Dissimilarity Threshold: in other words, you cut the tree wherever
a branch's vertical length passes a chosen level, since the merge height
differs from one cluster join to the next (don't ask me exactly how that
height is computed).
Define a maximum allowable distance (or dissimilarity) for merging clusters.
Cut the tree at this threshold height.
Any cluster merging above this height is not allowed, resulting in multiple
clusters.
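A sketch of the same idea with fcluster's distance criterion; the threshold 1.5 here is an arbitrary example value, not a recommendation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)
Z = linkage(X, method="ward")

# Disallow any merge above height 1.5; points still connected
# below that threshold end up in the same cluster.
labels = fcluster(Z, t=1.5, criterion="distance")
print(np.unique(labels).size, "clusters")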
c. Highest Jump (Elbow Method in Dendrograms) (a bad method)
After hierarchical clustering, examine the dendrogram for the largest
vertical gap (or "jump") in the linkage distance.
This jump indicates a significant dissimilarity between clusters. By cutting
the dendrogram just before this jump, you can determine a reasonable
number of clusters.
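One possible sketch of locating that largest jump programmatically, again on toy data; the halfway cut point is just one reasonable convention:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)
Z = linkage(X, method="ward")

# Merge heights live in the third column of the linkage matrix.
heights = Z[:, 2]
jumps = np.diff(heights)

# Cut halfway between the two merges with the largest gap.
i = np.argmax(jumps)
cut = (heights[i] + heights[i + 1]) / 2
labels = fcluster(Z, t=cut, criterion="distance")
print(np.unique(labels).size, "clusters")
```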
ELBOW METHOD (don't focus on it too much):
It is based on the within-cluster sum of squares (WCSS).
Plot the WCSS against the number of clusters; the point where the curve
bends (the "elbow"), after which the WCSS stops dropping sharply, is a
good choice for the number of clusters.
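A minimal sketch of the elbow plot with scikit-learn, where KMeans.inertia_ gives the WCSS directly; the data and the range of k values are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is exactly the WCSS

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()
```

You then look for the k where the curve visibly bends rather than waiting for the WCSS to reach zero (it only reaches zero when every point is its own cluster).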
Now, how do we evaluate the clustering?
Silhouette Analysis
Measures how similar each point is to its own cluster compared to other
clusters.
The silhouette score ranges from −1 to +1:
+1: Point is well-clustered.
0: Point is on the boundary between clusters.
−1: Point is likely misclassified.
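A minimal sketch with scikit-learn's silhouette_score, assuming k-means labels on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all points; closer to +1 is better.
print(silhouette_score(X, labels))
```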
Gap Statistic: the best method for finding the number of clusters
1. Compute Within-Cluster Dispersion (W_k):
Measure how compact the clusters are in your data.
2. Generate Random Reference Data:
Create multiple random datasets with the same dimensions and range
as your original data.
3. Calculate Dispersion for Random Data:
Cluster the random datasets for each k and compute their dispersion.
4. Calculate Gap:
Gap(k) = the average of log(W_k) over the random reference datasets minus
log(W_k) for your data. A larger gap indicates better clustering.
5. Choose Optimal k:
Select k where the gap is maximized or stabilizes significantly.
The gap statistic determines the optimal number of clusters (k) by
comparing the within-cluster dispersion of your data to that of random data. It
identifies how well-separated the clusters are compared to a random baseline.
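A minimal sketch of the whole procedure following the steps above; wcss and gap_statistic are hypothetical helper names, and the uniform bounding-box reference data is one common choice:

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(X, k, seed=0):
    # Within-cluster dispersion W_k, via the k-means inertia.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Reference datasets: uniform over the bounding box of X.
    ref_logs = [np.log(wcss(rng.uniform(lo, hi, size=X.shape), k))
                for _ in range(n_refs)]
    # Gap(k) = mean reference log-dispersion minus observed one.
    return np.mean(ref_logs) - np.log(wcss(X, k))

X = np.random.rand(200, 2)
for k in range(1, 8):
    print(k, round(gap_statistic(X, k), 3))
```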
What is Clustering?
Clustering is a method to group similar data points together based on their characteristics.
It’s used to find patterns in data without labels (unsupervised learning).
Examples: In marketing (to group similar customers), in biology (for gene analysis), or in social networks (to find similar
users).
Key Terms in Clustering:
Clusters: Groups of similar data points.
Similarity/Dissimilarity: Measures how close or far apart data points are from each other (e.g., using distance metrics like
Euclidean distance).
Applications: Used for things like compressing data, detecting unusual data points (anomalies), or exploring data.
How to Decide the Number of Clusters:
Specify the Number of Clusters:
Decide how many clusters you want (e.g., 3 clusters).
Cut the tree (dendrogram) at the point where there are exactly 3 branches.
Use a Dissimilarity Threshold:
Set a limit for how different clusters can be before they are considered separate.
If the distance between clusters is too high, don’t merge them.
Highest Jump (Elbow Method):
After clustering, look at the dendrogram and find the biggest jump in distance.
Cut just before this jump to find a reasonable number of clusters.
The Elbow Method helps you find the number of clusters by plotting how "spread out" the data is. When the spread stops
changing a lot, you’ve found the right number of clusters.
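To see these cuts visually, here is a sketch that draws the dendrogram itself with SciPy; the dashed line at height 1.5 is just an example cut, not a recommended value:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(20, 2)
Z = linkage(X, method="ward")

dendrogram(Z)
plt.axhline(y=1.5, linestyle="--")  # an example cut height
plt.ylabel("merge distance")
plt.show()
```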
Evaluating Clustering Quality:
Silhouette Analysis:
Measures how similar a point is to its own cluster compared to other clusters.
Score:
+1 = Well clustered.
0 = On the border between clusters.
-1 = Likely in the wrong cluster.
Gap Statistic:
Measures how well-separated your clusters are.
Steps:
Calculate how tight (compact) the clusters are.
Compare this with random data to see if your clusters are better.
If there’s a big gap, it means your clusters are good.
Summary:
Clustering groups similar data together to find hidden patterns.
You decide how many clusters to have, often by using methods like the Elbow Method or Gap Statistic.
To evaluate how good the clustering is, you can use Silhouette Analysis or the Gap Statistic.