Clustering
Introduction to Clustering
Clustering is an essential unsupervised learning technique used in data
analysis to group similar data points into clusters based on certain
characteristics or features. The goal of clustering is to identify patterns or
structures in data without predefined labels. These methods are widely used in
fields such as marketing (customer segmentation), biology (gene expression
analysis), and social network analysis.
Core Concepts of Clustering
1. Clusters: Collections of data points grouped together based on similarity.
2. Similarity/Dissimilarity: Measured using distance metrics such as
Euclidean distance, Manhattan distance, or cosine similarity (see the
sketch after this list).
3. Applications: Data compression, anomaly detection, and exploratory data
analysis.
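As a quick illustration, here is a minimal NumPy sketch of the three measures named above; the vectors a and b are made up purely for this example:

```python
import numpy as np

# Two made-up feature vectors, purely for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line distance between the points.
print(np.linalg.norm(a - b))

# Manhattan distance: sum of absolute coordinate differences.
print(np.sum(np.abs(a - b)))

# Cosine similarity: compares direction, ignoring magnitude.
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```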
How to cut the dendrogram to identify the number of clusters:
a. Specify the Number of Clusters
Decide beforehand how many clusters you want (e.g., k).
Cut the tree at the height where there are exactly k branches (clusters)
below the cut.
This is a straightforward method if you have a target number of clusters in
mind (see the sketch below).
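A minimal sketch of this cut using SciPy's fcluster; Ward linkage and the toy random data are just example choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data; replace with your own feature matrix.
X = np.random.rand(20, 2)

# Build the dendrogram (Ward linkage, one common choice).
Z = linkage(X, method="ward")

# Cut so that exactly k = 3 clusters remain below the cut.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```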
b. Use a Dissimilarity Threshold: in other words, you cut the tree wherever
a branch's vertical length passes a chosen level, since the merge height
differs from one cluster join to the next (don't ask me exactly how that
height is computed).
Define a maximum allowable distance (or dissimilarity) for merging clusters.
Cut the tree at this threshold height.
Any cluster merging above this height is not allowed, resulting in multiple
clusters.
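A sketch of the same idea with fcluster's distance criterion; the threshold 1.5 here is an arbitrary example value, not a recommendation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)
Z = linkage(X, method="ward")

# Disallow any merge above height 1.5; points still connected
# below that threshold end up in the same cluster.
labels = fcluster(Z, t=1.5, criterion="distance")
print(np.unique(labels).size, "clusters")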
c. Highest Jump (Elbow Method in Dendrograms) (a bad method)
After hierarchical clustering, examine the dendrogram for the largest
vertical gap (or "jump") in the linkage distance.
This jump indicates a significant dissimilarity between clusters. By cutting
the dendrogram just before this jump, you can determine a reasonable
number of clusters.
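One possible sketch of locating that largest jump programmatically, again on toy data; the halfway cut point is just one reasonable convention:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)
Z = linkage(X, method="ward")

# Merge heights live in the third column of the linkage matrix.
heights = Z[:, 2]
jumps = np.diff(heights)

# Cut halfway between the two merges with the largest gap.
i = np.argmax(jumps)
cut = (heights[i] + heights[i + 1]) / 2
labels = fcluster(Z, t=cut, criterion="distance")
print(np.unique(labels).size, "clusters")
```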
ELBOW METHOD (don't focus on it too much):
It is based on the within-cluster sum of squares (WCSS).
Plot the WCSS against the number of clusters; the point where the curve
bends (the "elbow"), after which the WCSS stops dropping sharply, is a
good choice for the number of clusters.
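A minimal sketch of the elbow plot with scikit-learn, where KMeans.inertia_ gives the WCSS directly; the data and the range of k values are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is exactly the WCSS

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()
```

You then look for the k where the curve visibly bends rather than waiting for the WCSS to reach zero (it only reaches zero when every point is its own cluster).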
Now, how do we evaluate the clustering?
Silhouette Analysis
Measures how similar each point is to its own cluster compared to other
clusters.
The silhouette score ranges from −1 to +1:
+1: Point is well-clustered.
0: Point is on the boundary between clusters.
−1: Point is likely misclassified.
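A minimal sketch with scikit-learn's silhouette_score, assuming k-means labels on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all points; closer to +1 is better.
print(silhouette_score(X, labels))
```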
Gap Statistic: the best method for finding the number of clusters
1. Compute Within-Cluster Dispersion (W_k):
Measure how compact the clusters are in your data.
2. Generate Random Reference Data:
Create multiple random datasets with the same dimensions and range
as your original data.
3. Calculate Dispersion for Random Data:
Cluster the random datasets for each k and compute their dispersion.
4. Calculate Gap:
Gap(k) = the average of log(W_k) over the random reference datasets minus
log(W_k) for your data. A larger gap indicates better clustering.
5. Choose Optimal k:
Select k where the gap is maximized or stabilizes significantly.
The gap statistic determines the optimal number of clusters (k) by
comparing the within-cluster dispersion of your data to that of random data. It
identifies how well-separated the clusters are compared to a random baseline.
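A minimal sketch of the whole procedure following the steps above; wcss and gap_statistic are hypothetical helper names, and the uniform bounding-box reference data is one common choice:

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(X, k, seed=0):
    # Within-cluster dispersion W_k, via the k-means inertia.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Reference datasets: uniform over the bounding box of X.
    ref_logs = [np.log(wcss(rng.uniform(lo, hi, size=X.shape), k))
                for _ in range(n_refs)]
    # Gap(k) = mean reference log-dispersion minus observed one.
    return np.mean(ref_logs) - np.log(wcss(X, k))

X = np.random.rand(200, 2)
for k in range(1, 8):
    print(k, round(gap_statistic(X, k), 3))
```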
What is Clustering?
Clustering is a method to group similar data points together based on their characteristics.
It’s used to find patterns in data without labels (unsupervised learning).
Examples: In marketing (to group similar customers), in biology (for gene analysis), or in social networks (to find similar
users).
Key Terms in Clustering:
Clusters: Groups of similar data points.
Similarity/Dissimilarity: Measures how close or far apart data points are from each other (e.g., using distance metrics like
Euclidean distance).
Applications: Used for things like compressing data, detecting unusual data points (anomalies), or exploring data.
How to Decide the Number of Clusters:
Specify the Number of Clusters:
Decide how many clusters you want (e.g., 3 clusters).
Cut the tree (dendrogram) at the point where there are exactly 3 branches.
Use a Dissimilarity Threshold:
Set a limit for how different clusters can be before they are considered separate.
If the distance between clusters is too high, don’t merge them.
Highest Jump (Elbow Method):
After clustering, look at the dendrogram and find the biggest jump in distance.
Cut just before this jump to find a reasonable number of clusters.
The Elbow Method helps you find the number of clusters by plotting how "spread out" the data is. When the spread stops
changing a lot, you’ve found the right number of clusters.
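To see these cuts visually, here is a sketch that draws the dendrogram itself with SciPy; the dashed line at height 1.5 is just an example cut, not a recommended value:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(20, 2)
Z = linkage(X, method="ward")

dendrogram(Z)
plt.axhline(y=1.5, linestyle="--")  # an example cut height
plt.ylabel("merge distance")
plt.show()
```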
Evaluating Clustering Quality:
Silhouette Analysis:
Measures how similar a point is to its own cluster compared to other clusters.
Score:
+1 = Well clustered.
0 = On the border between clusters.
-1 = Likely in the wrong cluster.
Gap Statistic:
Measures how well-separated your clusters are.
Steps:
Calculate how tight (compact) the clusters are.
Compare this with random data to see if your clusters are better.
If there’s a big gap, it means your clusters are good.
Summary:
Clustering groups similar data together to find hidden patterns.
You decide how many clusters to have, often by using methods like the Elbow Method or Gap Statistic.
To evaluate how good the clustering is, you can use Silhouette Analysis or the Gap Statistic.