
Chapter 4: Clustering

Credits to Y. Chalgham, Yassine Ghouil, and Kmar Abessi

1. Introduction to Clustering

Clustering is an unsupervised machine learning technique used to group similar data objects into clusters. Unlike classification, clustering
does not rely on labeled data.

● Cluster analysis research has focused extensively on distance-based methods.


● What is cluster analysis?
○ Clustering is also called data segmentation:
■ Clustering is finding the borders between groups.
■ Segmentation is using these borders to form groups.
● Example: in a class, you observe people eating different snacks. Clustering helps you see that there are three main snacks: chocolate,
bananas, and apples. Segmentation is when you decide to create groups based on these snacks.
○ Clustering is a method of creating segments.
○ It can also be used to detect outliers.

2. Clustering Techniques

1. K-Means Clustering

Overview:

K-Means is a partition-based clustering method that divides data into clusters based on the nearest mean (centroid).
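A minimal sketch of fitting K-Means with scikit-learn, assuming synthetic blob data and k = 3 (both are illustrative choices, not from the chapter):

```python
# Sketch: K-Means on synthetic data (illustrative parameters only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy data: 300 points drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters must be chosen in advance (one of K-Means' main limitations).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the 3 centroids (cluster means)
print(kmeans.inertia_)          # total within-cluster sum of squared distances
```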

Advantages:
● Simple and fast.
● Works well for large datasets.
● Suitable for convex-shaped clusters.

Disadvantages:
● Requires specifying k (the number of clusters must be predefined).
● Sensitive to outliers.
● Assumes clusters are spherical and of similar size.
How to Cluster Categorical Data?

● Replace the mean of each cluster with the mode of its data.
● Use a new dissimilarity measure to deal with categorical objects.
● Use a frequency-based method to update the modes of the clusters.
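These three points correspond to the K-Modes approach. A rough NumPy sketch of the two key ingredients, the simple-matching dissimilarity and the frequency-based mode update (function names and the toy data are illustrative):

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return int(np.sum(a != b))

def update_mode(cluster_points):
    """Frequency-based update: keep the most frequent category per attribute."""
    mode = []
    for col in cluster_points.T:                      # iterate over attributes
        values, counts = np.unique(col, return_counts=True)
        mode.append(values[np.argmax(counts)])
    return np.array(mode)

# Toy categorical data: each row is an object, each column an attribute.
X = np.array([["red", "small"], ["red", "large"], ["blue", "small"]])
print(matching_dissimilarity(X[0], X[1]))  # 1 attribute differs
print(update_mode(X))                      # ['red' 'small']
```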


1. Clusters Are Not Always Gaussian-Shaped

● K-Means assumes spherical clusters (circular in 2D, spherical in higher dimensions) because it uses Euclidean distance to assign
points to the nearest cluster centroid.
● Reality: Clusters can take arbitrary shapes, such as elongated (e.g., elliptical), irregular, or even non-convex shapes. For example,
K-Means may fail to capture the structure of data like two moons or concentric circles.

2. Clusters Should Have Similar Variance

● K-Means is biased towards clusters with similar size and variance because it minimizes the total within-cluster variance.
● Reality: If clusters have different variances (some tight and others spread out), K-Means may misplace the cluster centroids, resulting
in incorrect assignments.

3. Clusters Can Be Highly Imbalanced in Size

● K-Means struggles with unevenly sized clusters because the algorithm attempts to minimize the sum of squared distances without
considering cluster size.
● Reality: In datasets with imbalanced cluster sizes, smaller clusters may be merged into larger ones, or larger clusters may dominate
the centroid placement.

K-Means works by minimizing the total within-cluster variance (also called inertia), which is the sum of squared distances of points to their
cluster's centroid. This optimization assumes clusters are:
1. Spherical in shape
2. Equal in size
3. Equal in variance

If these assumptions are not met, K-Means may produce poor results.

When to Avoid K-Means

● Data with non-spherical clusters.


● Clusters of different densities, sizes, or variances.
● Situations where you need robust handling of outliers (K-Means is sensitive to outliers).

-> K-Means performs poorly when clusters are elongated, overlapping, or unevenly distributed, leading to incorrect groupings,
so we need alternatives when these assumptions are violated.
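A small sketch of this failure mode on the two-moons data mentioned earlier, comparing K-Means with a density-based method (the noise level and the ε and MinPts values are illustrative):

```python
# Sketch: K-Means vs. DBSCAN on non-convex "two moons" data.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means assumes convex, roughly spherical clusters and tends to cut each moon in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the density of the data and can recover both moons.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means agreement with truth:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN  agreement with truth:", adjusted_rand_score(y_true, db_labels))
```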

2. Hierarchical Clustering

Hierarchical clustering builds a nested hierarchy of clusters. It doesn’t require specifying the number of clusters in advance.

Types:

1. Agglomerative (Bottom-Up): Start with each point as a cluster and merge.


2. Divisive (Top-Down): Start with all points in one cluster and split.

Agglomerative Clustering (Bottom-Up):


1. Compute distances between all points.
2. Merge the two closest clusters.
3. Update the distance matrix based on a linkage criterion:
a. Single Linkage: the distance between two clusters is the shortest distance between any pair of points, one from each cluster.
b. Complete Linkage: the distance between two clusters is the longest distance between any pair of points, one from each cluster.
c. Average Linkage: the distance between two clusters is the average distance over all pairs of points, one from each cluster.
d. Centroid Linkage: the distance is calculated between the centroids (means) of the two clusters; the clusters with the closest centroids are merged.
e. Ward's Linkage: (the instructor said to study this one on your own)


● Ward's linkage is a hierarchical clustering criterion that aims to minimize the variance (squared differences) within clusters as they are merged
during the clustering process. It is one of the most commonly used methods in hierarchical clustering because it tends to create
compact and well-separated clusters.
● When choosing which clusters to merge, Ward’s method minimizes the increase in the within-cluster sum of
squared distances (variance). Essentially, it merges the two clusters whose combination leads to the least
increase in variance.
● How it works:
○ Ward's method merges the two clusters that cause the smallest increase in the total within-cluster
variance. This means the two clusters that, when merged, will lead to the smallest increase in the squared
distances of their points to the new centroid (mean) of the combined cluster.

After Merging:

● Once the clusters are merged, the new centroid for the combined cluster is recalculated, which is the mean
of all the points in the new merged cluster.
● The variance of the newly formed cluster is then recalculated, and this process repeats until all points are
merged into one cluster (or a specified number of clusters is reached).
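A minimal sketch of agglomerative clustering with SciPy, trying the linkage criteria listed above on toy data (the data and the choice of two clusters are illustrative):

```python
# Sketch: agglomerative clustering with the different linkage criteria.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # toy group around (0, 0)
               rng.normal(5, 0.5, (20, 2))])  # toy group around (5, 5)

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # full merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
    print(method, np.bincount(labels)[1:])            # resulting cluster sizes
```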
Divisive Clustering (Top-Down):
● Starts with all objects in one cluster.
● Subdivides the cluster into smaller and smaller pieces.
● It stops when each object forms its own cluster or when a given stopping condition is satisfied.

Advantages:
● No need to predefine the number of clusters.
● Can be visualized through a dendrogram.

Disadvantages:
● Computationally expensive (high time and memory complexity).
● Sensitive to noise and outliers.

Note: at every step, you compute the distances and merge the closest two clusters, and those two form a new cluster. Then, with complete linkage, you use the farthest point of the cluster you are comparing against; with single linkage you use the closest point; otherwise, with average linkage, you take the average over the pairs. The distance itself can be computed with the Euclidean method or with the other two metrics.

How to cut the dendrogram to identify the number of clusters :

a. Specify the Number of Clusters

● Decide beforehand how many clusters you want (e.g., k).
● Cut the tree at the height where there are exactly k branches (clusters) below the cut.
● This is a straightforward method if you have a target number of clusters in mind.

b. Use a Dissimilarity Threshold: (i.e., cut the tree once the branch length exceeds a certain level, because that height differs between one
cluster and the next; don't ask how exactly it is computed.)
● Define a maximum allowable distance (or dissimilarity) for merging clusters.
● Cut the tree at this threshold height.
● Any cluster merging above this height is not allowed, resulting in multiple clusters.

c. Highest Jump (Elbow Method in Dendrograms)

● After hierarchical clustering, examine the dendrogram for the largest vertical gap (or "jump") in the linkage distance.
● This jump indicates a significant dissimilarity between clusters. By cutting the dendrogram just before this jump, you can
determine a reasonable number of clusters.
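A small sketch of cutting strategies (a) and (b) above with SciPy's `fcluster` (the data, the value k = 3, and the threshold height are illustrative):

```python
# Sketch: two ways to cut a dendrogram built by linkage().
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in (0, 4, 8)])  # 3 toy groups
Z = linkage(X, method="ward")

# (a) Specify the number of clusters: cut so that exactly k = 3 branches remain.
labels_k = fcluster(Z, t=3, criterion="maxclust")

# (b) Use a dissimilarity threshold: merges above height 5.0 are not allowed.
labels_thresh = fcluster(Z, t=5.0, criterion="distance")

print(len(set(labels_k)), len(set(labels_thresh)))  # number of clusters found
```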

ELBOW METHOD: (don't focus on this too much)

● Calculated using the within-cluster sum of squares (WCSS).
● We plot the WCSS against the number of clusters; when the WCSS stops decreasing significantly and the curve levels off (the "elbow"),
that value of k can be taken as a good number of clusters.
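A minimal sketch of the elbow method using scikit-learn's `inertia_` attribute as the WCSS (the data and the tested range of k are illustrative):

```python
# Sketch: plot WCSS (inertia) against k and look for the "elbow".
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()  # the bend in the curve suggests a reasonable k (here around 4)
```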

Now, how do we evaluate the clustering?

Silhouette Analysis

● Measures how similar each point is to its own cluster compared to other clusters.
● The Silhouette Score ranges from −1 to +1:
○ +1: Point is well-clustered.
○ 0: Point is on the boundary between clusters.
○ −1: Point is likely misclassified.
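A minimal sketch of silhouette analysis with scikit-learn (illustrative data; `silhouette_score` is the average of the per-point scores):

```python
# Sketch: silhouette scores for a K-Means clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean score over all points, in [-1, +1]
print(silhouette_samples(X, labels)[:5])  # per-point scores for the first 5 points
```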

Gap Statistic:


1. Compute Within-Cluster Dispersion (W_k):
○ Measure how compact the clusters are in your data.
2. Generate Random Reference Data:
○ Create multiple random datasets with the same dimensions and range as your original data.
3. Calculate Dispersion for Random Data:
○ Cluster the random datasets for each k and compute their dispersion.
4. Calculate the Gap:
○ Gap(k) is the average log-dispersion of the random reference datasets minus the log-dispersion of your data; a larger gap indicates better clustering.
5. Choose Optimal k:
○ Select k where the gap is maximized or stabilizes significantly.

The gap statistic determines the optimal number of clusters (k) by comparing the within-cluster dispersion of your data to that of random
data. It identifies how well-separated the clusters are compared to a random baseline.
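The gap statistic is not built into scikit-learn, so below is a rough sketch of the steps above, assuming K-Means inertia as the dispersion measure and a uniform bounding-box reference (both are simplifying assumptions):

```python
# Sketch: simplified gap statistic following the steps listed above.
import numpy as np
from sklearn.cluster import KMeans

def dispersion(X, k):
    """Within-cluster dispersion W_k, here taken as K-Means inertia."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Reference data: uniform over the bounding box of X, same shape as X.
    ref_logs = [np.log(dispersion(rng.uniform(mins, maxs, size=X.shape), k))
                for _ in range(n_refs)]
    # Gap(k) = mean log-dispersion of reference data - log-dispersion of real data.
    return np.mean(ref_logs) - np.log(dispersion(X, k))

# Usage (illustrative): choose the k with the largest gap.
# best_k = max(range(1, 8), key=lambda k: gap_statistic(X, k))
```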

4. DBSCAN Clustering

Overview:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points into clusters based on density. It can identify noise and
clusters of arbitrary shapes.

Key Concepts:

● Epsilon (ε): Maximum distance between two points to be considered neighbors.


● MinPts: Minimum number of points required to form a dense region.

Algorithm Steps:
1. Identify core points. A point is considered a core point if it has at least MinPts points (including itself) within a given radius ε (epsilon).
These points form the core of a cluster.

(In the accompanying figure, the radius of the orange circle is ε; both ε and the minimum number of points a circle must cover are user-defined.)

2. Expand clusters by connecting core points and their neighbors.


3. Mark points that are not reachable as noise.
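A minimal sketch with scikit-learn's DBSCAN; the ε and MinPts values are illustrative, and the points labelled -1 are the noise points from step 3:

```python
# Sketch: DBSCAN on toy data with a few injected outliers.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
outliers = np.random.default_rng(0).uniform(-10, 10, size=(10, 2))
X = np.vstack([X, outliers])

db = DBSCAN(eps=0.5, min_samples=5)   # eps = ε, min_samples = MinPts
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int(np.sum(labels == -1)), "noise points")
```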
Advantages:
● Detects clusters of arbitrary shapes.
● Handles noise well.

Disadvantages:
● Choosing ε and MinPts can be tricky.
● Struggles with clusters of varying densities.

5. K-Nearest Neighbor (k-NN)

k-NN is a classification algorithm that assigns a point to a class based on the labels of its k nearest neighbors: the class of a new point is
determined by the majority class among those k neighbors.

Algorithm Steps (for Classification):

1. Define k: You choose a value for k, which determines how many neighbors to consider. For example, k=3 means that the 3 closest
points will decide the class of the new point.
2. Measure Distance: The algorithm calculates the distance between the new point and all the other points in the dataset. Common
distance metrics are Euclidean distance, Manhattan distance, etc.
3. Identify Neighbors: It selects the k nearest neighbors based on the distance.
4. Voting: It looks at the classes of the k nearest neighbors. The most frequent class among these neighbors will be the predicted class
for the new point
Advantages Disadvantages

● Simple and intuitive. ● Computationally expensive for


● Adapts to different datasets with large datasets.
proper . ● Sensitive to noisy or irrelevant
features.

➔ K-Nearest Neighbor - Breast Cancer Example Summary
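A minimal sketch of k-NN on scikit-learn's built-in breast-cancer dataset (k = 3, the train/test split, and the feature scaling are illustrative choices):

```python
# Sketch: k-NN classification on the breast-cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features so the Euclidean distance is not dominated by large-valued ones.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3)            # the 3 nearest neighbors vote
knn.fit(scaler.transform(X_train), y_train)

print(knn.score(scaler.transform(X_test), y_test))   # accuracy on the test set
```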

6. Spectral Clustering

7. Gaussian Mixture Models (GMMs) (the instructor said to study this one on your own)

Instead of assigning each point to a single cluster like K-Means, GMMs calculate the probability of each point belonging to a
cluster.

Uses the Expectation-Maximization (EM) algorithm to iteratively adjust cluster parameters (mean, variance, and weight).
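A minimal sketch with scikit-learn's GaussianMixture (the number of components and the data are illustrative); `predict_proba` returns the soft cluster memberships described above:

```python
# Sketch: a Gaussian Mixture Model fitted with the EM algorithm.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)  # parameters adjusted by EM
gmm.fit(X)

print(gmm.means_)                # per-component means
print(gmm.weights_)              # per-component mixing weights
print(gmm.predict_proba(X[:3]))  # soft membership probabilities for 3 points
```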

8. Clustering Validation

Why Validate?

To evaluate the quality of clustering results and compare algorithms.

