
Chapter 4: Clustering

Credits to Y. Chalgham, Yassine Ghouil, and Kmar Abessi

1. Introduction to Clustering

Clustering is an unsupervised machine learning technique used to group similar data objects into clusters. Unlike classification, clustering
does not rely on labeled data.

● Cluster analysis research has focused extensively on distance-based methods.


● What is cluster analysis?
○ Clustering is also called data segmentation:
■ Clustering is finding the borders between groups.
■ Segmentation is using these borders to form groups.
● Example: in a class, you observe people eating different snacks. Clustering helps you see that there are three main snacks: chocolate,
bananas, and apples. Segmentation is when you decide to create groups based on these snacks.
○ Clustering is a method of creating segments.
○ It can also be used to detect outliers.

2. Clustering Techniques

1. K-Means Clustering

Overview:

K-Means is a partition-based clustering method that divides data into clusters based on the nearest mean (centroid).
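A minimal sketch of fitting K-Means with scikit-learn, assuming synthetic blob data and k = 3 (both are illustrative choices, not from the chapter):

```python
# Sketch: K-Means on synthetic data (illustrative parameters only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy data: 300 points drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters must be chosen in advance (one of K-Means' main limitations).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the 3 centroids (cluster means)
print(kmeans.inertia_)          # total within-cluster sum of squared distances
```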

Advantages:
● Simple and fast.
● Works well for large datasets.
● Suitable for convex-shaped clusters.

Disadvantages:
● Requires specifying k (the number of clusters must be predefined).
● Sensitive to outliers.
● Assumes clusters are spherical and of similar size.
How to Cluster Categorical Data?

● Replace the mean of each cluster with the mode of its data.
● Use a new dissimilarity measure to deal with categorical objects.
● Use a frequency-based method to update the modes of the clusters.
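These three points correspond to the K-Modes approach. A rough NumPy sketch of the two key ingredients, the simple-matching dissimilarity and the frequency-based mode update (function names and the toy data are illustrative):

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return int(np.sum(a != b))

def update_mode(cluster_points):
    """Frequency-based update: keep the most frequent category per attribute."""
    mode = []
    for col in cluster_points.T:                      # iterate over attributes
        values, counts = np.unique(col, return_counts=True)
        mode.append(values[np.argmax(counts)])
    return np.array(mode)

# Toy categorical data: each row is an object, each column an attribute.
X = np.array([["red", "small"], ["red", "large"], ["blue", "small"]])
print(matching_dissimilarity(X[0], X[1]))  # 1 attribute differs
print(update_mode(X))                      # ['red' 'small']
```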


1. Clusters Are Not Always Gaussian-Shaped

● K-Means assumes spherical clusters (circular in 2D, spherical in higher dimensions) because it uses Euclidean distance to assign
points to the nearest cluster centroid.
● Reality: Clusters can take arbitrary shapes, such as elongated (e.g., elliptical), irregular, or even non-convex shapes. For example,
K-Means may fail to capture the structure of data like two moons or concentric circles.

2. Clusters Should Have Similar Variance

● K-Means is biased towards clusters with similar size and variance because it minimizes the total within-cluster variance.
● Reality: If clusters have different variances (some tight and others spread out), K-Means may misplace the cluster centroids, resulting
in incorrect assignments.

3. Clusters Can Be Highly Imbalanced in Size

● K-Means struggles with unevenly sized clusters because the algorithm attempts to minimize the sum of squared distances without
considering cluster size.
● Reality: In datasets with imbalanced cluster sizes, smaller clusters may be merged into larger ones, or larger clusters may dominate
the centroid placement.

K-Means works by minimizing the total within-cluster variance (also called inertia), which is the sum of squared distances of points to their
cluster's centroid. This optimization assumes clusters are:
1. Spherical in shape
2. Equal in size
3. Equal in variance

If these assumptions are not met, K-Means may produce poor results.

When to Avoid K-Means

● Data with non-spherical clusters.


● Clusters of different densities, sizes, or variances.
● Situations where you need robust handling of outliers (K-Means is sensitive to outliers).

-> K-Means performs poorly when clusters are elongated, overlapping, or unevenly distributed, leading to incorrect groupings,
so we need alternatives when these assumptions are violated.
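A small sketch of this failure mode on the two-moons data mentioned earlier, comparing K-Means with a density-based method (the noise level and the ε and MinPts values are illustrative):

```python
# Sketch: K-Means vs. DBSCAN on non-convex "two moons" data.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means assumes convex, roughly spherical clusters and tends to cut each moon in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the density of the data and can recover both moons.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means agreement with truth:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN  agreement with truth:", adjusted_rand_score(y_true, db_labels))
```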

2. Hierarchical Clustering

Hierarchical clustering builds a nested hierarchy of clusters. It doesn’t require specifying the number of clusters in advance.

Types:

1. Agglomerative (Bottom-Up): Start with each point as a cluster and merge.


2. Divisive (Top-Down): Start with all points in one cluster and split.

Agglomerative Clustering (Bottom-Up):


1. Compute distances between all points.
2. Merge the two closest clusters.
3. Update the distance matrix based on a linkage criterion:
a. Single Linkage: the distance between two clusters is the shortest distance between any pair of points, one from each cluster.
b. Complete Linkage: the distance between two clusters is the longest distance between any pair of points, one from each cluster.
c. Average Linkage: the distance between two clusters is the average distance over all pairs of points, one from each cluster.
d. Centroid Linkage: the distance is calculated between the centroids (means) of the two clusters; the clusters with the closest centroids are merged.
e. Ward's Linkage: (the instructor said to study this one on your own)


● Ward's linkage is a hierarchical clustering criterion that aims to minimize the variance (squared differences) within clusters as they are merged
during the clustering process. It is one of the most commonly used methods in hierarchical clustering because it tends to create
compact and well-separated clusters.
● When choosing which clusters to merge, Ward’s method minimizes the increase in the within-cluster sum of
squared distances (variance). Essentially, it merges the two clusters whose combination leads to the least
increase in variance.
● How it works:
○ Ward's method merges the two clusters that cause the smallest increase in the total within-cluster
variance. This means the two clusters that, when merged, will lead to the smallest increase in the squared
distances of their points to the new centroid (mean) of the combined cluster.

After Merging:

● Once the clusters are merged, the new centroid for the combined cluster is recalculated, which is the mean
of all the points in the new merged cluster.
● The variance of the newly formed cluster is then recalculated, and this process repeats until all points are
merged into one cluster (or a specified number of clusters is reached).
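A minimal sketch of agglomerative clustering with SciPy, trying the linkage criteria listed above on toy data (the data and the choice of two clusters are illustrative):

```python
# Sketch: agglomerative clustering with the different linkage criteria.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # toy group around (0, 0)
               rng.normal(5, 0.5, (20, 2))])  # toy group around (5, 5)

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # full merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
    print(method, np.bincount(labels)[1:])            # resulting cluster sizes
```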
Divisive Clustering (Top-Down):
● Starts with all objects in one cluster.
● Subdivides the cluster into smaller and smaller pieces.
● It stops when each object forms its own cluster or when a given stopping condition is satisfied.

Advantages:
● No need to predefine the number of clusters.
● Can be visualized through a dendrogram.

Disadvantages:
● Computationally expensive (high time and memory complexity).
● Sensitive to noise and outliers.

Note: at every step, you compute the distances and merge the closest two clusters, and those two form a new cluster. Then, with complete linkage, you use the farthest point of the cluster you are comparing against; with single linkage you use the closest point; otherwise, with average linkage, you take the average over the pairs. The distance itself can be computed with the Euclidean method or with the other two metrics.

How to cut the dendrogram to identify the number of clusters :

a. Specify the Number of Clusters

● Decide beforehand how many clusters you want (e.g., k).
● Cut the tree at the height where there are exactly k branches (clusters) below the cut.
● This is a straightforward method if you have a target number of clusters in mind.

b. Use a Dissimilarity Threshold: (i.e., cut the tree once the branch length exceeds a certain level, because that height differs between one
cluster and the next; don't ask how exactly it is computed.)
● Define a maximum allowable distance (or dissimilarity) for merging clusters.
● Cut the tree at this threshold height.
● Any cluster merging above this height is not allowed, resulting in multiple clusters.

c. Highest Jump (Elbow Method in Dendrograms)

● After hierarchical clustering, examine the dendrogram for the largest vertical gap (or "jump") in the linkage distance.
● This jump indicates a significant dissimilarity between clusters. By cutting the dendrogram just before this jump, you can
determine a reasonable number of clusters.
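A small sketch of cutting strategies (a) and (b) above with SciPy's `fcluster` (the data, the value k = 3, and the threshold height are illustrative):

```python
# Sketch: two ways to cut a dendrogram built by linkage().
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in (0, 4, 8)])  # 3 toy groups
Z = linkage(X, method="ward")

# (a) Specify the number of clusters: cut so that exactly k = 3 branches remain.
labels_k = fcluster(Z, t=3, criterion="maxclust")

# (b) Use a dissimilarity threshold: merges above height 5.0 are not allowed.
labels_thresh = fcluster(Z, t=5.0, criterion="distance")

print(len(set(labels_k)), len(set(labels_thresh)))  # number of clusters found
```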

ELBOW METHOD: (don't focus on this too much)

● Calculated using the within-cluster sum of squares (WCSS).
● We plot the WCSS against the number of clusters; when the WCSS stops decreasing significantly and the curve levels off (the "elbow"),
that value of k can be taken as a good number of clusters.
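A minimal sketch of the elbow method using scikit-learn's `inertia_` attribute as the WCSS (the data and the tested range of k are illustrative):

```python
# Sketch: plot WCSS (inertia) against k and look for the "elbow".
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()  # the bend in the curve suggests a reasonable k (here around 4)
```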

Now, how do we evaluate the clustering?

Silhouette Analysis

● Measures how similar each point is to its own cluster compared to other clusters.
● The Silhouette Score ranges from −1 to +1:
○ +1: Point is well-clustered.
○ 0: Point is on the boundary between clusters.
○ −1: Point is likely misclassified.
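A minimal sketch of silhouette analysis with scikit-learn (illustrative data; `silhouette_score` is the average of the per-point scores):

```python
# Sketch: silhouette scores for a K-Means clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean score over all points, in [-1, +1]
print(silhouette_samples(X, labels)[:5])  # per-point scores for the first 5 points
```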

Gap Statistic:


1. Compute Within-Cluster Dispersion (W_k):
○ Measure how compact the clusters are in your data.
2. Generate Random Reference Data:
○ Create multiple random datasets with the same dimensions and range as your original data.
3. Calculate Dispersion for Random Data:
○ Cluster the random datasets for each k and compute their dispersion.
4. Calculate the Gap:
○ Gap(k) is the average log-dispersion of the random reference datasets minus the log-dispersion of your data; a larger gap indicates better clustering.
5. Choose Optimal k:
○ Select k where the gap is maximized or stabilizes significantly.

The gap statistic determines the optimal number of clusters (k) by comparing the within-cluster dispersion of your data to that of random
data. It identifies how well-separated the clusters are compared to a random baseline.
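The gap statistic is not built into scikit-learn, so below is a rough sketch of the steps above, assuming K-Means inertia as the dispersion measure and a uniform bounding-box reference (both are simplifying assumptions):

```python
# Sketch: simplified gap statistic following the steps listed above.
import numpy as np
from sklearn.cluster import KMeans

def dispersion(X, k):
    """Within-cluster dispersion W_k, here taken as K-Means inertia."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Reference data: uniform over the bounding box of X, same shape as X.
    ref_logs = [np.log(dispersion(rng.uniform(mins, maxs, size=X.shape), k))
                for _ in range(n_refs)]
    # Gap(k) = mean log-dispersion of reference data - log-dispersion of real data.
    return np.mean(ref_logs) - np.log(dispersion(X, k))

# Usage (illustrative): choose the k with the largest gap.
# best_k = max(range(1, 8), key=lambda k: gap_statistic(X, k))
```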

4. DBSCAN Clustering

Overview:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points into clusters based on density. It can identify noise and
clusters of arbitrary shapes.

Key Concepts:

● Epsilon (ε): Maximum distance between two points to be considered neighbors.


● MinPts: Minimum number of points required to form a dense region.

Algorithm Steps:
1. Identify core points. A point is considered a core point if it has at least MinPts points (including itself) within a given radius ε (epsilon).
These points form the core of a cluster.

(In the accompanying figure, the radius of the orange circle is ε; both ε and the minimum number of points a circle must cover are user-defined.)

2. Expand clusters by connecting core points and their neighbors.


3. Mark points that are not reachable as noise.
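A minimal sketch with scikit-learn's DBSCAN; the ε and MinPts values are illustrative, and the points labelled -1 are the noise points from step 3:

```python
# Sketch: DBSCAN on toy data with a few injected outliers.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
outliers = np.random.default_rng(0).uniform(-10, 10, size=(10, 2))
X = np.vstack([X, outliers])

db = DBSCAN(eps=0.5, min_samples=5)   # eps = ε, min_samples = MinPts
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int(np.sum(labels == -1)), "noise points")
```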
Advantages:
● Detects clusters of arbitrary shapes.
● Handles noise well.

Disadvantages:
● Choosing ε and MinPts can be tricky.
● Struggles with clusters of varying densities.

5. K-Nearest Neighbor (k-NN)

k-NN is a classification algorithm that assigns a point to a class based on the labels of its k nearest neighbors: the class of a new point is
determined by the majority class among those k neighbors.

Algorithm Steps (for Classification):

1. Define k: You choose a value for k, which determines how many neighbors to consider. For example, k=3 means that the 3 closest
points will decide the class of the new point.
2. Measure Distance: The algorithm calculates the distance between the new point and all the other points in the dataset. Common
distance metrics are Euclidean distance, Manhattan distance, etc.
3. Identify Neighbors: It selects the k nearest neighbors based on the distance.
4. Voting: It looks at the classes of the k nearest neighbors. The most frequent class among these neighbors will be the predicted class
for the new point
Advantages Disadvantages

● Simple and intuitive. ● Computationally expensive for


● Adapts to different datasets with large datasets.
proper . ● Sensitive to noisy or irrelevant
features.

➔ K-Nearest Neighbor - Breast Cancer Example Summary
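A minimal sketch of k-NN on scikit-learn's built-in breast-cancer dataset (k = 3, the train/test split, and the feature scaling are illustrative choices):

```python
# Sketch: k-NN classification on the breast-cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features so the Euclidean distance is not dominated by large-valued ones.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3)            # the 3 nearest neighbors vote
knn.fit(scaler.transform(X_train), y_train)

print(knn.score(scaler.transform(X_test), y_test))   # accuracy on the test set
```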

6. Spectral Clustering

7. Gaussian Mixture Models (GMMs) (the instructor said to study this one on your own)

Instead of assigning each point to a single cluster like K-Means, GMMs calculate the probability of each point belonging to a
cluster.

Uses the Expectation-Maximization (EM) algorithm to iteratively adjust cluster parameters (mean, variance, and weight).
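A minimal sketch with scikit-learn's GaussianMixture (the number of components and the data are illustrative); `predict_proba` returns the soft cluster memberships described above:

```python
# Sketch: a Gaussian Mixture Model fitted with the EM algorithm.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)  # parameters adjusted by EM
gmm.fit(X)

print(gmm.means_)                # per-component means
print(gmm.weights_)              # per-component mixing weights
print(gmm.predict_proba(X[:3]))  # soft membership probabilities for 3 points
```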

8. Clustering Validation

Why Validate?

To evaluate the quality of clustering results and compare algorithms.

