Unit-IV - Unsupervised Learning
BTCSE023602
UNIT-IV
UNSUPERVISED LEARNING
CONTENTS
• Clustering Algorithms
• Dimensionality Reduction
• Anomaly Detection
CLUSTERING
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset.
CLUSTERING
• It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, or behavior, and divides the data points according to the presence or absence of those patterns.
CLUSTERING
• Example: Let's understand the clustering technique with the real-world example of a shopping mall: when we visit any mall, we can observe that items with similar usage are grouped together.
CLUSTERING
• The below diagram explains the working of the clustering algorithm. We can see that different fruits are divided into several groups with similar properties.
CLUSTERING
• HARD CLUSTERING
• In this type of clustering, each data point is assigned completely to one cluster or another; there is no partial membership.
• For example, let's say there are 4 data points and we have to cluster them into 2 clusters. Each point will belong to exactly one of the two clusters.
CLUSTERING
• SOFT CLUSTERING
• In this type of clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that point belonging to each cluster is evaluated.
• For example, let's say there are 4 data points and we have to cluster them into 2 clusters.
• So we will evaluate, for every data point, a probability of it belonging to each of the two clusters. This probability is calculated for all data points (see the sketch below).
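• A minimal sketch of soft clustering using a Gaussian mixture model from scikit-learn; the four sample values and the choice of two components are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Four 1-feature data points (assumed values for illustration)
X = np.array([[1.0], [1.2], [8.0], [8.3]])

# Fit a mixture of 2 Gaussian components (the 2 "soft" clusters)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Each row gives the probability of that data point belonging to each cluster
print(gmm.predict_proba(X))
```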
CLUSTERING
• TYPES OF CLUSTERING
• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
CLUSTERING
• PARTITIONING CLUSTERING
• Partitioning clustering divides the data into a chosen number of non-hierarchical groups; the K-means algorithm covered later in this unit is the most common example.
CLUSTERING
• DENSITY-BASED CLUSTERING
• Density-based clustering connects highly dense regions of points into clusters and can find arbitrarily shaped clusters; DBSCAN, covered later in this unit, is the typical example.
CLUSTERING
• DISTRIBUTION MODEL-BASED CLUSTERING
• In distribution model-based clustering, the data is divided according to the probability of each point belonging to a particular distribution, most commonly a Gaussian distribution.
CLUSTERING
• HIERARCHICAL CLUSTERING
• Hierarchical clustering builds a tree of clusters (a dendrogram) and does not require the number of clusters to be specified in advance; it is covered in detail later in this unit.
CLUSTERING
• FUZZY CLUSTERING
• Fuzzy clustering is a soft method in which a data point may belong to more than one cluster, with a degree of membership for each.
K-MEAN CLUSTERING ALGORITHM
• INTRODUCTION
• K-means groups the unlabeled dataset into K different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.
• The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
K-MEAN CLUSTERING ALGORITHM
• INTRODUCTION
• It assigns each data point to its closest k-center; the data points that are near a particular k-center form a cluster.
K-MEAN CLUSTERING ALGORITHM
• WORKING OF K-MEANS CLUSTERING ALGORITHM
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points as centroids. (They may or may not be taken from the input dataset.)
• Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
• Step-4: Calculate the new centroid of each cluster as the mean of the data points assigned to it.
• Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of each cluster.
• Step-6: If any reassignment occurred, go back to Step-4; otherwise the clusters are final.
K-MEANS EXAMPLES
K-MEAN CLUSTERING ALGORITHM
• Data points: 2, 4, 10, 12, 3, 20, 30, 11, 25 with K = 2.
• Iteration 1: C1 = {2, 4, 3} → new centroid = (2+4+3)/3 = 9/3 = 3; C2 = {10, 12, 20, 30, 11, 25} → new centroid = (10+12+20+30+11+25)/6 = 108/6 = 18
• Iteration 2: C1 = {2, 4, 10, 3} → new centroid = (2+4+10+3)/4 = 19/4 = 4.75; C2 = {12, 20, 30, 11, 25} → new centroid = (12+20+30+11+25)/5 = 98/5 = 19.6
• Iteration 3: C1 = {2, 4, 10, 11, 12, 3} → new centroid = (2+4+10+11+12+3)/6 = 42/6 = 7; C2 = {20, 30, 25} → new centroid = (20+30+25)/3 = 75/3 = 25
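• A minimal sketch of the K-means steps applied to the 1-D data above; the initial centroids (2 and 10) are an assumption, since the slides do not state them, but the algorithm still converges to the centroids 7 and 25 found in the worked example.

```python
import numpy as np

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
centroids = np.array([2.0, 10.0])        # assumed initial centroids, K = 2

for _ in range(10):                       # repeat until the centroids stop moving
    # Step: assign each point to its closest centroid
    labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
    # Step: recompute each centroid as the mean of the points assigned to it
    new_centroids = np.array([data[labels == k].mean() for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)                          # [ 7. 25.]
```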
K-MEANS EXAMPLES TO BE SOLVED BY STUDENTS
K-MEAN CLUSTERING ALGORITHM
• EXAMPLE 1
• Data points: 3, 5, 11, 13, 4, 21, 31, 11, 25 with K = 2 and initial centroids M1 = 5, M2 = 11.
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 1)
• Initial centroids: M1 = 5, M2 = 11

  Data point | Distance to M1 | Distance to M2 | Cluster
  3          | 2              | 8              | C1
  5          | 0              | 6              | C1
  11         | 6              | 0              | C2
  13         | 8              | 2              | C2
  4          | 1              | 7              | C1
  21         | 16             | 10             | C2
  31         | 26             | 20             | C2
  11         | 6              | 0              | C2
  25         | 20             | 14             | C2

• Clusters: C1 = {3, 5, 4}, C2 = {11, 13, 21, 31, 11, 25}
• New centroids: M1 = (3+5+4)/3 = 12/3 = 4, M2 = (11+13+21+31+11+25)/6 = 112/6 = 18.67
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 2)
• Current centroids: M1 = 4, M2 = 18.67

  Data point | Distance to M1 | Distance to M2 | Cluster | New cluster
  3          | 1              | 15.67          | C1      | C1
  5          | 1              | 13.67          | C1      | C1
  11         | 7              | 7.67           | C2      | C1
  13         | 9              | 5.67           | C2      | C2
  4          | 0              | 14.67          | C1      | C1
  21         | 17             | 2.33           | C2      | C2
  31         | 27             | 12.33          | C2      | C2
  11         | 7              | 7.67           | C2      | C1
  25         | 21             | 6.33           | C2      | C2

• New clusters: C1 = {3, 5, 11, 4, 11}, C2 = {13, 21, 31, 25}
• New centroids: M1 = (3+5+11+4+11)/5 = 34/5 = 6.8, M2 = (13+21+31+25)/4 = 90/4 = 22.5
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 3)
• Current centroids: M1 = 6.8, M2 = 22.5

  Data point | Distance to M1 | Distance to M2 | Cluster | New cluster
  3          | 3.8            | 19.5           | C1      | C1
  5          | 1.8            | 17.5           | C1      | C1
  11         | 4.2            | 11.5           | C1      | C1
  13         | 6.2            | 9.5            | C2      | C1
  4          | 2.8            | 18.5           | C1      | C1
  21         | 14.2           | 1.5            | C2      | C2
  31         | 24.2           | 8.5            | C2      | C2
  11         | 4.2            | 11.5           | C1      | C1
  25         | 18.2           | 2.5            | C2      | C2

• New clusters: C1 = {3, 5, 11, 13, 4, 11}, C2 = {21, 31, 25}
• New centroids: M1 = (3+5+11+13+4+11)/6 = 47/6 = 7.83, M2 = (21+31+25)/3 = 77/3 = 25.67
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 4)
• Current centroids: M1 = 7.83, M2 = 25.67

  Data point | Distance to M1 | Distance to M2 | Cluster | New cluster
  3          | 4.83           | 22.67          | C1      | C1
  5          | 2.83           | 20.67          | C1      | C1
  11         | 3.17           | 14.67          | C1      | C1
  13         | 5.17           | 12.67          | C1      | C1
  4          | 3.83           | 21.67          | C1      | C1
  21         | 13.17          | 4.67           | C2      | C2
  31         | 23.17          | 5.33           | C2      | C2
  11         | 3.17           | 14.67          | C1      | C1
  25         | 17.17          | 0.67           | C2      | C2

• The clusters are unchanged: C1 = {3, 5, 11, 13, 4, 11}, C2 = {21, 31, 25}, so the algorithm has converged with final centroids M1 = 7.83 and M2 = 25.67.
K-MEAN CLUSTERING ALGORITHM
• EXAMPLE 2
• Data points: 4, 5, 12, 11, 3, 20, 28, 10, 24 with K = 2 and initial centroids M1 = 4, M2 = 10.
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 1)
• Initial centroids: M1 = 4, M2 = 10

  Data point | Distance to M1 | Distance to M2 | Cluster
  4          | 0              | 6              | C1
  5          | 1              | 5              | C1
  12         | 8              | 2              | C2
  11         | 7              | 1              | C2
  3          | 1              | 7              | C1
  20         | 16             | 10             | C2
  28         | 24             | 18             | C2
  10         | 6              | 0              | C2
  24         | 20             | 14             | C2

• Clusters: C1 = {4, 5, 3}, C2 = {12, 11, 20, 28, 10, 24}
• New centroids: M1 = (4+5+3)/3 = 12/3 = 4, M2 = (12+11+20+28+10+24)/6 = 105/6 = 17.5
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 2)
• Current centroids: M1 = 4, M2 = 17.5

  Data point | Distance to M1 | Distance to M2 | Cluster | New cluster
  4          | 0              | 13.5           | C1      | C1
  5          | 1              | 12.5           | C1      | C1
  12         | 8              | 5.5            | C2      | C2
  11         | 7              | 6.5            | C2      | C2
  3          | 1              | 14.5           | C1      | C1
  20         | 16             | 2.5            | C2      | C2
  28         | 24             | 10.5           | C2      | C2
  10         | 6              | 7.5            | C2      | C1
  24         | 20             | 5              | C2      | C2

• New clusters: C1 = {4, 5, 3, 10}, C2 = {12, 11, 20, 28, 24}
• New centroids: M1 = (4+5+3+10)/4 = 22/4 = 5.5, M2 = (12+11+20+28+24)/5 = 95/5 = 19
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 3)
• Current centroids: M1 = 5.5, M2 = 19

  Data point | Distance to M1 | Distance to M2 | Cluster | New cluster
  4          | 1.5            | 15             | C1      | C1
  5          | 0.5            | 14             | C1      | C1
  12         | 6.5            | 7              | C2      | C1
  11         | 5.5            | 8              | C2      | C1
  3          | 2.5            | 16             | C1      | C1
  20         | 14.5           | 1              | C2      | C2
  28         | 22.5           | 9              | C2      | C2
  10         | 4.5            | 9              | C1      | C1
  24         | 18.5           | 5              | C2      | C2

• New clusters: C1 = {4, 5, 12, 11, 3, 10}, C2 = {20, 28, 24}
• New centroids: M1 = (4+5+12+11+3+10)/6 = 45/6 = 7.5, M2 = (20+28+24)/3 = 72/3 = 24
K-MEAN CLUSTERING ALGORITHM
• Solution (Iteration 4)
• Current centroids: M1 = 7.5, M2 = 24

  Data point | Distance to M1 | Distance to M2 | Cluster | New cluster
  4          | 3.5            | 20             | C1      | C1
  5          | 2.5            | 19             | C1      | C1
  12         | 4.5            | 12             | C1      | C1
  11         | 3.5            | 13             | C1      | C1
  3          | 4.5            | 21             | C1      | C1
  20         | 12.5           | 4              | C2      | C2
  28         | 20.5           | 4              | C2      | C2
  10         | 2.5            | 14             | C1      | C1
  24         | 16.5           | 0              | C2      | C2

• The clusters are unchanged: C1 = {4, 5, 12, 11, 3, 10}, C2 = {20, 28, 24}, so the algorithm has converged with final centroids M1 = 7.5 and M2 = 24.
HIERARCHICAL CLUSTERING ALGORITHM (HCA)
HIERARCHICAL CLUSTERING ALGORITHM
• INTRODUCTION
• Hierarchical clustering groups the unlabeled data into a tree-shaped hierarchy of clusters, known as a dendrogram, and does not require the number of clusters to be decided in advance.
AGGLOMERATIVE HIERARCHICAL CLUSTERING
HIERARCHICAL CLUSTERING ALGORITHM
• To group the datasets into clusters, it follows the bottom-up approach: it starts by treating each data point as a single cluster and then repeatedly merges the closest pair of clusters.
• It does this until all the clusters are merged into a single cluster that contains all the data points.
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA
• Step-1: Create each data point as a single cluster. If there are N data points, the number of clusters will also be N.
• Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA
• Step-3: Again, take the two closest clusters and merge them
together to form one cluster. There will be N-2 clusters.
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA
• Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the below images:
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA
• Step-5: Once all the clusters are combined into one big
cluster, develop the dendrogram to divide the clusters as per
the problem.
AGGLOMERATIVE HIERARCHICAL CLUSTERING EXAMPLE
HIERARCHICAL CLUSTERING ALGORITHM
• EXAMPLE
• 18, 22, 25, 27, 42, 43
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-1: Create a matrix of the given data points and find the minimum distance.

           18  22  25  27  42  43
  18        0   4   7   9  24  25
  22        4   0   3   5  20  21
  25        7   3   0   2  17  18
  27        9   5   2   0  15  16
  42       24  20  17  15   0   1
  43       25  21  18  16   1   0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-2: Merge the minimum-distance data points into a single cluster. The minimum distance in the matrix above is 1, between 42 and 43.
• Cluster 1 – (42,43)
• The reduced matrix after merging (42,43):

           18  22  25  27  (42,43)
  18        0   4   7   9   24
  22        4   0   3   5   20
  25        7   3   0   2   17
  27        9   5   2   0   15
  (42,43)  24  20  17  15    0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-3: Repeat Step 2 and merge the minimum-distance data points into a single cluster. In the matrix above, the minimum distance is 2, between 25 and 27.
• Cluster 2 – (25,27)
• The reduced matrix after merging (25,27):

           18  22  (25,27)  (42,43)
  18        0   4    7       24
  22        4   0    3       20
  (25,27)   7   3    0       17
  (42,43)  24  20   17        0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-4: Repeat Step 2 and merge the minimum-distance data points into a single cluster. In the matrix above, the minimum distance is 3, between 22 and (25,27).
• Cluster 3 – (25,27,22)
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-5: Repeat Step 2 and merge the minimum-distance data points into a single cluster. The reduced matrix after merging (25,27) with 22:

               18  (25,27,22)  (42,43)
  18            0      4         24
  (25,27,22)    4      0         20
  (42,43)      24     20          0

• The minimum distance is now 4, so 18 is merged with (25,27,22).
• Cluster 4 – (25,27,22,18)
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-5 (continued): The reduced matrix after merging (25,27,22) with 18:

                  (25,27,22,18)  (42,43)
  (25,27,22,18)        0           24
  (42,43)             24            0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-6: Merge the two remaining clusters into the final cluster (25,27,22,18,42,43) and convert the sequence of merges into a dendrogram.
HIERARCHICAL CLUSTERING ALGORITHM
• DENDROGRAM - [(42,43),(((25,27),22),18)]
• The dendrogram records the merges in order: Cluster 1 (42,43), Cluster 2 (25,27), Cluster 3 (25,27,22), Cluster 4 (25,27,22,18), and finally the single cluster containing all the points.
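• A minimal sketch (assuming SciPy is installed) that reproduces the same merge order for the data points 18, 22, 25, 27, 42, 43 using single-linkage agglomerative clustering; the resulting tree has the structure [(42,43),(((25,27),22),18)] shown above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# The six 1-D data points from the worked example
points = np.array([[18], [22], [25], [27], [42], [43]], dtype=float)

# 'single' linkage repeatedly merges the two closest clusters,
# mirroring the step-by-step matrix reduction above
Z = linkage(points, method='single')
print(Z)        # each row: cluster i, cluster j, merge distance, new cluster size

# dendrogram(Z) can then be drawn with matplotlib to visualise the tree
```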
AGGLOMERATIVE HIERARCHICAL CLUSTERING EXAMPLE FOR STUDENTS
HIERARCHICAL CLUSTERING ALGORITHM
• EXAMPLE
• 20,24,27,44,29,45
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-1: Create a matrix of the given data points and find the minimum distance.

           20  24  27  29  44  45
  20        0   4   7   9  24  25
  24        4   0   3   5  20  21
  27        7   3   0   2  17  18
  29        9   5   2   0  15  16
  44       24  20  17  15   0   1
  45       25  21  18  16   1   0
DBSCAN ALGORITHM
DBSCAN ALGORITHM
• INTRODUCTION
• Core Points: These are points that have at least MinPts neighbors within the ε distance; they form the interior of a cluster.
• Border Points: These are points that are within the ε distance of a core point but don't have MinPts neighbors themselves.
• Noise Points: These are points that are neither core points nor border points. They're not close enough to any cluster to be included.
DBSCAN ALGORITHM
• PARAMETERS IN DBSCAN
• eps (ε): the radius of the neighborhood around a point; two points are considered neighbors if the distance between them is at most ε.
• MinPts: the minimum number of points required within the ε radius for a point to be treated as a core point.
DBSCAN ALGORITHM
• WORKING OF DBSCAN
• STEP-1: Choose the Parameters
  – Select suitable values for ε and MinPts.
• STEP-2: Pick a Starting Point
  – The algorithm selects an unvisited point from the dataset as the starting point.
DBSCAN ALGORITHM
• STEP-3: Examine the Neighborhood
– It retrieves all points within the ε distance of the starting point.
DBSCAN ALGORITHM
• STEP-4: Expand the Cluster
  – If the point has at least MinPts neighbors within ε, it is a core point and a new cluster is started; all the neighbors of the core point are added to the cluster.
  – If it's not a core point, it's marked as a border point (or noise), and the expansion stops.
DBSCAN ALGORITHM
• STEP-5: Repeat the Process
– The algorithm moves to the next unvisited point in the dataset.
– Steps 3-4 are repeated until all points have been visited.
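• A minimal sketch of DBSCAN with scikit-learn; the sample points and the eps / min_samples values are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],    # first dense group
              [8, 7], [8, 8], [7, 8],    # second dense group
              [25, 80]])                 # an isolated point

db = DBSCAN(eps=2, min_samples=2).fit(X)

# Labels for each point: 0 and 1 are the two clusters, -1 marks noise
print(db.labels_)                        # [ 0  0  0  1  1  1 -1]
```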
DBSCAN ALGORITHM EXAMPLE
DBSCAN ALGORITHM
• STEP-1: Create a proximity matrix to calculate the distance between each pair of data points.
• STEP-2: For every data point, find the neighboring points that lie within the ε distance.
• STEP-3: Identify CORE, BORDER and NOISE data points.
DBSCAN ALGORITHM EXAMPLE TO SOLVE
DBSCAN ALGORITHM
• Apply the DBSCAN algorithm to the given data points.

  POINT | X | Y
  P1    | 4 | 8
  P2    | 5 | 7
  P3    | 6 | 6
  P4    | 7 | 5
  P5    | 8 | 3
  P6    | 7 | 3
  P7    | 8 | 3
  P8    | 9 | 5
  P9    | 4 | 4
  P10   | 3 | 7
  P11   | 4 | 6
  P12   | 3 | 5
DIMENSIONALITY REDUCTION
DIMENSIONALITY REDUCTION
• INTRODUCTION
• Dimensionality reduction is the process of reducing the number of input features in a dataset while retaining as much of the important information as possible.
• FEATURE SELECTION: choosing a subset of the original features and discarding the rest.
• FEATURE EXTRACTION: deriving a smaller set of new features from the original ones, as PCA and LDA do.
PRINCIPAL COMPONENT ANALYSIS (PCA)
• INTRODUCTION
• PCA is an unsupervised feature-extraction technique that converts a set of correlated features into a smaller set of uncorrelated features called principal components, ordered by the amount of variance they capture.
PRINCIPAL COMPONENT ANALYSIS (PCA)
• COMMON TERMS IN PCA
• Calculating the Eigenvalues and Eigenvectors: Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance matrix Z. The eigenvectors of the covariance matrix are the directions of the axes that carry the most information, and the corresponding eigenvalues measure how much variance lies along each of those directions.
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA
• Sorting the Eigenvectors: In this step, we take all the eigenvalues and sort them in decreasing order, i.e. from largest to smallest, and order the corresponding eigenvectors in the same way.
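• A minimal sketch of these PCA steps using NumPy; the small two-feature dataset is an illustrative assumption.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centred = X - X.mean(axis=0)            # centre the data
Z = np.cov(X_centred, rowvar=False)       # covariance matrix Z

# Eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(Z)

# Sort the eigenvectors by decreasing eigenvalue
order = np.argsort(eig_vals)[::-1]
eig_vecs = eig_vecs[:, order]

# Project the data onto the first principal component
X_reduced = X_centred @ eig_vecs[:, :1]
print(X_reduced)
```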
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA
• Calculating the New Features or Principal Components: The sorted eigenvectors are used to transform the original data; each resulting feature is a principal component.
• Removing Less Important Features: Finally, only the principal components that carry significant variance are kept and the remaining ones are dropped, which reduces the dimensionality of the dataset.
PRINCIPAL COMPONENT ANALYSIS (PCA) SOLVED EXAMPLES
PRINCIPAL COMPONENT ANALYSIS (PCA)
• 2D TRANSFORMATIONS [2 DIMENSIONAL]
• The name 2D is given because there are two axes, X and Y.
• In 2D, the object can be seen in only one view (the front view) on the X-Y plane.
PRINCIPAL COMPONENT ANALYSIS (PCA)
• TYPES OF 2D TRANSFORMATION
• 1. TRANSLATION
• 2. SCALING
PRINCIPAL COMPONENT ANALYSIS (PCA)
• 1. TRANSLATION [T]
• It is the process of changing the position of an object along a straight-line path from one coordinate location to another.
• A point P(x, y) is moved to P'(x', y') using:
  x' = x + tx
  y' = y + ty
• The translation distance pair (tx, ty) is called the translation vector.
• In matrix form: P' = P + T, where P = [x, y], P' = [x', y'] and T = [tx, ty].
PRINCIPAL COMPONENT ANALYSIS (PCA)
• EXAMPLE: Translate the object by T = (tx, ty) = (3, 4).
  B' = B + T = [7, 10] + [3, 4] = [10, 14]
  C' = C + T = [10, 2] + [3, 4] = [13, 6]
PRINCIPAL COMPONENT ANALYSIS (PCA)
• OUTPUT: The left plot shows the original object with vertices A, B, C; the right plot shows the translated object with vertices A', B', C' on the same X-Y axes.
PRINCIPAL COMPONENT ANALYSIS (PCA)
• 2. SCALING [S]
• This transformation changes the size of an object.
  x' = x * Sx
  y' = y * Sy
• In matrix form: P' = P x S, where P = [x, y], P' = [x', y'] and S = [[Sx, 0], [0, Sy]].
PRINCIPAL COMPONENT ANALYSIS (PCA)
• If both Sx and Sy are 1, the size of the object remains the same.
• EXAMPLE: Scale the object A with Sx = Sy = 2, i.e. S = [[2, 0], [0, 2]]:
  A' = A x S = [[2, 5], [7, 10], [10, 2]] x [[2, 0], [0, 2]] = [[4, 10], [14, 20], [20, 4]]
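• A minimal sketch (using NumPy) reproducing the translation and scaling calculations above; the vertex coordinates are the ones used in the worked examples.

```python
import numpy as np

# Translation: P' = P + T with T = (tx, ty) = (3, 4)
T = np.array([3, 4])
B = np.array([7, 10])
C = np.array([10, 2])
print(B + T)        # [10 14]  -> B'
print(C + T)        # [13  6]  -> C'

# Scaling: P' = P x S with Sx = Sy = 2
S = np.array([[2, 0],
              [0, 2]])
A = np.array([[2, 5],
              [7, 10],
              [10, 2]])
print(A @ S)        # [[ 4 10] [14 20] [20  4]]  -> A'
```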
PRINCIPAL COMPONENT ANALYSIS (PCA) EXAMPLE TO SOLVE
LINEAR DISCRIMINANT ANALYSIS (LDA)
LINEAR DISCRIMINANT ANALYSIS (LDA)
• LDA is a dimensionality reduction technique that projects the data onto a lower-dimensional space chosen to maximize the separation between known classes; unlike PCA, it therefore makes use of class labels.
• EXAMPLE
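• A minimal sketch of LDA with scikit-learn; the toy data points and class labels are illustrative assumptions (LDA needs class labels to find its projection).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4],      # class 0
              [9, 10], [6, 8], [9, 5], [8, 7], [10, 8]])   # class 1
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Project the 2-feature data onto a single discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
X_projected = lda.fit_transform(X, y)
print(X_projected)
```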
ANOMALY DETECTION
• Anomaly detection is the task of identifying data points that deviate significantly from the majority of the data.
ISOLATION FOREST
ISOLATION FOREST
• Isolation Forest operates under the principle that anomalies are rare and distinct, which makes them easier to isolate from the rest of the data.
• It builds an ensemble of isolation trees. These trees are similar to traditional decision trees, but with a key difference: they are not built to classify data points into specific categories. Instead, each tree repeatedly picks a random feature and a random split value to partition the data.
  – This process continues recursively until the data point reaches a leaf node, which simply represents the isolated data point.
• Anomalies tend to be isolated after only a few splits, so they end up with short average path lengths across the trees.
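• A minimal sketch of Isolation Forest with scikit-learn; the data values and the contamination level are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[10], [12], [11], [13], [12], [11], [95]])   # 95 is an obvious anomaly

iso = IsolationForest(n_estimators=100, contamination=0.15,
                      random_state=0).fit(X)

# predict returns +1 for normal points and -1 for anomalies,
# so the last point should be flagged as -1
print(iso.predict(X))
```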
OUTLIER DETECTION METHODS
• OUTLIER (Anomaly): These are the values in a dataset which stand out from the rest of the data.
• Two simple statistical methods for detecting such outliers are:
  1. IQR (Interquartile Range)
  2. Z-Score
IQR ANOMALY DETECTION METHOD
IQR ANOMALY DETECTION METHOD
• IQR EXAMPLE
• The IQR method computes Q1 (the 25th percentile), Q3 (the 75th percentile) and IQR = Q3 - Q1; any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is treated as an outlier.
• We will remove the first and last point in this dataset (-50 & 1456), since they fall outside these limits.
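• A minimal sketch of the IQR method with NumPy; the data list is an illustrative assumption that includes the two outliers mentioned above (-50 and 1456).

```python
import numpy as np

data = np.array([-50, 10, 12, 14, 15, 18, 20, 22, 25, 1456])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are treated as outliers
outliers = data[(data < lower) | (data > upper)]
print(outliers)                 # [ -50 1456]
```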
Z-SCORE ANOMALY DETECTION METHOD
Z-SCORE ANOMALY DETECTION METHOD
• Z SCORE EXAMPLE
• The Z-score of a value x is z = (x - mean) / standard deviation; values whose absolute Z-score exceeds a chosen threshold (commonly 2 or 3) are treated as outliers.
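• A minimal sketch of the Z-score method with NumPy; the data values and the threshold of 2 are illustrative assumptions.

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

z_scores = (data - data.mean()) / data.std()

# Values whose absolute Z-score exceeds the chosen threshold are outliers
outliers = data[np.abs(z_scores) > 2]
print(outliers)                 # [102]
```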
IQR & Z-SCORE ANOMOLY
DETECTION EXAMPLES
153
ISOLATION FOREST
• IQR EXAMPLE
• [10,20,15,-30,22,60,45,-24,55,81]
• Z SCORE EXAMPLE
• [100,150,120,200,250,101,230,330,450,500]
22-05-2025