
MACHINE LEARNING (ML)

BTCSE023602

UNIT-IV
UNSUPERVISED LEARNING

Mr. Naresh A. Kamble


Asst. Professor,
Department of Computer Science and Engineering
D.Y.Patil Agriculture and Technical University, Talsande, Kolhapur
COURSE OUTCOME
At the end of this unit, students will be able to:

CO3: Analyze and implement unsupervised learning techniques.

2
CONTENTS
• Clustering Algorithms

• Dimensionality Reduction

• Anomaly Detection

3
CLUSTERING

4
CLUSTERING
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset.

• It can be defined as "a way of grouping the data points into different clusters, consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."

5
CLUSTERING
• It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, or behavior, and divides the data points according to the presence or absence of those patterns.

• The clustering technique is commonly used for statistical data analysis.

6
CLUSTERING
• Example: Let's understand the clustering technique with a real-world example of a shopping mall. When we visit a mall, we can observe that items with similar usage are grouped together.

• For example, t-shirts are grouped in one section and trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc., are grouped separately, so that we can easily find the things we need.

7
CLUSTERING
• The below diagram explains the working of the clustering
algorithm. We can see the different fruits are divided into
several groups with similar properties.

8
CLUSTERING
• HARD CLUSTERING

• In this type of clustering, each data point either belongs to a cluster completely or does not belong to it at all.

• For example, let's say there are 4 data points and we have to cluster them into 2 clusters.

• So each data point will either belong to cluster 1 or to cluster 2.


9
CLUSTERING
• HARD CLUSTERING

10
CLUSTERING
• SOFT CLUSTERING
• In this type of clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that point belonging to each cluster is evaluated.

• For example, let's say there are 4 data points and we have to cluster them into 2 clusters.

• So we will evaluate, for each data point, the probability of it belonging to both clusters; this probability is calculated for all data points.
11
CLUSTERING
• SOFT CLUSTERING

12
CLUSTERING
• TYPES OF CLUSTERING

• Partitioning Clustering

• Density-Based Clustering

• Distribution Model-Based Clustering

• Hierarchical Clustering

• Fuzzy Clustering

13
CLUSTERING
• PARTITIONING CLUSTERING

• It is a type of clustering that divides the data into non-


hierarchical groups.

• It is also known as the centroid-based method.

• The most common example of partitioning clustering is the K-


Means Clustering algorithm.
14
CLUSTERING
• PARTITIONING CLUSTERING

• In this type, the dataset is divided into a set of k groups,


where K is used to define the number of pre-defined groups.

• The cluster centers are created in such a way that the distance between the data points and their own cluster centroid is minimal compared with the distance to any other cluster centroid.

15
CLUSTERING
• PARTITIONING CLUSTERING

16
CLUSTERING
• DENSITY BASED CLUSTERING

• The density-based clustering method connects the highly-


dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be
connected.

• The algorithm does this by identifying dense regions in the dataset and connecting areas of high density into clusters.
17
CLUSTERING
• DENSITY BASED CLUSTERING

18
CLUSTERING
• DISTRIBUTION MODEL-BASED CLUSTERING

• In the distribution model-based clustering method, the data is divided based on the probability that each data point belongs to a particular distribution.

19
CLUSTERING
• DISTRIBUTION MODEL-BASED CLUSTERING

20
CLUSTERING
• HIERARCHICAL CLUSTERING

• In this technique, the dataset is divided into clusters to create


a tree-like structure, which is also called a dendrogram.

• The observations or any number of clusters can be selected by


cutting the tree at the correct level.

21
CLUSTERING
• HIERARCHICAL CLUSTERING

22
CLUSTERING
• FUZZY CLUSTERING

• Fuzzy clustering is a type of soft method in which a data


object may belong to more than one group or cluster.

• Fuzzy C-means algorithm is the example of this type of


clustering; it is sometimes also known as the Fuzzy k-means
algorithm.

23
K-MEAN CLUSTERING ALGORITHM
• INTRODUCTION

• K-Means Clustering is an Unsupervised Learning algorithm, which groups the


unlabeled dataset into different clusters.

• Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.

• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

• It is a centroid-based algorithm, where each cluster is associated with a centroid.

• The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
K-MEAN CLUSTERING ALGORITHM
• INTRODUCTION

• The k-means clustering algorithm mainly performs two tasks:

• Determines the best value for K center points or centroids by an iterative


process.

• Assigns each data point to its closest k-center. Those data points which
are near to the particular k-center, create a cluster.

25
K-MEAN CLUSTERING ALGORITHM
• WORKING OF K-MEANS CLUSTERING ALGORITHM

• Step-1: Select the number K to decide the number of clusters.

• Step-2: Select K random points as the initial centroids (they need not be taken from the input dataset).

• Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

• Step-4: Calculate the variance and place a new centroid for each cluster.

• Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.

• Step-6: If any reassignment occurs, go to step-4; otherwise go to FINISH.

• Step-7: The model is ready. (A minimal code sketch of these steps follows below.)
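• A minimal sketch of these steps in Python for one-dimensional data. The data points and initial centroids are the ones used in Example 1 further below; the helper function kmeans_1d is only an illustration of the procedure, not a library routine.

```python
# Minimal 1-D K-Means sketch (data and initial centroids from Example 1 below).
def kmeans_1d(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Step-3: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step-4/5: recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step-6: stop when no centroid changes any more.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d([3, 5, 11, 13, 4, 21, 31, 11, 25], [5, 11])
print(centroids)   # approximately [7.83, 25.67]
print(clusters)    # [[3, 5, 11, 13, 4, 11], [21, 31, 25]]
```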


27
K-MEAN CLUSTERING ALGORITHM
• EXAMPLE

• Data Points: {2, 4, 10, 12, 3, 20, 30, 11, 25}
K-MEAN CLUSTERING ALGORITHM

C1 = (2 + 4 + 3) / 3 = 9 / 3 = 3
C2 = (10 + 12 + 20 + 30 + 11 + 25) / 6 = 108 / 6 = 18

30
K-MEAN CLUSTERING ALGORITHM

C1 = (2 + 4 + 10 + 3) / 4 = 19 / 4 = 4.75
C2 = (12 + 20 + 30 + 11 + 25) / 5 = 98 / 5 = 19.6

31
K-MEAN CLUSTERING ALGORITHM

C1 = (2 + 4 + 10 + 11 + 12 + 3) / 6 = 42 / 6 = 7
C2 = (20 + 30 + 25) / 3 = 75 / 3 = 25

32
K-MEANS EXAMPLES TO BE
SOLVED BY STUDENTS

34
K-MEAN CLUSTERING ALGORITHM
• EXAMPLE 1

• Data Points : { 3, 5, 11, 13, 4, 21, 31, 11, 25}


– Initial Centroids: M1 = 5, M2 = 11

35
K-MEAN CLUSTERING ALGORITHM
• Solution

Initial Centroids: M1 = 5, M2 = 11

Data Point   Distance to M1   Distance to M2   Cluster
3            2                8                C1
5            0                6                C1
11           6                0                C2
13           8                2                C2
4            1                7                C1
21           16               10               C2
31           26               20               C2
11           6                0                C2
25           20               14               C2

Clusters:
C1 = {3, 5, 4}
C2 = {11, 13, 21, 31, 11, 25}

New Centroids:
M1 = (3 + 5 + 4) / 3 = 12 / 3 = 4
M2 = (11 + 13 + 21 + 31 + 11 + 25) / 6 = 112 / 6 = 18.67
K-MEAN CLUSTERING ALGORITHM
• Solution

Current Centroids: M1 = 4, M2 = 18.67

Data Point   Distance to M1   Distance to M2   Cluster   New Cluster
3            1                15.67            C1        C1
5            1                13.67            C1        C1
11           7                7.67             C2        C1
13           9                5.67             C2        C2
4            0                14.67            C1        C1
21           17               2.33             C2        C2
31           27               12.33            C2        C2
11           7                7.67             C2        C1
25           21               6.33             C2        C2

Clusters:
C1 = {3, 5, 11, 4, 11}
C2 = {13, 21, 31, 25}

New Centroids:
M1 = (3 + 5 + 11 + 4 + 11) / 5 = 34 / 5 = 6.8
M2 = (13 + 21 + 31 + 25) / 4 = 90 / 4 = 22.5
K-MEAN CLUSTERING ALGORITHM
• Solution

Current Centroids: M1 = 6.8, M2 = 22.5

Data Point   Distance to M1   Distance to M2   Cluster   New Cluster
3            3.8              19.5             C1        C1
5            1.8              17.5             C1        C1
11           4.2              11.5             C1        C1
13           6.2              9.5              C2        C1
4            2.8              18.5             C1        C1
21           14.2             1.5              C2        C2
31           24.2             8.5              C2        C2
11           4.2              11.5             C1        C1
25           18.2             2.5              C2        C2

Clusters:
C1 = {3, 5, 11, 13, 4, 11}
C2 = {21, 31, 25}

New Centroids:
M1 = (3 + 5 + 11 + 13 + 4 + 11) / 6 = 47 / 6 = 7.83
M2 = (21 + 31 + 25) / 3 = 77 / 3 = 25.67
K-MEAN CLUSTERING ALGORITHM
• Solution

Current Centroids: M1 = 7.83, M2 = 25.67

Data Point   Distance to M1   Distance to M2   Cluster   New Cluster
3            4.83             22.67            C1        C1
5            2.83             20.67            C1        C1
11           3.17             14.67            C1        C1
13           5.17             12.67            C1        C1
4            3.83             21.67            C1        C1
21           13.17            4.67             C2        C2
31           23.17            5.33             C2        C2
11           3.17             14.67            C1        C1
25           17.17            0.67             C2        C2

No data point changes its cluster, so the algorithm has converged with final clusters
C1 = {3, 5, 11, 13, 4, 11} and C2 = {21, 31, 25}, and centroids M1 = 7.83, M2 = 25.67.
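• One possible way to cross-check this result, assuming scikit-learn is available; the explicit init array mirrors the chosen initial centroids M1 = 5 and M2 = 11, and n_init=1 keeps that single start.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([3, 5, 11, 13, 4, 21, 31, 11, 25], dtype=float).reshape(-1, 1)

# n_init=1 with an explicit init array reproduces the hand calculation above.
km = KMeans(n_clusters=2, init=np.array([[5.0], [11.0]]), n_init=1)
km.fit(X)

print(km.cluster_centers_.ravel())  # approximately [ 7.83 25.67]
print(km.labels_)                   # cluster index of each data point
```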
K-MEAN CLUSTERING ALGORITHM
• EXAMPLE 2

• Data Points : { 4, 5, 12, 11, 3, 20, 28, 10, 24}


– Initial Centroids: M1 = 4, M2 = 10

40
K-MEAN CLUSTERING ALGORITHM
• Solution

Initial Centroids: M1 = 4, M2 = 10

Data Point   Distance to M1   Distance to M2   Cluster
4            0                6                C1
5            1                5                C1
12           8                2                C2
11           7                1                C2
3            1                7                C1
20           16               10               C2
28           24               18               C2
10           6                0                C2
24           20               14               C2

Clusters:
C1 = {4, 5, 3}
C2 = {12, 11, 20, 28, 10, 24}

New Centroids: M1 = 4, M2 = 17.5
K-MEAN CLUSTERING ALGORITHM
• Solution

Current Centroids: M1 = 4, M2 = 17.5

Data Point   Distance to M1   Distance to M2   Cluster   New Cluster
4            0                13.5             C1        C1
5            1                12.5             C1        C1
12           8                5.5              C2        C2
11           7                6.5              C2        C2
3            1                14.5             C1        C1
20           16               2.5              C2        C2
28           24               10.5             C2        C2
10           6                7.5              C2        C1
24           20               6.5              C2        C2

Clusters:
C1 = {4, 5, 3, 10}
C2 = {12, 11, 20, 28, 24}

New Centroids: M1 = 5.5, M2 = 19
K-MEAN CLUSTERING ALGORITHM
• Solution

Current Centroids: M1 = 5.5, M2 = 19

Data Point   Distance to M1   Distance to M2   Cluster   New Cluster
4            1.5              15               C1        C1
5            0.5              14               C1        C1
12           6.5              7                C1        C1
11           5.5              8                C1        C1
3            2.5              16               C1        C1
20           14.5             1                C2        C2
28           22.5             9                C2        C2
10           4.5              9                C1        C1
24           18.5             5                C2        C2

Clusters:
C1 = {4, 5, 12, 11, 3, 10}
C2 = {20, 28, 24}

New Centroids:
M1 = (4 + 5 + 12 + 11 + 3 + 10) / 6 = 45 / 6 = 7.5
M2 = (20 + 28 + 24) / 3 = 72 / 3 = 24
K-MEAN CLUSTERING ALGORITHM
• Solution

Current Centroids: M1 = 7.5, M2 = 24

Data Point   Distance to M1   Distance to M2   Cluster   New Cluster
4            3.5              20               C1        C1
5            2.5              19               C1        C1
12           4.5              12               C1        C1
11           3.5              13               C1        C1
3            4.5              21               C1        C1
20           12.5             4                C2        C2
28           20.5             4                C2        C2
10           2.5              14               C1        C1
24           16.5             0                C2        C2

No data point changes its cluster, so the algorithm has converged with final clusters
C1 = {4, 5, 12, 11, 3, 10} and C2 = {20, 28, 24}, and centroids M1 = 7.5, M2 = 24.
HIERARCHICAL CLUSTERING
ALGORITHM
(HCA)

45
HIERARCHICAL CLUSTERING ALGORITHM
• INTRODUCTION

• Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).

• In this algorithm, we develop the hierarchy of clusters in the


form of a tree, and this tree-shaped structure is known as the
dendrogram.
46
HIERARCHICAL CLUSTERING ALGORITHM
• INTRODUCTION

• The hierarchical clustering technique has two approaches:

• Agglomerative: Agglomerative is a bottom-up approach, in


which the algorithm starts with taking all data points as single
clusters and merging them until one cluster is left.

• Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
HIERARCHICAL CLUSTERING ALGORITHM
• NEED OF HCA

• Problems with K-Means Algorithm

• It requires a predetermined number of clusters.

• It always tries to create the clusters of the same size.

48
AGGLOMERATIVE
HIERARCHICAL CLUSTERING

49
HIERARCHICAL CLUSTERING ALGORITHM
• To group the datasets into clusters, it follows the bottom-up
approach.

• It means this algorithm considers each data point as a single cluster at the beginning, and then starts combining the closest pair of clusters.

• It does this until all the clusters are merged into a single cluster that contains all the data points.

• This hierarchy of clusters is represented in the form of the


dendrogram.
50
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA

• Step-1: Create each data point as a single cluster. Let's say


there are N data points, so the number of clusters will also be
N.

51
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA

• Step-2: Take two closest data points or clusters and merge


them to form one cluster. So, there will now be N-1 clusters.

52
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA

• Step-3: Again, take the two closest clusters and merge them
together to form one cluster. There will be N-2 clusters.

53
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA

• Step-4: Repeat Step 3 until only one cluster is left.

54
HIERARCHICAL CLUSTERING ALGORITHM
• WORKING OF AGGLOMERATIVE HCA

• Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem (see the SciPy sketch below).
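• A small sketch of these steps using SciPy, assuming the one-dimensional data of the worked example that follows; method='single' merges the pair of clusters with the smallest minimum distance, which matches the "closest pair" rule above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# One-dimensional data points from the worked example below.
points = np.array([18, 22, 25, 27, 42, 43], dtype=float).reshape(-1, 1)

# Steps 1-4: repeatedly merge the two closest clusters (single linkage).
Z = linkage(points, method='single')

# Step 5: draw the dendrogram of the merges.
dendrogram(Z, labels=['18', '22', '25', '27', '42', '43'])
plt.show()
```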

55
AGGLOMERATIVE
HIERARCHICAL CLUSTERING
EXAMPLE

56
HIERARCHICAL CLUSTERING ALGORITHM
• EXAMPLE

• Consider the following set of 6 one-dimensional data points: 18, 22, 25, 27, 42, 43
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-1: Create a matrix of given datapoints and find
minimum distance.

        18    22    25    27    42    43
18       0     4     7     9    24    25
22       4     0     3     5    20    21
25       7     3     0     2    17    18
27       9     5     2     0    15    16
42      24    20    17    15     0     1
43      25    21    18    16     1     0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-2: Merge the minimum distance datapoints into single
cluster

• Cluster 1 – (42,43)

Updated distance matrix (the distance between two clusters is the minimum distance between their members):

           18    22    25    27   (42,43)
18          0     4     7     9     24
22          4     0     3     5     20
25          7     3     0     2     17
27          9     5     2     0     15
(42,43)    24    20    17    15      0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-3: Repeat step 2 and merge the minimum distance
datapoints into single cluster

           18    22    25    27   (42,43)
18          0     4     7     9     24
22          4     0     3     5     20
25          7     3     0     2     17
27          9     5     2     0     15
(42,43)    24    20    17    15      0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-3: Repeat step 2 and merge the minimum distance
datapoints into single cluster

• Cluster 1 & 2 – [(42,43), (25,27)]

           18    22   (25,27)  (42,43)
18          0     4      7       24
22          4     0      3       20
(25,27)     7     3      0       15
(42,43)    24    20     15        0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-4: Repeat step 2 and merge the minimum distance
datapoints into single cluster

           18    22   (25,27)  (42,43)
18          0     4      7       24
22          4     0      3       20
(25,27)     7     3      0       15
(42,43)    24    20     15        0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-4: Repeat step 2 and merge the minimum distance
datapoints into single cluster

• Cluster 1, 2 & 3 – [(42,43), ((25,27),22)]

              18   (25,27,22)  (42,43)
18             0       4         24
(25,27,22)     4       0         15
(42,43)       24      15          0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-5: Repeat step 2 and merge the minimum distance
datapoints into single cluster

              18   (25,27,22)  (42,43)
18             0       4         24
(25,27,22)     4       0         15
(42,43)       24      15          0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-5: Repeat step 2 and merge the minimum distance
datapoints into single cluster
              18   (25,27,22)  (42,43)
18             0       4         24
(25,27,22)     4       0         15
(42,43)       24      15          0

• Cluster 1, 2, 3 & 4 – [(42,43), (((25,27),22),18)]

                 (25,27,22,18)   (42,43)
(25,27,22,18)          0           15
(42,43)               15            0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-5: Repeat step 2 and merge the minimum distance
datapoints into single cluster

                 (25,27,22,18)   (42,43)
(25,27,22,18)          0           15
(42,43)               15            0
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-6: Now convert the final clusters into dendrogram

• Cluster 1, 2, 3 & 4 – [(42,43), (((25,27),22),18)]

                       (25,27,22,18,42,43)
(25,27,22,18,42,43)            0
HIERARCHICAL CLUSTERING ALGORITHM
• DENDROGRAM - [(42,43),(((25,27),22),18)]

[Dendrogram: Cluster 1 = (42,43); Cluster 2 = (25,27); Cluster 3 = ((25,27),22); Cluster 4 = (((25,27),22),18); the final cluster at the top of the tree contains all six points.]
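• The dendrogram can also be cut programmatically; a sketch assuming SciPy, which recovers the two top-level groups (18, 22, 25, 27) and (42, 43) of this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([18, 22, 25, 27, 42, 43], dtype=float).reshape(-1, 1)
Z = linkage(points, method='single')

# Cut the tree so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 1 2 2]: {18, 22, 25, 27} and {42, 43}
```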
AGGLOMERATIVE
HIERARCHICAL CLUSTERING
EXAMPLE FOR STUDENTS

69
HIERARCHICAL CLUSTERING ALGORITHM
• EXAMPLE

• Consider the following set of 6 one dimensional data points:

• 20,24,27,44,29,45

70
HIERARCHICAL CLUSTERING ALGORITHM
• STEP-1: Create a matrix of given datapoints and find
minimum distance.

        20    24    27    29    44    45
20       0     4     7     9    24    25
24       4     0     3     5    20    21
27       7     3     0     2    17    18
29       9     5     2     0    15    16
44      24    20    17    15     0     1
45      25    21    18    16     1     0
DBSCAN ALGORITHM

72
DBSCAN ALGORITHM
• INTRODUCTION

• DBSCAN, which stands for Density-Based Spatial Clustering of


Applications with Noise, is a powerful clustering algorithm
that groups points that are closely packed together in data
space.

• Unlike some other clustering algorithms, DBSCAN doesn't


require you to specify the number of clusters beforehand.

• The algorithm works by defining clusters as dense regions


separated by regions of lower density.
73
DBSCAN ALGORITHM
• KEY CONCEPTS

• Core Points: These are points that have at least a minimum


number of other points (MinPts) within a specified distance
(ε or epsilon).

• Border Points: These are points that are within the ε distance
of a core point but don't have MinPts neighbors themselves.

• Noise Points: These are points that are neither core points
nor border points. They're not close enough to any cluster to
be included.
74
DBSCAN ALGORITHM
• PARAMETERS IN DBSCAN

• ε (epsilon): The maximum distance between two points for


them to be considered as neighbors.

• MinPts: The minimum number of points required to form a


dense region.

76
DBSCAN ALGORITHM
• WORKING OF DBSCAN

• STEP-1: Parameter Selection


– Choose ε (epsilon): The maximum distance between two points for
them to be considered as neighbors.

– Choose MinPts: The minimum number of points required to form a


dense region.

• STEP-2: Select a Starting Point


– The algorithm starts with an arbitrary unvisited point in the dataset.

78
DBSCAN ALGORITHM
• STEP-3: Examine the Neighborhood
– It retrieves all points within the ε distance of the starting point.

– If the number of neighboring points is less than MinPts, the point is


labeled as noise (for now).

– If there are at least MinPts points within ε distance, the point is


marked as a core point, and a new cluster is formed.

79
DBSCAN ALGORITHM
• STEP-4: Expand the Cluster
– All the neighbors of the core point are added to the cluster.

– For each of these neighbors:

• If it's a core point, its neighbors are added to the cluster


recursively.

• If it's not a core point, it's marked as a border point, and the
expansion stops.

80
DBSCAN ALGORITHM
• STEP-5: Repeat the Process
– The algorithm moves to the next unvisited point in the dataset.

– Steps 3-4 are repeated until all points have been visited.

• STEP-6: Finalize Clusters


– After all points have been processed, the algorithm identifies all
clusters.

– Points initially labeled as noise might now be border points if they're


within ε distance of a core point.

• STEP-7: Handling Noise


– Any points not belonging to any cluster remain classified as noise. (A minimal scikit-learn sketch follows below.)
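• A minimal sketch of these steps with scikit-learn's DBSCAN on illustrative data; eps plays the role of ε and min_samples the role of MinPts.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative 2-D data: two dense groups plus one far-away point.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.3], [8.1, 7.8],
              [4.5, 0.0]])

# eps corresponds to the ε radius and min_samples to MinPts.
db = DBSCAN(eps=0.6, min_samples=3).fit(X)

# Label -1 marks noise; other labels are cluster indices.
print(db.labels_)  # e.g. [0 0 0 0 1 1 1 1 -1]
```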
DBSCAN ALGORITHM
EXAMPLE

83
DBSCAN ALGORITHM
• STEP-1: Create a proximity matrix with the distance between each pair of data points.

• MinPts = 4 and epsilon (ε) = 1.9


POINT X1 Y1 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
P1 3 7 0
P2 4 6 1.41 0
P3 5 5 2.83 1.41 0
P4 6 4 4.24 2.83 1.41 0
P5 7 3 5.66 4.24 2.83 1.41 0
P6 6 2 5.83 4.47 3.16 2.00 1.41 0
P7 7 2 6.40 5.00 3.61 2.24 1.00 1.00 0
P8 8 4 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0
P9 3 3 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0
P10 2 6 1.41 2.00 3.16 4.47 5.83 5.66 6.40 6.32 3.16 0
P11 3 5 2.00 1.41 2.00 3.16 4.47 4.24 5.00 5.10 2.00 1.41 0
P12 2 4 3.16 2.83 3.16 4.00 5.10 4.47 5.39 6.00 1.41 2.00 1.41 0
DBSCAN ALGORITHM
• STEP-2: Find the points that lie within ε of each data point (its neighbors).

• MinPts = 4 and epsilon (ε) = 1.9

Point   Neighbors within ε
P1      P2, P10
P2      P1, P3, P11
P3      P2, P4
P4      P3, P5
P5      P4, P6, P7, P8
P6      P5, P7
P7      P5, P6
P8      P5
P9      P12
P10     P1, P11
P11     P2, P10, P12
P12     P9, P11
DBSCAN ALGORITHM
• STEP-3: Identify CORE, BORDER and NOISE data points.

• MinPts = 4 and epsilon (ε) = 1.9

Point   Neighbors within ε   Status
P1      P2, P10              BORDER
P2      P1, P3, P11          CORE
P3      P2, P4               BORDER
P4      P3, P5               BORDER
P5      P4, P6, P7, P8       CORE
P6      P5, P7               BORDER
P7      P5, P6               BORDER
P8      P5                   BORDER
P9      P12                  NOISE
P10     P1, P11              BORDER
P11     P2, P10, P12         CORE
P12     P9, P11              BORDER

P2, P5 and P11 are core points (each has at least MinPts = 4 points, counting the point itself, within ε); P9 is noise; all other points are border points that lie within ε of a core point.
DBSCAN ALGORITHM
• STEP-4: Plot the clusters based on the CORE data points.

• Each core point together with the points in its ε-neighborhood forms a cluster; P9 does not fall in any neighborhood and remains NOISE.

[Figure: the 12 points plotted in the plane; three clusters are shown — one around core point P2, one around core point P5, and one around core point P11 — with P9 marked as NOISE.]
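• One way to check this worked example, assuming scikit-learn; the coordinates are taken from the proximity-matrix slide, with eps = 1.9 and min_samples = 4. Note that scikit-learn merges density-connected core points, so the neighborhoods of the core points P2 and P11 come out as a single cluster, P5's neighborhood forms the second cluster, and P9 is labeled noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# P1 ... P12 from the example above.
X = np.array([[3, 7], [4, 6], [5, 5], [6, 4], [7, 3], [6, 2],
              [7, 2], [8, 4], [3, 3], [2, 6], [3, 5], [2, 4]], dtype=float)

db = DBSCAN(eps=1.9, min_samples=4).fit(X)

print(db.labels_)               # e.g. [ 0  0  0  1  1  1  1  1 -1  0  0  0]
print(db.core_sample_indices_)  # [ 1  4 10] -> the core points P2, P5 and P11
```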
DBSCAN ALGORITHM
EXAMPLE TO SOLVE

90
DBSCAN ALGORITHM
• Apply the DBSCAN algorithm to the given data points.

• Create the clusters with MinPts = 3 and epsilon (ε) = 1.5

POINT COORDINATES
P1 4 8
P2 5 7
P3 6 6
P4 7 5
P5 8 3
P6 7 3
P7 8 3
P8 9 5
P9 4 4
P10 3 7
P11 4 6
P12 3 5
91
DIMENSIONALITY
REDUCTION

92
DIMENSIONALITY REDUCTION
• DIMENSIONALITY REDUCTION

• The number of input features, variables, or columns present


in a given dataset is known as dimensionality, and the
process to reduce these features is called dimensionality
reduction.

93
DIMENSIONALITY REDUCTION
• INTRODUCTION

• A dataset contains a huge number of input features in various


cases, which makes the predictive modeling task more
complicated.

• Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
94
DIMENSIONALITY REDUCTION
• INTRODUCTION

• The dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information."

• It is commonly used in the fields that deal with high-


dimensional data, such as speech recognition, signal
processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
95
DIMENSIONALITY REDUCTION
• TYPES OF DIMENSIONALITY REDUCTION

• FEATURE SELECTION

• In this method, we are interested in finding k of the total of n


features that give us the most information and we discard the
other (n-k) dimensions.

• Example - Suppose we have 10 features (n=10) provided for a


dataset but we need only 7 features (k=7), then we will
discard the rest 3 features (n-k = 10-7 = 3).
96
DIMENSIONALITY REDUCTION
• TYPES OF DIMENSIONALITY REDUCTION

• FEATURE EXTRACTION

• In this method, we are interested in finding a new set of k features that are combinations of the original n features.

• There are 2 types of feature extraction techniques

• 1. Principal Component Analysis (PCA)

• 2. Linear Discriminant Analysis (LDA)


97
PRINCIPAL COMPONENT
ANALYSIS (PCA)

98
PRINCIPAL COMPONENT ANALYSIS (PCA)
• INTRODUCTION

• Principal Component Analysis is an unsupervised learning


algorithm that is used for the dimensionality reduction in
machine learning.

• It is a statistical process that converts the observations of


correlated features into a set of linearly uncorrelated
features with the help of orthogonal transformation.

• These new transformed features are called the Principal


Components.
99
PRINCIPAL COMPONENT ANALYSIS (PCA)
• COMMON TERMS IN PCA

• Dimensionality: It is the number of features or variables


present in the given dataset. More easily, it is the number of
columns present in the dataset.

• Correlation: It signifies that how strongly two variables are


related to each other. Such as if one changes, the other
variable also gets changed.

100
PRINCIPAL COMPONENT ANALYSIS (PCA)
• COMMON TERMS IN PCA

• Orthogonal: It defines that variables are not correlated to


each other, and hence the correlation between the pair of
variables is zero.

• Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.

• Covariance Matrix: A matrix containing the covariance


between the pair of variables is called the Covariance Matrix.
101
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA

• Getting the dataset: Firstly, we need to take the input dataset


and divide it into two subparts X and Y, where X is the
training set, and Y is the validation set.

• Representing data into a structure: Now we will represent


our dataset into a structure. Such as we will represent the
two-dimensional matrix of independent variable X. Here
each row corresponds to the data items, and the column
corresponds to the Features.
102
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA

• Standardizing the data: In this step, we will standardize our


dataset. Such as in a particular column, the features with high
variance are more important compared to the features with
lower variance.

• If the importance of features is independent of the variance


of the feature, then we will divide each data item in a column
with the standard deviation of the column. Here we will
name the matrix as Z.
103
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA
• Calculating the Covariance of Z: To calculate the covariance of Z, we will
take the matrix Z, and will transpose it. After transpose, we will multiply it
by Z. The output matrix will be the Covariance matrix of Z.

• Calculating the Eigen Values and Eigen Vectors: Now we need to calculate the eigenvalues and eigenvectors for the resultant covariance matrix of Z. Eigenvectors of the covariance matrix are the directions of the axes with high information, and the coefficients of these eigenvectors are defined as the eigenvalues.

104
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA

• Sorting the Eigen Vectors: In this step, we will take all the
eigenvalues and will sort them in decreasing order, which
means from largest to smallest.

• And simultaneously sort the eigenvectors accordingly in


matrix P of eigenvalues. The resultant matrix will be named
as P*.

105
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA

• Calculating the new features Or Principal Components: Here


we will calculate the new features.

• To do this, we will multiply the P* matrix to the Z.

• In the resultant matrix Z*, each observation is the linear


combination of original features.

• Each column of the Z* matrix is independent of each other.

106
PRINCIPAL COMPONENT ANALYSIS (PCA)
• WORKING OF PCA

• Remove less important features from the new dataset: The new feature set has been obtained, so we decide here what to keep and what to remove.

• It means we will only keep the relevant or important features in the new dataset, and the unimportant features will be removed (see the sketch below).
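• A compact sketch of these steps with NumPy. The matrix X is an illustrative assumption (no dataset is given here); the code standardizes the data, builds the covariance matrix, sorts the eigenvectors by decreasing eigenvalue, and projects onto the top k principal components.

```python
import numpy as np

# Illustrative data: 5 observations (rows) x 3 features (columns).
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

# Standardize each column (feature) of X to obtain Z.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of Z (features x features).
cov = np.cov(Z, rowvar=False)

# Eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort eigenvectors by decreasing eigenvalue and keep the top k of them (P*).
order = np.argsort(eigvals)[::-1]
k = 2
P_star = eigvecs[:, order[:k]]

# Project the standardized data onto the principal components (Z*).
Z_star = Z @ P_star
print(Z_star)  # each row is an observation expressed in the new k features
```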

107
PRINCIPAL COMPONENT
ANALYSIS (PCA) SOLVED
EXAMPLES

108
PRINCIPAL COMPONENT ANALYSIS (PCA)
• 2D TRANSFORMATIONS [2-DIMENSIONAL]
• The name 2D is given because there are two axes, X and Y.

[Figure: the X-Y coordinate plane; a 2D object can be seen in only one view, the front view.]
PRINCIPAL COMPONENT ANALYSIS (PCA)

• TYPES OF 2D TRANSFORMATION

• The main types covered here are mentioned below.

• 1. TRANSLATION

• 2. SCALING
PRINCIPAL COMPONENT ANALYSIS (PCA)

• 1. TRANSLATION [T]
• It is a process of changing the position of object in a straight-
line path from one coordinate location to another

x' = x + tx
y' = y + ty

The translation distance pair (tx, ty) is called the translation vector or shift vector.

[Figure: point P(x, y) translated to P'(x', y') by tx units along the X-axis and ty units along the Y-axis.]
PRINCIPAL COMPONENT ANALYSIS (PCA)

• The above translation equations can be represented in a single matrix form:

P' = P + T, where

P = [ x ]    P' = [ x' ]    T = [ tx ]
    [ y ]         [ y' ]        [ ty ]
PRINCIPAL COMPONENT ANALYSIS (PCA)

• Example – Translate a polygon with co-ordinates A(2,5),


B(7,10) and C(10,2) by 3 units in x direction and 4 units in y
direction.
P' = P + T

A' = A + T = [ 2 ] + [ 3 ] = [ 5 ]
             [ 5 ]   [ 4 ]   [ 9 ]

B' = B + T = [ 7  ] + [ 3 ] = [ 10 ]
             [ 10 ]   [ 4 ]   [ 14 ]

C' = C + T = [ 10 ] + [ 3 ] = [ 13 ]
             [ 2  ]   [ 4 ]   [ 6  ]
PRINCIPAL COMPONENT ANALYSIS (PCA)

• OUTPUT

[Figure: the original triangle A(2,5), B(7,10), C(10,2) and the translated triangle A'(5,9), B'(10,14), C'(13,6) plotted on the X-Y plane.]
PRINCIPAL COMPONENT ANALYSIS (PCA)

• 2. SCALING [S]
• This transformation changes the size of an object.

• This transformation can be carried out for polygons by multiplying the coordinate values (x, y) of each vertex by scaling factors Sx and Sy to produce the transformed coordinates (x', y').

x' = x * Sx
y' = y * Sy
PRINCIPAL COMPONENT ANALYSIS (PCA)

• The above scaling equations can be represented in a single matrix form:

P' = P x S

P = [ x  y ],   P' = [ x'  y' ],   S = [ Sx   0 ]
                                       [ 0   Sy ]
PRINCIPAL COMPONENT ANALYSIS (PCA)

• CONDITIONS FOR SCALING


• Any positive numeric values are valid for scaling factors Sx and
Sy

• Values less than 1 reduce the size of the object.

• Values greater than 1 enlarge the size of the object.

• For both Sx and Sy if the value is 1 then the size of the object
remains same.

• To get uniform scaling the values of Sx and Sy should be equal.


PRINCIPAL COMPONENT ANALYSIS (PCA)

• Example – Scale the polygon with co-ordinates A(2,5), B(7,10)


and C(10,2) by 2 units in x direction and 2 units in y direction.

P' = P x S

[ A' ]   [  2   5 ]   [ 2  0 ]   [  4  10 ]
[ B' ] = [  7  10 ] x [ 0  2 ] = [ 14  20 ]
[ C' ]   [ 10   2 ]              [ 20   4 ]

So A' = (4, 10), B' = (14, 20), C' = (20, 4).
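• Both transformations can be verified with a few lines of NumPy; a sketch using the polygon from the two worked examples above.

```python
import numpy as np

# Vertices of the polygon, one (x, y) pair per row.
P = np.array([[2, 5], [7, 10], [10, 2]])

# Translation by (tx, ty) = (3, 4): P' = P + T.
T = np.array([3, 4])
print(P + T)   # [[ 5  9] [10 14] [13  6]]

# Scaling by (Sx, Sy) = (2, 2): P' = P x S.
S = np.array([[2, 0], [0, 2]])
print(P @ S)   # [[ 4 10] [14 20] [20  4]]
```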
PRINCIPAL COMPONENT
ANALYSIS (PCA) EXAMPLE
TO SOLVE

119
PRINCIPAL COMPONENT ANALYSIS (PCA)

• Translation – Translate a polygon with co-ordinates A(4,6),


B(8,12) and C(12,4) by 4 units in x direction and 5 units in y
direction.

• Scaling – Scale the polygon with co-ordinates A(4,6), B(8,12)


and C(12,4) by 3 units in x direction and 3 units in y direction.

LINEAR DISCRIMINANT
ANALYSIS (LDA)

121
LINEAR DISCRIMINANT ANALYSIS (LDA)

• Linear Discriminant Analysis (LDA) is one of the commonly


used dimensionality reduction techniques in machine
learning to solve more than two-class classification
problems.

LINEAR DISCRIMINANT ANALYSIS (LDA)

• What is Linear Discriminant Analysis (LDA)?

• Linear Discriminant analysis is one of the most popular


dimensionality reduction techniques used for supervised
classification problems in machine learning.

• It is also considered a pre-processing step for modeling


differences in ML and applications of pattern classification.

LINEAR DISCRIMINANT ANALYSIS (LDA)

• What is Linear Discriminant Analysis (LDA)?

• For e.g., if we have two classes with multiple features and


need to separate them efficiently.

• When we classify them using a single feature, then it may


show overlapping.

• To overcome the overlapping issue in the classification


process, we must increase the number of features regularly.

LINEAR DISCRIMINANT ANALYSIS (LDA)

• EXAMPLE

• Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below:
LINEAR DISCRIMINANT ANALYSIS (LDA)

• EXAMPLE

22-05-2025
LINEAR DISCRIMINANT ANALYSIS (LDA)

• To create a new axis, Linear Discriminant Analysis uses the


following criteria:

• It maximizes the distance between means of two classes.

• It minimizes the variance within the individual class.
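• A short sketch of LDA used for dimensionality reduction, assuming scikit-learn; the data points and class labels are illustrative. Because LDA is supervised, the class labels y are required.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative 2-D data with two classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5], [6.2, 7.2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Project the data onto the single axis that best separates the two classes.
lda = LinearDiscriminantAnalysis(n_components=1)
X_new = lda.fit_transform(X, y)

print(X_new.ravel())  # one value per sample along the discriminant axis
```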

ANOMALY DETECTION

128
ANOMALY DETECTION

• Anomaly Detection is the technique of identifying rare


events or observations which can raise suspicions by being
statistically different from the rest of the observations.

• Such "anomalous" behavior typically translates to some kind of problem, such as credit card fraud, a failing machine in a server, a cyber attack, etc.

ANOMALY DETECTION

• An anomaly can be broadly categorized into three categories –

• Point Anomaly: A tuple in a dataset is said to be a Point


Anomaly if it is far off from the rest of the data.

• Contextual Anomaly: An observation is a Contextual Anomaly


if it is an anomaly because of the context of the observation.

• Collective Anomaly: A set of data instances is anomalous as a whole, even if the individual instances are not anomalous on their own.

ISOLATION FOREST

131
ISOLATION FOREST

• Isolation Forest is an unsupervised anomaly detection algorithm


particularly effective for high-dimensional data.

• It operates under the principle that anomalies are rare and distinct,
making them easier to isolate from the rest of the data.

• Unlike other methods that profile normal data, Isolation Forests


focus on isolating anomalies.

• At its core, the Isolation Forest algorithm banks on the fundamental concept that anomalies deviate significantly from the rest of the data, thereby making them easier to identify.
ISOLATION FOREST

• The workings of isolation forests are defined below:

• Building Isolation Trees: The algorithm starts by creating a set of


isolation trees, typically hundreds or even thousands of them.

• These trees are similar to traditional decision trees, but with a key
difference: they are not built to classify data points into specific
categories.

• Instead, isolation trees aim to isolate individual data points by


repeatedly splitting the data based on randomly chosen features
and split values.
ISOLATION FOREST

• Splitting on Random Features: Isolation trees introduce randomness: at each node of the tree, a random feature from the dataset is selected.

• Then, a random split value is chosen within the range of that


particular feature's values.

• This randomness helps ensure that anomalies, which tend to


be distinct from the majority of data points, are not hidden
within specific branches of the tree.
ISOLATION FOREST

• Isolating Data Points: The data points are then directed


down the branches of the isolation tree based on their
feature values.
– If a data point's value for the chosen feature falls below the split
value, it goes to the left branch. Otherwise, it goes to the right
branch.

– This process continues recursively until the data point reaches a leaf
node, which simply represents the isolated data point.
ISOLATION FOREST

• Anomaly Score: The key concept behind Isolation Forests lies


in the path length of a data point through an isolation tree.
– Anomalies, by virtue of being different from the majority, tend to be
easier to isolate. They require fewer random splits to reach a leaf node
because they are likely to fall outside the typical range of values for
the chosen features.

– Conversely, normal data points, which share more similarities with


each other, might require more splits on their path down the tree
before they are isolated.

ISOLATION FOREST

• Anomaly Score Calculation: Each data point is evaluated


through all the isolation trees in the forest.
– For each tree, the path length (number of splits) required to isolate
the data point is recorded.

– An anomaly score is then calculated for each data point by averaging


the path lengths across all the isolation trees in the forest.

ISOLATION FOREST

• Identifying Anomalies: Data points with shorter average path


lengths are considered more likely to be anomalies.

• This is because they were easier to isolate, suggesting they


deviate significantly from the bulk of the data.

• A threshold is set to define the anomaly score that separates


normal data points from anomalies.
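• A minimal sketch using scikit-learn's IsolationForest on illustrative data; the contamination value (the expected fraction of anomalies) and the sample points are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly "normal" 2-D points plus two obvious outliers.
rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X = np.vstack([X_normal, [[8.0, 8.0], [-9.0, 7.5]]])

# Build the forest of isolation trees and score every point.
clf = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = clf.fit_predict(X)        # +1 = normal, -1 = anomaly
scores = clf.decision_function(X)  # lower scores mean "easier to isolate"

print(np.where(labels == -1)[0])   # indices flagged as anomalies
```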
OUTLIER (ANOMALY) DETECTION
METHODS

140
OUTLIER DETECTION

• OUTLIER (Anomaly)

• These are the values in a dataset which stand out from the rest of the data.

• An outlier can be the result of a reading error, a fault in the system, manual error, or misleading data.

• Two simple methods for detecting outliers are:

• 1. Interquartile Range (IQR)

• 2. Z-Score
IQR ANOMALY
DETECTION METHOD

142
OUTLIER DETECTION

• IQR EXAMPLE

• Consider a dataset of 15 points.

• STEP-1: Arrange the data (n values) in ascending order.
OUTLIER DETECTION

• IQR EXAMPLE

• STEP-2: Calculate the quartiles Q1 and Q3 and the interquartile range IQR = Q3 - Q1.
OUTLIER DETECTION

• IQR EXAMPLE

• STEP-3: Define the thresholds: lower = Q1 - 1.5 × IQR, upper = Q3 + 1.5 × IQR.
OUTLIER DETECTION

• IQR EXAMPLE

• STEP-4: Identify outliers.

• Any point below the lower threshold (-19) or above the upper threshold (+64) is considered an outlier.

• We will therefore remove the first and last points of this dataset (-50 and 1456). (A code sketch of the IQR rule follows below.)
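• A small sketch of the IQR rule in Python; the dataset is illustrative, since the 15 values used on these slides are not reproduced here.

```python
import numpy as np

# Illustrative data: mostly moderate values plus two extreme ones.
data = np.array([-50, 3, 5, 8, 10, 12, 15, 18, 21, 25, 30, 35, 40, 45, 1456])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower threshold
upper = q3 + 1.5 * iqr   # upper threshold

outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)  # here -50 and 1456 fall outside the thresholds
```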
Z-SCORE ANOMALY
DETECTION METHOD

147
OUTLIER DETECTION

• Z-SCORE EXAMPLE

• Consider a dataset of daily sales revenue over 30 days.

• We need to identify any days where the sales revenue is significantly different from the other days, which may indicate an anomaly or outlier.
OUTLIER DETECTION

• Z-SCORE EXAMPLE

• STEP-1: Calculate the mean (μ) and standard deviation (σ) of the data.
OUTLIER DETECTION

• Z-SCORE EXAMPLE

• STEP-2: Calculate the z-score of each point: z = (x - μ) / σ.
OUTLIER DETECTION

• Z-SCORE EXAMPLE

• STEP-3: Set a threshold (commonly |z| > 3).
OUTLIER DETECTION

• Z-SCORE EXAMPLE

• STEP-4: Identify anomalies: any point whose |z| exceeds the threshold is an outlier.

• No outliers were found in the given data. (A code sketch of the z-score rule follows below.)
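• A matching sketch for the z-score method; the revenue values are illustrative and the threshold of 3 is the common default (the exercise below uses 5).

```python
import numpy as np

# Illustrative daily revenue figures: 20 ordinary days plus one suspicious spike.
revenue = np.array([200, 210, 220, 230, 240, 205, 215, 225, 235, 245,
                    202, 212, 222, 232, 242, 208, 218, 228, 238, 248, 900], dtype=float)

mean = revenue.mean()
std = revenue.std()

# STEP-2: z-score of every observation.
z = (revenue - mean) / std

# STEP-3/4: flag observations whose |z| exceeds the threshold.
threshold = 3
print(revenue[np.abs(z) > threshold])  # [900.] -> the spike is flagged
```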
IQR & Z-SCORE ANOMALY
DETECTION EXAMPLES

153
OUTLIER DETECTION

• IQR EXAMPLE

• Consider the following dataset and find the outliers:

• [10, 20, 15, -30, 22, 60, 45, -24, 55, 81]

• Z-SCORE EXAMPLE

• Consider the following dataset of daily transactions and find the outliers. Use a z-score threshold of 5:

• [100, 150, 120, 200, 250, 101, 230, 330, 450, 500]
