Data Analytics (IT-3006)
Kalinga Institute of Industrial Technology
             Deemed to be University
              Bhubaneswar-751024
        School of Computer Engineering
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
3 Credit                                           Lecture Note – Unit 5
        Course Contents
2
    Sr # Major and Detailed Coverage Area                  Hrs
    5    Classification & Clustering                       7
         Introduction to classification and clustering,
         Distance-Based Algorithms: K Nearest Neighbor
         (KNN), Decision Tree-Based Algorithms: Decision
         Tree (ID3 Algorithm), Support Vector Machines
         (Linear), Naves Bayes.
         Overview of Clustering Techniques, Hierarchical
         Clustering, Partitioning Methods, K-Means
         Algorithm.
                        School of Computer Engineering
3
    School of Computer Engineering
Classification
4
   Classification is a task that involves separating data points into different classes i.e.,
    assigning a class label to each data point instance in a dataset based on its features.
    The goal of classification is to build a model that accurately predicts the class labels
    of new instances based on their features.
   There are two main types of classification: binary classification and multi-class
    classification. Binary classification involves classifying instances into two classes,
    such as “spam” or “not spam”, while multi-class classification involves classifying
    instances into more than two classes.
                           School of Computer Engineering
Clustering
5
   Cluster is a group of objects that belongs to the same class. In other words, similar
    objects are grouped in one cluster and dissimilar objects are grouped in another
    cluster.
   In clustering (also called cluster analysis), a group of different data objects is
    classified as similar objects. One group means a cluster of data. Data sets are
    divided into different groups in the cluster analysis, which is based on the
    similarity of the data.
   After the classification of data into various groups, a label is assigned to the group.
    It helps in adapting to the changes by doing the classification.
                          School of Computer Engineering
Classification and Clustering
6
 Both Classification and Clustering is used for the categorization of objects into
  one or more classes based on the features. They appear to be a similar process as
  the basic difference is minute.
 In the case of Classification, there are predefined labels assigned to each input
  instance according to their features whereas in clustering those labels are missing.
    Parameter         Classification                      Clustering
    Basic             Process of classifying the input Grouping the instances based on
                      instances    based       on their their similarity without the help
                      corresponding class labels        of class labels
    Need              It has labels so there is need of There is no need of training and
                      training and testing dataset for testing dataset
                      verifying the model created
    Complexity        More complex as compared to         Less complex as compared to
                      clustering                          classification
    Prior knowledge   Yes (The number of classes are      No (The number of classes is
    of classes        known)                              unknown)
                          School of Computer Engineering
Classification and Clustering Example
7
       Predefined classes        Identifies similarities among objects
               School of Computer Engineering
8
    School of Computer Engineering
The distance-based algorithms
9
   The distance-based algorithms are used to measure the distance between data
    points.
   Distance measures play an important role in classification and clustering i.e., it
    provide the foundations for many popular and effective algorithms like KNN (K-
    Nearest Neighbors) for classifications and K-Means clustering for clustering.
   Different distance measures can be chosen and used depending on the types of
    data. A distance measure is an objective score that summarizes the relative
    difference between two objects in a problem domain. Most commonly, the two
    objects are rows of data that describes a subject (such as a person, car, or house),
    or an event (such as purchases, a claim, or a diagnosis)
   If the distance among two data points is small then there is a high degree of
    similarity among the objects and vice versa.
                          School of Computer Engineering
Distance Measurement Example
10
         School of Computer Engineering
Distance Measure
11
     From these Euclidean distance is more preferable as it gives equal important to
                                      each feature.
                          School of Computer Engineering
KNN
12
    The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric and lazy
     learning algorithm, which uses proximity to make classifications about the grouping of an
     individual data point.
         Lazy learning algorithm − It does not have a specialized training phase and uses all the
          data for training while classification.
         Non-parametric learning algorithm − It doesn’t assume anything about the underlying
          data.
    The k value in the k-NN algorithm defines how many neighbors will be checked to determine
     the classification of a specific query point.
    If k is set to 1, the instance will be assigned to the same class as its single nearest neighbor. If k
     is set to 5, the classes of 5 closest points are checked.
    Defining k can be a balancing act as different values can lead to over fitting or under fitting.
     Lower values of k can have high variance, but low bias, and larger values of k may lead to high
     bias and lower variance.
    The choice of k will largely depend on the input data as data with more outliers or noise will
     likely perform better with higher values of k.
    Overall, it is recommended to have an odd number for k to avoid ties in classification, and
     cross-validation tactics can help to choose the optimal k for the dataset.
                               School of Computer Engineering
Working of KNN
13
        School of Computer Engineering
Working of KNN Example
14
Consider a new data point that need to
   put it in the required category
                                             1. Firstly, choose the number of
                                             neighbors, so let k=5.
                                             2. Next, calculate the Euclidean
                                             distance between the data points.
                                             The Euclidean distance is the
                                             distance between two points.
                                             3. By calculating the Euclidean
                                             distance, the nearest neighbors
                                             are derived i.e., three nearest
                                             neighbors in category A and two
                                             nearest neighbors in category B.
                         School of Computer Engineering
Working of KNN Example cont…
15
                             As 3 nearest neighbors are from
                             category A, hence this new data
                             point must belong to category A.
        School of Computer Engineering
KNN Example
16
Let given is the data from the questionnaires survey with two attributes (X1: Acid
durability and X2: Strength) to classify whether a special paper tissue is good or not.
Classify the new paper tissue with features: X1=3 and X2=7.
           X1: Acid Durability          X2: Strength            Y: Class
                     7                        7                   Bad
                     7                        4                   Bad
                     3                        4                  Good
                     1                        4                  Good
                          School of Computer Engineering
KNN Solution
17
Step 1: Initialize and Define K
 K=1 Over fitting Problem
 Number of tuples is even, hence consider K as odd.
 Let K = 3 for this problem
Step 2: Compute distance between input sample and training samples using Euclidian
distance measure.
     X1: Acid Durability    X2: Strength    Y: Class     Distance from (3, 7)
             7                    7           Bad       sqrt((7-3)2 + (7-7)2)=4
             7                    4           Bad       sqrt((7-3)2 + (4-7)2)=5
             3                    4          Good       sqrt((3-3)2 + (4-7)2)=3
             1                    4          Good      sqrt((1-3)2 + (4-7)2)=3.6
                            School of Computer Engineering
KNN Solution cont…
18
Step 3: Determine K nearest neighbor those are at minimum distance and rank them.
           X1:         X2: Strength   Y: Class      Distance from        Rank
     Acid Durability                                    (3, 7)
           7                7           Bad               4                3
           7                4           Bad               5                4
           3                4           Good              3                1
           1                4           Good             3.6               2
 Step 4: Apply Simple Majority
 There are 2 “Good” and 1 “Bad”. Thus the new paper tissue will be classified as Good
 due to simple majority.
                           School of Computer Engineering
KNN Exercise
19
Given is the customer detail as customer’s age, income, credit card no. and their
corresponding class. Find the class label for the new data item (reflecting in red font):
Customer: RRR, Age: 37, Income: 50K, Cr Card: 2, Class- ?
     Customer        Age           Income          No. of Credit           Class
                                                      Cards
       XXX           35              35K                    3               No
       YYY           22              50K                    2               Yes
       ZZZ           63             200K                    1               No
       PPP           59             170K                    1               No
       QQQ           25              40K                    4               Yes
       RRR           37              50K                    2               ????
                           School of Computer Engineering
20
     School of Computer Engineering
Decision Tree
21
    A decision tree is a structure that contains nodes (rectangular boxes) and edges(arrows) and is built from a
     dataset.
    Each node is either used to make a decision (known as decision node) or represent an outcome (known as
     leaf node).
    The picture depicts below represent a decision tree that is used to classify whether a person is Fit or Unfit.
    The decision nodes here are questions like ‘Is the person less than 30 years of age?’, ‘Does the person eat
     junk?’, etc. and the leaves are one of the two possible outcomes viz. Fit and Unfit.
    Looking at the Decision Tree, following decisions can be looked at: if a person is less than 30 years of age
     and doesn’t eat junk food then he is Fit, if a person is less than 30 years of age and eats junk food then he is
     Unfit and so on.
    The initial node is called the root node (colored in blue), the final nodes are called the leaf nodes (colored
     in green) and the rest of the nodes are called intermediate or internal nodes. The root and intermediate
     nodes represent the decisions while the leaf nodes represent the outcomes.
                                  School of Computer Engineering
Decision Tree-Based Algorithms
22
     Dataset
               School of Computer Engineering
Decision Tree-Based Algorithms cont…
23
 Decision Tree-based algorithms
                        School of Computer Engineering
ID3 Algorithm
24
 ID3 stands for Iterative Dichotomiser 3 and is named such because the
  algorithm iteratively (i.e., repeatedly) dichotomizes(i.e., divides) features
  into two or more groups at each step.
 It is a classification algorithm that follows a top-down greedy approach by
  selecting a best features that yields maximum Information Gain(IG) or
  minimum Entropy(H).
 The top-down approach means start building the tree from the top and the
  greedy approach means that at each iteration, select the best feature at the
  present moment to create a node.
 Entropy is a measure of the amount of uncertainty in the dataset S i.e.,
   H(S) = ∑c ∈ C − p (c) log2 p(c), where:
   S - The current dataset for which entropy is being calculated.
   C - Set of classes in S {example of C ={yes, no}}
   p(c) - The proportion of the number of elements in class c to the number
   of elements in set S.
                       School of Computer Engineering
ID3 Algorithm cont…
25
Entropy calculation for the following sample (S):
          Sample class    Sample count
          Yes             9
          No              5
C ={Yes, No}
Proportion of the number of elements in class Yes i.e., p(Yes) = 9/(9+5) =9/14
Proportion of the number of elements in class No i.e., p(No) = 5/(9+5) =5/14
H(S) = -(9/14 * log₂(9/14) + (5/14) * log₂(5/14)) = 0.94
                         School of Computer Engineering
Information Gain
26
Information Gain IG(A, S) tells how much uncertainty in the dataset S was
reduced after splitting S on feature A i.e.,
    IG(A , S) = H(S) − I(A) where I(A) = ∑ t∈T p(t) * H(t) and
    H(S) - Entropy of dataset S
    T - The subsets created from splitting dataset S by feature A
    p(t) - The proportion of the number of elements in t to the number of
    elements in dataset S.
    H(t) - Entropy of subset t.
    I(A) - Average Entropy Information for feature A
                     School of Computer Engineering
Information Gain Calculation
27
                              Sr No   Regular           Need Tutor
                              1       Yes               Yes
                              2       Yes               No
                              3       Yes               Yes
Consider the dataset          4       Yes               No
for prediction of
                              5       No                Yes
student needs tutor
or not.                       6       No                No
                              7       No                Yes
                              8       No                No
                              9       Yes               No
                              10      No                No
                              11      No                No
                              12      No                No
                       School of Computer Engineering
Information Gain Calculation
28
Complete entropy of dataset is (4 out of 12 are Yes and 8 out of 12 are No) -
H(S) = - p(YES) * log2(p(YES)) - p(no) * log2(p(no))
     = - (4/12) * log2(4/12) - (8/12) * log2(8/12)
     = - (-0.527) - (-0.395)
     = 0.922
Information Gain for Regular:
 Sr No     Regular          Need Tutor                             Sr No      Regular         Need Tutor
                                                                   5          No              Yes
 1         Yes              Yes
                                                                   6          No              No
 2         Yes              No
                                                                   7          No              Yes
 3         Yes              Yes                                    8          No              No
 4         Yes              No                                     10         No              No
                                                                   11         No              No
 9         Yes              No
                                                                   12         No              No
 5 rows with Yes value out of 12 for which
 there are 2 Yes in the target column and 3 No   7 rows with No value out of 12 for which there are 2 Yes in the target
 in the target column.                           column and 5 No in the target column.
                                     School of Computer Engineering
Information Gain Calculation
29
H(regular=YES) = -(2/5)*log2(2/5)-(3/5)*log2(3/5)
               = - (- 0.5288) - (- 0.4422) = 0.971
H(regular=NO) = -(2/7)*log2(2/7)-(5/7)*log2(5/7)
              = -(-0.516) - (- 0.347) = 0.863
I(Regular) = p(YES) * H(Regular=YES) + p(NO) * H(Regular=No)
           = (5/12)*0.971 + (7/12)*0.863 = 0.907
Information Gain (S, Regular)= H(S) - I(Regular)
                             = 0.983 - 0.907 = 0.076
                       School of Computer Engineering
Steps in ID3 algorithm
30
1.   Calculate entropy for dataset.
2.   For each feature:
     2.1. Calculate entropy for all its categorical values.
     2.2. Calculate information gain for the feature.
3.   Find the feature with maximum information gain.
4.   Repeat it until desired tree is obtained.
                         School of Computer Engineering
ID3 Example
31
                       Day   Outlook    Temperature   Humidity   Wind     Play Tennis
                       1     Sunny      Hot           High       Weak     No
                       2     Sunny      Hot           High       Strong   No
                       3     Overcast   Hot           High       Weak     Yes
                       4     Rain       Mild          High       Weak     Yes
                       5     Rain       Cool          Normal     Weak     Yes
Using ID3 algorithm    6     Rain       Cool          Normal     Strong   No
to build a Decision    7     Overcast   Cool          Normal     Strong   Yes
Tree to predict the
                       8     Sunny      Mild          High       Weak     No
weather
                       9     Sunny      Cool          Normal     Weak     Yes
                       10    Rain       Mild          Normal     Weak     Yes
                       11    Sunny      Mild          Normal     Strong   Yes
                       12    Overcast   Mild          High       Strong   Yes
                       13    Overcast   Hot           Normal     Weak     Yes
                       14    Rain       Mild          High       Strong   No
                      School of Computer Engineering
ID3 Example cont…
32
Step 1: Entropy of dataset is:
H(S) = - p(Yes) * log2(p(Yes)) - p(No) * log2(p(No))
     = - (9/14) * log2(9/14) - (5/14) * log2(5/14)
     = - (-0.41) - (-0.53) = 0.94
Step 2: Calculate entropy for all its categorical values and information gain for the
features.
First feature – Outlook
Categorical values - sunny, overcast and rain
H(Outlook=sunny) = -(2/5)*log(2/5)-(3/5)*log(3/5) =0.971
H(Outlook=rain) = -(3/5)*log(3/5)-(2/5)*log(2/5) =0.971
H(Outlook=overcast) = -(4/4)*log(4/4)-0 = 0
Average Entropy Information for Outlook-
I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) + p(overcast) *
H(Outlook=overcast) = (5/14)*0.971 + (5/14)*0.971 + (4/14)*0 = 0.693
Information Gain(S, Outlook) = H(S) - I(Outlook) = 0.94 - 0.693 = 0.247
                          School of Computer Engineering
ID3 Example cont…
33
Step 2: Calculate entropy for all its categorical values and information gain for the
features.
Second feature – Temperature
Categorical values - hot, mild, cool
H(Temperature=hot) = -(2/4)*log(2/4)-(2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4)-(1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6)-(2/6)*log(2/6) = 0.9179
Average Entropy Information for Temperature -
I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) +
p(cool)*H(Temperature=cool) = (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811 = 0.9108
Information Gain(S, Temperature) = H(S) - I(Temperature) = 0.94 - 0.9108 = 0.0292
                         School of Computer Engineering
ID3 Example cont…
34
Step 2: Calculate entropy for all its categorical values and information gain for the
features.
Third feature - Humidity
Categorical values - high, normal
H(Humidity=high) = -(3/7)*log(3/7)-(4/7)*log(4/7) = 0.983
H(Humidity=normal) = -(6/7)*log(6/7)-(1/7)*log(1/7) = 0.591
Average Entropy Information for Humidity -
I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.983 + (7/14)*0.591 = 0.787
Information Gain (S, Humidity) = H(S) - I(Humidity) = 0.94 - 0.787 = 0.153
                          School of Computer Engineering
ID3 Example cont…
35
Step 2: Calculate entropy for all its categorical values and information gain for the
features.
Fourth feature – Wind
Categorical values - weak, strong
H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1
Average Entropy Information for Wind -
I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1 = 0.892
Information Gain (S, Wind) = H(S) - I(Wind) = 0.94 - 0.892 = 0.048
                          School of Computer Engineering
ID3 Example cont…
36
Step 3: Find the feature with maximum information gain.
Here, the feature with maximum information gain is Outlook. So, the decision tree built
so far is as follows.
Here, when Outlook = overcast, it is of pure class(Yes).
Now, the same procedure to be repeated for the data with rows consist of Outlook value
as Sunny and then for Outlook value as Rain.
                          School of Computer Engineering
ID3 Example cont…
37
Step 4: The same process to be repeated until the tree is obtained
Now, finding the best feature for splitting the data with Outlook=Sunny values
Entropy of Sunny is -
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
     = - (2/5) * log2(2/5) - (3/5) * log2(3/5)
     = 0.971
First feature – Temperature with categorical values - hot, mild, cool
H(Sunny, Temperature=hot) = -0-(2/2)*log(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log(1)- 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1
Average Entropy Information for Temperature -
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny,
mild)*H(Sunny, Temperature=mild) + p(Sunny, cool)*H(Sunny, Temperature=cool) =
(2/5)*0 + (1/5)*0 + (2/5)*1 = 0.4
Information Gain (Sunny, Temperature) = H(Sunny) - I(Sunny, Temperature) = 0.971 - 0.4
= 0.571
                          School of Computer Engineering
ID3 Example cont…
38
Step 4: The same process to be repeated until the tree is obtained
Second feature - Humidity with categorical values - high, normal
H(Sunny, Humidity=high) = - 0 - (3/3)*log(3/3) = 0
H(Sunny, Humidity=normal) = -(2/2)*log(2/2)-0 = 0
Average Entropy Information for Humidity -
I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high)               +   p(Sunny,
normal)*H(Sunny, Humidity=normal) = (3/5)*0 + (2/5)*0 = 0
Information Gain (Sunny, Humidity)= H(Sunny) - I(Sunny, Humidity) = 0.971 - 0 = 0.971
                          School of Computer Engineering
ID3 Example cont…
39
Step 4: The same process to be repeated until the tree is obtained
Third feature - Wind with categorical values - weak, strong
H(Sunny, Wind=weak) = -(1/3)*log(1/3)-(2/3)*log(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1
Average Entropy Information for Wind -
I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)*H(Sunny,
Wind=strong) = (3/5)*0.918 + (2/5)*1 = 0.9508
Information Gain (Sunny, Wind) = H(Sunny) - I(Sunny, Wind) = 0.971 - 0.9508 = 0.0202
                          School of Computer Engineering
ID3 Example cont…
40
Step 4: The same process to be repeated until the tree is obtained.
The feature with maximum information gain is Humidity. So, the decision tree built so far
is as follows. Here, when Outlook = Sunny and Humidity = High, it is a pure class of
category "no". And When Outlook = Sunny and Humidity = Normal, it is again a pure class
of category "yes". Therefore, we don't need to do further calculations.
                          School of Computer Engineering
ID3 Example cont…
41
Step 4: The same process to be repeated until the tree is obtained.
Now, finding the best feature for splitting the data with Outlook=Rain values
Entropy of Rain is -
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
     = - (3/5) * log(3/5) - (2/5) * log(2/5)
     = 0.971
First feature – Temperature with categorical values - mild, cool
H(Rain, Temperature=cool) = -(1/2)*log(1/2)- (1/2)*log(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log(2/3)-(1/3)*log(1/3) = 0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild)                  +   p(Rain,
cool)*H(Rain, Temperature=cool) = (2/5)*1 + (3/5)*0.918 = 0.9508
Information Gain (Rain, Temperature) = H(Rain) - I(Rain, Temperature) = 0.971 - 0.9508
= 0.0202
                           School of Computer Engineering
ID3 Example cont…
42
Step 4: The same process to be repeated until the tree is obtained.
Second feature - Wind with categorical values - weak, strong
H(Wind=weak) = -(3/3)*log(3/3)-0 = 0
H(Wind=strong) = 0-(2/2)*log(2/2) = 0
Average Entropy Information for Wind -
I(Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain, Wind=strong)
= (3/5)*0 + (2/5)*0 = 0
Information Gain (Rain, Wind)= H(Rain) - I(Rain, Wind) = 0.971 – 0 = 0.971
                           School of Computer Engineering
ID3 Example cont…
43
Step 4: The same process to be repeated until the tree is obtained.
Here, the feature with maximum information gain is Wind. So, the decision tree built so
far is as follows. Here, when Outlook = Rain and Wind = Strong, it is a pure class of
category "no". When Outlook = Rain and Wind = Weak, it is again a pure class of category
"yes". This is final desired tree for the given dataset.
                           School of Computer Engineering
ID3 Class Exercise
44
                       Age         Income   Student    Credit Rating   Buy Computer
                       <=30        High     No         Fair            No
                       <=30        High     No         Excellent       No
                       31 ... 40   High     No         Fair            Yes
                       >40         Medium   No         Fair            Yes
                       >40         Low      Yes        Fair            No
Use ID3 algorithm      >40         Low      Yes        Excellent       Yes
to build a Decision    31 ... 40   Low      Yes        Excellent       Yes
Tree to predict buy
                       <=30        Medium   No         Fair            No
computer
                       <=30        Low      Yes        Fair            Yes
                       >40         Medium   Yes        Fair            Yes
                       <=30        Medium   Yes        Excellent       Yes
                       31 ... 40   Medium   No         Excellent       Yes
                       31 ... 40   High     Yes        Fair            Yes
                       >40         Medium   No         Excellent       No
                      School of Computer Engineering
45
     School of Computer Engineering
Support Vector Machines
46
Support Vector Machine (SVM) is an algorithm which can be used for both classification
or regression challenges. In the SVM algorithm, each data item is plotted as a point in n-
dimensional space (where n is number of features) with the value of each feature being
the value of a particular coordinate. Then, classification is performed by finding the
hyperplane that differentiates the two classes.
What is hyperplane?
A hyperplane is a generalization of a plane.
   In one dimension, it is called a point.
   In two dimensions, it is a line.
   In three dimensions, it is a plane.
   In more dimensions one can call it an hyperplane.
The following figure represents datapoint in one dimension and the point L is a
separating hyperplane.
                           School of Computer Engineering
SVM cont…
47
How does SVM works?
    Let’s imagine we have two tags: red and blue, and our
     data has two features: x and y. We want a classifier that,
     given a pair of (x, y) coordinates, outputs if it’s either red
     or blue. We plot our already labeled training data on a
     plane.
 A support vector machine takes these data points and
  outputs the hyperplane (which in two dimensions it’s
  simply a line) that best separates the tags. This line is the
  decision boundary: anything that falls to one side of it
  we will classify as blue, and anything that falls to the
  other as red.
 In 2D, the best hyperplane is simply a line.
                            School of Computer Engineering
SVM cont…
48
But, what exactly is the best hyperplane? For SVM, it’s the one that maximizes the
margins from both tags. In other words: the hyperplane (in 2D, it's a line) whose
distance to the nearest element of each tag is the largest.
                        School of Computer Engineering
SVM cont…
49
(Scenario-1) Identification of the right hyperplane: Here, we have three hyperplanes (A,
B and C). Now, the job is to identify the right hyperplane to classify star and circle.
Remember a thumb rule - identify the right hyperplane: “Select the hyperplane which
segregates the two classes better”. In this scenario, hyperplane “B” has excellently
performed this job.
                          School of Computer Engineering
SVM cont…
50
(Scenario-2) Identify the right hyperplane: Here, we have three hyperplanes (A, B and
C) and all are segregating the classes well. Now, How can we identify the right
hyperplane?
Here, maximizing the distances between nearest data point (either class) and hyperplane
will help to decide the right hyperplane. This distance is called as margin.
                         School of Computer Engineering
SVM cont…
51
Below, you can see that the margin for hyperplane C is high as compared to both A and B.
Hence, we name the right hyperplane as C. Another lightning reason for selecting the
hyperplane with higher margin is robustness. If we select a hyperplane having low
margin then there is high chance of misclassification.
                          School of Computer Engineering
SVM cont…
52
(Scenario-3) Identify the right hyperplane: Use the rules as discussed in previous
section to identify the right hyperplane.
Some of you may have selected the hyperplane B as it has higher margin compared to A.
But, here is the catch, SVM selects the hyperplane which classifies the classes accurately
prior to maximizing margin. Here, hyperplane B has a classification error and A has
classified all correctly. Therefore, the right hyperplane is A.
                          School of Computer Engineering
SVM cont…
53
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as
one of the stars lies in the territory of other(circle) class as an outlier.
 One star at other end is like an outlier for star class. The SVM algorithm has a feature to
 ignore outliers and find the hyperplane that has the maximum margin. Hence, we can say,
 SVM classification is robust to outliers.
                           School of Computer Engineering
SVM cont…
54
(Scenario-5) Find the hyperplane to segregate to
classes: In the scenario, we can’t have linear
hyperplane between the two classes, so how does
SVM classify these two classes? Till now, we have
only looked at the linear hyperplane.
 SVM can solve this problem. Easily! It solves this
 problem by introducing additional feature. Here,
 we can add a new feature z=x2+y2. Now, plotting
 the data points on axis x and z:
     In above plot, points to consider are:
      All values for z would be positive as z is the squared sum of both x and y.
      In the original plot, red circles appear close to the origin of x and y axes,
         leading to lower value of z and star relatively away from the origin result to
         higher value of z.
                            School of Computer Engineering
SVM cont…
55
 In the SVM classifier, it is easy to have a linear
  hyperplane between these two classes. But, another
  burning question which arises is, should we need to
  add this feature manually to have a hyperplane. No,
  the SVM algorithm has a technique called the
  kernel trick. The SVM kernel takes low dimensional
  input space and transforms it to a higher
  dimensional space i.e., it converts non-separable
  problem to separable problem. It is mostly useful in
  non-linear separation problem. Simply put, it does
  some extremely complex data transformations, then
  finds out the process to separate the data based on
  the labels or outputs you’ve defined.
 When we look at the hyperplane in original input
  space it looks like a circle.
                       School of Computer Engineering
Types of SVM
56
 Linear SVM: Linear SVM is used for data that are linearly separable i.e. for a
  dataset that can be categorized into two categories by utilizing a single
  straight line. Such data points are termed as linearly separable data, and the
  classifier is used described as a linear SVM classifier.
 Non-linear SVM: Non-Linear SVM is used for data that are non-linearly
  separable data i.e. a straight line cannot be used to classify the dataset. For
  this, we use something known as a kernel trick that sets data points in a
  higher dimension where they can be separated using planes or other
  mathematical functions. Such data points are termed as non-linear data, and
  the classifier used is termed as a non-linear SVM classifier.
                        School of Computer Engineering
57
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described
as:
 Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
    independent of the occurrence of other features. Such as if the fruit is identified on the bases
    of color, shape, and taste, then red, spherical, and sweet fruit is recognized as an apple. Hence
    each feature individually contributes to identify that it is an apple without depending on each
    other.
 Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
                              School of Computer Engineering
Naïve Bayes Classifier
58
 Naive Bayes classifier (NBC)        DAY
  is a classification algorithm. It
  share a principle, i.e. every
  pair of features being classified
  is independent of each other.
 Consider the problem of
  playing golf, and the dataset is
  shown on right.
 The need is to classify
  whether the day is suitable for
  playing golf, given the features
  of the day. The columns
  represent these features and
  the rows represent individual
  observations.
                             School of Computer Engineering
Naïve Bayes Classifier cont…
59
    If we take the first observation of the dataset, we can observe that is not suitable for playing golf
     if the outlook is rainy, temperature is hot, humidity is high and it is not windy.
    We make two assumptions. The first, these predictors are independent i.e. if the temperature is
     hot, it does not necessarily mean that the humidity is high. Another assumption is that all the
     predictors have an equal effect on the outcome i.e. the day being windy does not have more
     importance in deciding to play golf or not.
    The Bayes theorem can be written as P(y|X) = (P(X|y) * P(y)) / P(X) where the variable y is the
     dependent variable(play golf), which represents if it is suitable to play golf or not given the
     conditions. X represent the independent variables (outlook, temperature, humidity and windy).
    X is given as X = (x1, x2, …., xn) where represent the feature and mapped to outlook, temperature,
     humidity and windy.
    By substituting for X and expanding using the chain rule we get:
    Now, the values for each can be obtained by looking at the dataset and substitute them into the
     equation. For all entries in the dataset, the denominator does not change, it remain static.
Note: Bayes Theorem describes the probability of an event, based on prior knowledge of conditions
that might be related to the event.
                               School of Computer Engineering
Naïve Bayes Classifier cont…
60
Using the above function, we can obtain the class, given the predictors.
 P(Y) = 9/ 14 and P(N) = 5/14 where Y stands for Yes and N stands for No.
 The outlook probability is: P(sunny | Y) = 2/9, P(overcast | Y) = 4/9, P(rain | Y) = 3/9,
   P(sunny | N) = 3/5, P(overcast | N) = 0, P(rain | N) = 2/5
 The temperature probability is: P(hot | Y) = 2/9, P(mild | Y) = 4/9, P(cool | Y) = 3/9,
   P(hot | N) = 2/5, P(mild | N) = 2/5, P(cool | N) = 1/5
 The humidity probability is: P(high | Y) = 3/9, P(normal | Y) = 6/9, P(high | N) = 4/5,
   P(normal | N) = 2/5.
 The windy probability is: P(true | Y) = 3/9, P(false | Y) = 6/9, P(true | N) = 3/5, P(false
   | N) = 2/5
Now it is to predict “play golf” today with the conditions: <outlook = sunny; temperature
= cool; humidity = high; windy = true>, so today = (sunny, cool, high, true)
P(Y | today) = P(sunny | Y) * P(cool | Y) * P(high | Y) * P(true | Y) * P(Yes) = 2/9 * 3/9 *
3/9 * 3/9 * 9/ 14 = 0.005
P(N | today) = P(sunny | N) * P(cool | N) * P(high | N) * P(true | N) * P(N) = 3/5 * 1/5 *
4/5 * 3/5 * 5/14 = 0.020
Since, the probability of No is the larger, we can predict today to play golf be No.
                           School of Computer Engineering
Types of Naïve Bayes Classifier
61
 Multinomial Naive Bayes: This is mostly used for document classification
  problem, i.e. whether a document belongs to the category of sports, politics,
  technology etc. The features/predictors used by the classifier are the
  frequency of the words present in the document.
 Bernoulli Naive Bayes: This is similar to the multinomial naive bayes but
  the predictors are boolean variables. The parameters that we use to predict
  the class variable take up only values yes or no, for example if a word occurs
  in the text or not.
 Gaussian Naive Bayes: The predictors take up a continuous value and are
  not discrete.
                       School of Computer Engineering
Naïve Bayes Classifier cont…
62
Pros
   It is easy and fast to predict class of test data set. It also perform well in multi class
    prediction.
  When assumption of independence holds, a Naive Bayes classifier performs better
    compare to other models like logistic regression and you need less training data.
  It perform well in case of categorical input variables compared to numerical
    variable(s). For numerical variable, normal distribution is assumed (bell curve, which
    is a strong assumption).
 Cons
  The assumption of independent predictors. In real life, it is almost impossible to get a
   set of predictors which are completely independent.
  If categorical variable has a category (in test data set), which was not observed in
   training data set, then model will assign a 0 (zero) probability and will be unable to
   make a prediction. This is often known as “Zero Frequency”. To solve this, we can use
   the smoothing technique. One of the simplest smoothing techniques is called Laplace
   estimation.
                           School of Computer Engineering
Class Exercise
63
     DAY
                                   Use the Naïve Bayes Classifier
                                   to determine whether a day
                                   (Sunny, Hot, Normal, False)
                                   should be classified as a
                                   defaulted to play golf or not.
           School of Computer Engineering
Portion
64
          School of Computer Engineering
Clustering
65
 Clustering is the task of dividing the data points into a number of groups such
  that data points in the same groups are more similar to other data points in
  the same group and dissimilar to the data points in other groups.
 It is basically a collection of objects on the basis of similarity and
  dissimilarity between them.
 Following is an example of finding clusters of data points based on their
  income and debt.
                        School of Computer Engineering
Clustering Example
66
Imagine of a number of objects in a basket. Each item has a distinct set of
features (size, shape, color, etc.). Now the task is to group each of the objects in
the basket. A natural first question to ask is, ‘on what basis these objects should
be grouped?’ Perhaps size, shape, or color.
                         School of Computer Engineering
Clustering Example cont…
67
The data points clustered together can be classified into one single group. The
clusters can be distinguished, and can identify that there are 3 clusters.
                       School of Computer Engineering
Clustering Types
68
Clustering can be divided into two subgroups:
 Hard Clustering: each data point either belongs to a cluster completely or
   not.
 Soft Clustering: Instead of putting each data point into a separate cluster, a
   probability or likelihood of that data point to be in those clusters is assigned.
                         School of Computer Engineering
Clustering Application
69
    Marketing: It can be used to characterize & discover customer segments for
     marketing purposes.
    Biology: It can be used for classification among different species of plants and
     animals.
    Libraries: It is used in clustering different books on the basis of topics and
     information.
    Insurance: It is used to acknowledge the customers, their policies and identifying the
     frauds.
    City Planning: It is used to make groups of houses and to study their values based on
     their geographical locations and other factors present.
    Earthquake studies: By learning the earthquake affected areas, the dangerous zones
     can be determined.
    Healthcare: It can be used in identifying and classifying the cancerous gene.
    Search Engine: It is the backbone behind the search engines. Search engines try to
     group similar objects in one cluster and the dissimilar objects far from each other.
    Education: It can be used to monitor the students' academic performance. Based on
     the students' score they are grouped into different-different clusters, where each
     clusters denoting the different level of performance.
                           School of Computer Engineering
Types of clustering algorithms
70
Since the task of clustering is subjective, the means that it can be used for achieving this
goal are plenty. Every methodology follows a different set of rules for defining the
‘similarity’ among data points. The popular types are:
 Connectivity models  Distribution models
 Centroid models            Density models
Connectivity models: These models are based on the notion that the data points closer
in data space exhibit more similarity to each other than the data points lying farther away.
These models can follow two approaches. In the first approach, they start with classifying
all data points into separate clusters & then aggregating them as the distance decreases.
In the second approach, all data points are classified as a single cluster and then
partitioned as the distance increases. Also, the choice of distance function is subjective.
These models are very easy to interpret but lacks scalability for handling big datasets.
Examples of these models are hierarchical clustering algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which the notion of
similarity is derived by the closeness of a data point to the centroid of the clusters. K-
Means clustering algorithm is a popular algorithm that falls into this category. In these
models, the no. of clusters required at the end have to be mentioned beforehand, which
makes it important to have prior knowledge of the dataset.
                           School of Computer Engineering
Types of clustering algorithms cont…
71
Distribution models: These clustering models are based on the notion of how probable
is it that all data points in the cluster belong to the same distribution. These models often
suffer from overfitting. A popular example of these models is Expectation-maximization
algorithm which uses multivariate normal distributions.
Density Models: These models search the data space for areas of varied density of data
points in the data space. It isolates various different density regions and assign the data
points within these regions in the same cluster. Popular examples of density models are
DBSCAN and OPTICS.
 Summary
 Connectivity based: Create a hierarchical decomposition of the set of data using
  some criterion.
 Centroid based: Construct various partitions and then evaluate them by some
  criterion.
 Distribution based: A model is hypothesized for each of the clusters and the idea is
  to find the best fit of that model to each other.
 Density based: Based on connectivity and density functions.
                            School of Computer Engineering
Hierarchical Clustering
72
Let’s say we have the below points and we want to cluster them into groups.
We can assign each of these points to a separate cluster.
Now, based on the similarity of these clusters, the most similar clusters combined
together and this process is repeated until only a single cluster is left.
 We are essentially building a hierarchy of clusters. That’s why this algorithm is
 called hierarchical clustering.
                           School of Computer Engineering
Hierarchical Clustering Example
73
Suppose a faculty wants to divide the students into different groups. The faculty has the
marks scored by each student in an assignment and based on these marks, he/she wants
to segment them into groups. There’s no fixed target as to how many groups to have.
Since the faculty does not know what type of students should be assigned to which
group, it cannot be solved as a supervised learning problem. So, hierarchical clustering
is applied to segment the students into different groups. Let’s take a sample of 5
students.
 Roll No      Mark             Creating a Proximity Matrix
 1            10              First, a proximity matrix to be created which tell the
                              distance between each of these points (marks). Since the
 2            7
                              distance is calculated of each point from each of the other
 3            28              points, a square matrix of shape n X n (where n is the
 4            20              number of observations) is obtained. Let’s make the 5 x 5
                              proximity matrix for the example.
 5            35
                           School of Computer Engineering
Hierarchical Clustering Example cont…
74
             Proximity Matrix
             Roll No    1          2          3         4          5
             1          0          3          18        10         25
             2          3          0          21        13         28
             3          18         21         0         8          7
             4          10         13         8         0          15
             5          25         28         7         15         0
The diagonal elements of this matrix is always 0 as the distance of a point with itself is
always 0. The Euclidean distance formula is used to calculate the rest of the distances. So,
to calculate the distance between
Point 1 and 2: √(10-7)^2 = √9 = 3
Point 1 and 3: √(10-28)^2 = √324 = 18 and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.
                             School of Computer Engineering
Steps to Perform Hierarchical Clustering
75
Step 1: First, all the points to an individual cluster is assigned. Different colors here
represent different clusters. Hence, 5 different clusters for the 5 points in the data.
Step 2: Next, look at the smallest distance in the proximity matrix and merge the points
with the smallest distance. Then the proximity matrix is updated.
             Roll No   1          2         3          4         5
             1         0          3         18         10        25
             2         3          0         21         13        28
             3         18         21        0          8         7
             4         10         13        8          0         15
             5         25         28        7          15        0
Here, the smallest distance is 3 and hence point 1 and 2 is merged.
                            School of Computer Engineering
Steps to Perform Hierarchical Clustering cont…
76
Let’s look at the updated clusters and accordingly update the proximity matrix. Here, we
have taken the maximum of the two marks (7, 10) to replace the marks for this cluster.
Instead of the maximum, the minimum value or the average values can also be
considered.
                               Roll No      Mark
                               (1, 2)       10
                               3            28
                               4            20
                               5            35
 Now, the proximity matrix for these clusters is calculated again.
                          School of Computer Engineering
Steps to Perform Hierarchical Clustering cont…
77
             Revised Proximity Matrix
             Roll No    1, 2      3         4          5
             1, 2       0         18        10         25
             3          18        0         8          7
             4          10        8         0          15
             5          25        7         15         0
Step 3: Step 2 is repeated until only a single cluster is left. So, look at the minimum
distance in the proximity matrix and then merge the closest pair of clusters. We will get
the merged clusters after repeating these steps:
                             School of Computer Engineering
Steps to Perform Hierarchical Clustering cont…
78
We started with 5 clusters and finally have a single cluster.
                           School of Computer Engineering
Types of Hierarchical Clustering
79
There are mainly two types of hierarchical clustering:
 Agglomerative hierarchical clustering
 Divisive Hierarchical clustering
Agglomerative Hierarchical Clustering
    Each point is assigned to an individual cluster in this technique. Suppose there are 4
     data points, so each of these points would be assigned to a cluster and hence there
     would be 4 clusters in the beginning.
    Then, at each iteration, closest pair of clusters are merged and this step is repeated
     until only a single cluster is left.
                          School of Computer Engineering
Types of Hierarchical Clustering cont…
80
Divisive Hierarchical clustering
    Divisive hierarchical clustering works in the opposite way. Instead of starting with n
     clusters (in case of n observations), we start with a single cluster and assign all the
     points to that cluster. So, it doesn’t matter if we have 10 or 1000 data points. All
     these points will belong to the same cluster at the beginning.
    Now, at each iteration, farthest point in the cluster is split and this process is
     repeated until each cluster only contains a single point.
                           School of Computer Engineering
Dendrogram
81
    To get the number of clusters for hierarchical clustering, we make use of the concept
     called a Dendrogram.
    A dendrogram is a tree-like diagram that records the sequences of merges or splits.
    Let’s get back to faculty-student example. Whenever we merge two clusters, a
     dendrogram record the distance between these clusters and represent it in graph
     form.
                                                    We have the samples of the dataset
                                                    on the x-axis and the distance on the
                                                    y-axis. Whenever two clusters are
                                                    merged, we will join them in this
                                                    dendrogram and the height of the
                                                    join will be the distance between
                                                    these points.
                           School of Computer Engineering
Dendrogram cont…
82
                                           We started by merging sample 1 and 2 and the
                                           distance between these two samples was 3.
                                           Let’s plot this in the dendrogram.
     Here, we can see that we have merged sample 1 and 2. The vertical line represents the
     distance between these samples.
                             School of Computer Engineering
Dendrogram cont…
83
Similarly, we plot all the steps where we merged the clusters and finally, we get a
dendrogram like this:
 We can clearly visualize the steps of hierarchical clustering. More the distance of the
 vertical lines in the dendrogram, more the distance between those clusters.
                         School of Computer Engineering
Dendrogram cont…
84
Now, we can set a threshold distance and draw a horizontal line (Generally, the threshold
is set in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and
draw a horizontal line:
  The number of clusters will be the number of vertical lines which are being intersected
  by the line drawn using the threshold. In the above example, since the red line intersects
  2 vertical lines, we will have 2 clusters. One cluster will have a sample (1,2,4) and the
  other will have a sample (3,5).
                             School of Computer Engineering
Hierarchical Clustering closeness of two clusters
85
The decision of merging two clusters is taken on the basis of closeness of these
clusters. There are multiple metrics for deciding the closeness of two clusters
and primarily are:                             Manhattan distance
 Euclidean distance                           Maximum distance
 Squared Euclidean distance                   Mahalanobis distance
                       School of Computer Engineering
Portion
86
          School of Computer Engineering
Partitioning Methods
87
    Partitioning is used to make solving mathematics problems involving large
     numbers easier by separating them into smaller units. For example, 782 can be
     partitioned into: 700 + 80 + 2. It helps kids see the true value of each digit.
     Rather than seeing 782 as an intimidating number, they'll see it as, 700, 80 and
     2. Therefore, given n objects or data items, a partitioning method constructs k
     partitions of the data, where each partition represents a cluster and k <= n.
    That is, it classifies the data into k groups, which together satisfy the following
     requirements:
         Each group must contain at least one object
         Each object must belong to exactly one group
    Given k, the number of partitions to construct, a partitioning method creates an
     initial partition. It uses iterative relocation technique that attempts to improve
     the partitioning by moving objects from one group to another.
    The general criterion of a good partitioning is that objects in the same cluster
     are “close” or related to each other, whereas objects of different clusters are “far
     apart” or very different.
                          School of Computer Engineering
Partitioning Methods cont…
88
K-Medoids and K-Means are two types of clustering mechanisms in Partition
Clustering.
   The k-means algorithm, where each cluster is represented by the mean value of
    the data points in the cluster. The center of a cluster is not necessarily one of the
    input data points.
   The k-medoids algorithm, where each cluster is represented by one of the data
    points (medoid) located near the center of the cluster. A medoid can be defined
    as data points of a cluster, whose average dissimilarity to all the objects in the
    cluster is minimal i.e. it is a most centrally located point in the given dataset.
                          School of Computer Engineering
Portion
89
          School of Computer Engineering
K- Means Clustering Algorithm
90
    It is an iterative algorithm that divides the unlabeled dataset into k different
     clusters in such a way that each dataset belongs only one group that has similar
     properties.
    It is a centroid-based algorithm, where each cluster is associated with a
     centroid. The main aim of this algorithm is to minimize the sum of distances
     between the data point and their corresponding clusters. A centroid is a data
     point (imaginary or real) at the center of a cluster.
    The algorithm takes the unlabeled dataset as input, divides the dataset into k-
     number of clusters, and repeats the process until it does not find the best
     clusters. The value of k should be predetermined in this algorithm.
    The k-means clustering algorithm mainly performs two tasks:
      Determines the best value for k center points or centroids by an iterative
          process.
      Assigns each data point to its closest k-center. Those data points which are
          near to the particular k-center, create a cluster.
    Hence each cluster has datapoints with some commonalities, and it is away
     from other clusters.
                         School of Computer Engineering
K- Means Algorithm cont…
91
     The below diagram explains the working of the K-means Clustering Algorithm:
                       School of Computer Engineering
How does the K-Means Algorithm Work?
92
1.   Begin
2.   Step-1: Select the number K to decide the number of clusters.
3.   Step-2: Select random K points or centroids. (It can be other from the input
     dataset).
4.   Step-3: Assign each data point to their closest centroid, which will form the
     predefined K clusters.
5.   Step-4: Calculate the variance and place a new centroid of each cluster.
6.   Step-5: Repeat the third steps, which means reassign each datapoint to the new
     closest centroid of each cluster.
7.   Step-6: If any reassignment occurs, then go to step-4 else go to step-7.
8.   Step-7: The model is ready.
9.   End
                         School of Computer Engineering
Working of K-Means Algorithm
93
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables is
given below:
                           School of Computer Engineering
Working of K-Means Algorithm cont…
94
    Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them
     into different clusters. It means here we will try to group these datasets into two
     different clusters.
    We need to choose some random K points or centroid to form the cluster. These
     points can be either the points from the dataset or any other point. So, here we
     are selecting the below two points as K points, which are not the part of dataset.
     Consider the below image:
                          School of Computer Engineering
Working of K-Means Algorithm cont…
95
    Now we will assign each data point of the scatter plot to its closest K-point or
     centroid. We will compute it by calculating the distance between two points. So,
     we will draw a median between both the centroids. Consider the below image:
                         School of Computer Engineering
Working of K-Means Algorithm cont…
96
    From the image, it is clear that points left side of the line is near to the K1 or
     blue centroid, and points to the right of the line are close to the yellow centroid
     i.e. K2. Let's color them as blue and yellow for clear visualization.
                          School of Computer Engineering
Working of K-Means Algorithm cont…
97
    As we need to find the closest cluster, so we will repeat the process by choosing
     a new centroid. To choose the new centroids, we will compute the center of
     gravity of these centroids, and will find new centroids as below:
                         School of Computer Engineering
Working of K-Means Algorithm cont…
98
    Next, we will reassign each datapoint to the new centroid. For this, we will
     repeat the same process of finding a median line. The median will be like below
     image:
 From the above image, we can see, one yellow point is on the left side of the line, and
 two blue points are right to the line. So, these three points will be assigned to new
 centroids.
                         School of Computer Engineering
Working of K-Means Algorithm cont…
99
    As reassignment has taken place, so we will again go to the step-4, which is
     finding new centroids or K-points.
                       School of Computer Engineering
Working of K-Means Algorithm cont…
100
     We will repeat the process by finding the center of gravity of centroids, so the
      new centroids will be as shown in the below image:
                          School of Computer Engineering
Working of K-Means Algorithm cont…
101
     As we got the new centroids so again will draw the median line and reassign the
      data points. So, the image will be:
                          School of Computer Engineering
Working of K-Means Algorithm cont…
102
     We can see in the previous image; there are no dissimilar data points on either
      side of the line, which means our model is formed. Consider the below image:
                          School of Computer Engineering
Working of K-Means Algorithm cont…
103
     As our model is ready, so we can now remove the assumed centroids, and the
      two final clusters will be as shown in the below image:
                         School of Computer Engineering
How to Determine the optimal K for K-Means?
104
How to determine the optimal value of K in K-Means clustering? There are 2
methods:
 The Elbow Method
 The Silhouette Method
                      School of Computer Engineering
105
      School of Computer Engineering