Module 2
Supervised Learning Algorithms
Contents
• Regression
- Linear
- Logistic
- Polynomial
• Classification
- KNN Classifier
- Decision Tree
- Random Forest
- SVM
Performance Metrics – Confusion Matrix
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the
  relationship between a dependent variable (y) and an independent
  variable (x) as an nth-degree polynomial.
• The Polynomial Regression equation is given below:
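The equation image from the slide is not available here, so the following is the commonly used general form, written with assumed coefficient names b_0 ... b_n:

```latex
y = b_0 + b_1 x + b_2 x^2 + \dots + b_n x^n
```

Setting n = 1 gives simple linear regression, y = b_0 + b_1 x, which is why the linear model is a special case of the polynomial one.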
Linear vs Polynomial
• The main steps involved in Polynomial Regression are given below:
K-Nearest Neighbors Algorithm (KNN)
• Intuition behind KNN Algorithm
Features
• The K-Nearest Neighbors (K-NN) algorithm is a versatile and widely used
  machine learning algorithm, valued for its simplicity and ease of
  implementation.
• It does not require any assumptions about the underlying data distribution.
• It can also handle both numerical and categorical data, making it a flexible
  choice for various types of datasets in classification and regression tasks.
• It is a non-parametric method that makes predictions based on the
  similarity of data points in a given dataset.
• With a suitably large K, K-NN is less sensitive to individual outliers,
  since each prediction is based on several neighbors rather than one point.
• The K-NN algorithm works by finding the K nearest neighbors to a
  given data point based on a distance metric, such as Euclidean
  distance.
• The class or value of the data point is then determined by the
  majority vote or average of the K neighbors.
• This approach allows the algorithm to adapt to different patterns and
  make predictions based on the local structure of the data.
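The neighbor search and majority vote described above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the function names `euclidean` and `knn_predict` and the toy points are made up for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = [(euclidean(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep only the k closest neighbors.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))
print(knn_predict(points, labels, (6.0, 6.5), k=3))
```

For a regression task, the vote would be replaced by the average of the neighbors' values.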
Distance Metrics Used in KNN Algorithm
• Euclidean Distance
• Manhattan Distance
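For two points p and q with n features each, the two metrics are usually defined as follows (the symbols p, q, and n are chosen here for illustration):

```latex
d_{\text{Euclidean}}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\qquad
d_{\text{Manhattan}}(p, q) = \sum_{i=1}^{n} |p_i - q_i|
```

For example, with p = (1, 2) and q = (4, 6), the Euclidean distance is 5 and the Manhattan distance is 7.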
• The K-NN algorithm compares a new data entry to the values in a
  given data set (with different classes or categories).
• Based on its closeness or similarities in a given range (K) of neighbors,
  the algorithm assigns the new data to a class or category in the data
  set (training data).
Steps in KNN Algorithm
KNN Example 1
• Since the value of K is 3, the algorithm will only consider the 3 nearest
  neighbors to the green point (new entry). This is represented in the
  graph above.
KNN Example 2
• Consider the following dataset
Assumptions
KNN Algorithm
Decision Tree
• Decision trees are a key tool in machine learning.
• The model predicts outcomes based on input data through a tree-like
  structure.
• They offer interpretability, versatility, and simple visualization,
  making them valuable for both classification and regression tasks.
Concept
• It is a tree-like structure where
- each internal node tests an attribute,
- each branch corresponds to an attribute value, and
- each leaf node represents the final decision or prediction.
• While decision trees have advantages like ease of understanding,
  they may face challenges such as overfitting.
• Understanding their terminologies and formation process is essential
  for effective application in diverse scenarios.
• Decision trees are drawn upside down: the root is at the top
  and is then split into several nodes.
• In layman's terms, a decision tree is a bunch of if-else statements.
• It checks whether a condition is true and, if it is, goes to the next
  node attached to that decision.
Example 1:
• Here, the tree first asks: what is the weather? Is it sunny, cloudy,
  or rainy?
• Depending on the answer, it moves to the next feature, which is
  humidity or wind.
• It then checks whether the wind is strong or weak; if the wind is weak
  and it is rainy, the person may go and play.
• If the weather is cloudy, the answer is always to go and play.
Why didn’t it split more? Why did it stop there?
• In simple terms, the output in the training dataset is always yes for
  cloudy weather. Since there is no disorder here, we don’t need to
  split the node further.
• Entropy, information gain, and the Gini index are the measures used
  to decide this formally.
• The goal is to decrease the uncertainty or disorder in the dataset,
  and a decision tree does this with each split.
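The weather example can be read as nested if-else rules. The sketch below is only an illustration: the function name `will_play` is made up, and the rules for sunny days and for strong wind are assumptions borrowed from the classic play-tennis dataset, since the slides only state the cloudy rule and the rainy, weak-wind rule.

```python
def will_play(weather, humidity=None, wind=None):
    """Return True if the person goes to play, following the tree as if-else rules."""
    if weather == "cloudy":
        # Pure node: every cloudy example in the training data is "yes".
        return True
    if weather == "rainy":
        # Rainy days are decided by the wind: weak wind means play.
        # ASSUMPTION: strong wind means no play (implied, not stated in the slides).
        return wind == "weak"
    if weather == "sunny":
        # ASSUMPTION: sunny days are decided by humidity (normal -> play),
        # as in the classic play-tennis dataset.
        return humidity == "normal"
    raise ValueError(f"unknown weather: {weather!r}")

print(will_play("cloudy"))
print(will_play("rainy", wind="weak"))
print(will_play("rainy", wind="strong"))
```

The cloudy branch returns immediately without testing another feature, which is the code equivalent of a node that needs no further split.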
Questions
• How do I know what the root node should be?
• What should the decision nodes be?
• When should I stop splitting?
• To decide this, there is a metric called “Entropy”, which is the amount
  of uncertainty in the dataset.
• The Decision Tree algorithm works in the following simple steps:
• Starting at the Root: The algorithm begins at the top, called the “root
  node,” representing the entire dataset.
• Asking the Best Questions: It looks for the most important feature or
  question that splits the data into the most distinct groups. This is like
  asking a question at a fork in the tree.
• Branching Out: Based on the answer to that question, it divides the
  data into smaller subsets, creating new branches. Each branch
  represents a possible route through the tree.
• Repeating the Process: The algorithm continues asking questions and
  splitting the data at each branch until it reaches the final “leaf
  nodes,” representing the predicted outcomes or classifications.
Entropy
• Entropy is the uncertainty in our dataset, that is, a measure of
  disorder.
• Examples to understand the concept of Entropy
Example 1
Left node Entropy
• Feature 2:
Right Node Entropy
• For Feature 3:
• The left node has lower entropy (more purity) than the right node,
  since the left node has a greater proportion of “yes” and it is easier
  to decide here.
• The higher the entropy, the lower the purity and the higher
  the impurity.
• The goal is to decrease the uncertainty or impurity in the dataset.
  Entropy gives us the impurity of a particular node,
• but on its own it does not tell us whether the entropy of the parent
  node has decreased after a split.
• For this we use a new metric called “Information gain”, which tells us
  how much the parent entropy has decreased after splitting it on some
  feature.
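The slide's entropy formula is not available here; the sketch below uses the standard Shannon definition, H = -sum(p_i * log2(p_i)) over the classes in a node. The function name `entropy` and the example counts are illustrative only.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a class with no instances contributes nothing
        p = count / total
        result -= p * math.log2(p)
    return result

print(entropy([5, 5]))   # evenly mixed node: maximum disorder for two classes
print(entropy([9, 1]))   # mostly one class: lower entropy, higher purity
print(entropy([10, 0]))  # pure node: no disorder
```

A node dominated by one class scores lower than an evenly mixed one, which matches the purity comparison between the left and right nodes.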
Information Gain
• Information gain measures the reduction of uncertainty given some
  feature and it is also a deciding factor for which attribute should be
  selected as a decision node or root node.
• It is the entropy of the full dataset minus the entropy of the dataset
  given some feature.
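Written out, with S for the data at the parent node, A for the feature, and S_v for the subset where A takes the value v (symbols chosen here for illustration), the usual definition is:

```latex
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

The second term is the entropy of each child node weighted by the fraction of instances that fall into it.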
Example
• Suppose our entire population has a total of 30 instances.
• The dataset is used to predict whether a person will go to the gym or
  not. Let’s say 16 people go to the gym and 14 people don’t.
• Feature 1 is “Energy”, which takes two values: “high” (13 instances)
  and “low” (17 instances).
• Feature 2 is “Motivation” which takes 3 values “No motivation”,
  “Neutral” and “Highly motivated”.
• Use Decision Tree
• Use Information gain to decide which feature should be the root
  node and which feature should be placed after the split.
• Using Feature 1
- Calculate Entropy
- Calculate Information Gain
• Entropy and Information Gain
• The parent entropy was near 0.99, and the information gain from
  splitting on “Energy” is about 0.37.
• Conclusion: the entropy of the dataset will decrease by 0.37 if we make
  “Energy” our root node.
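The calculation can be reproduced with a short sketch. The parent counts (16 and 14) and the branch sizes (13 and 17) come from the text; the per-branch class counts below are assumed values chosen to be consistent with those totals, because the slide's figure is not available. The function names are illustrative.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted

parent = [16, 14]       # from the text: 16 go to the gym, 14 do not
# ASSUMED counts [go, don't go] per branch; only the sizes 13 and 17 are given.
energy_high = [12, 1]
energy_low = [4, 13]

parent_entropy = entropy(parent)
gain = information_gain(parent, [energy_high, energy_low])
print(round(parent_entropy, 3), round(gain, 3))
```

With these assumed counts the parent entropy is close to 1 and the gain is about 0.38 at full precision. Rounding the intermediate values to two decimals (0.99 - 0.62) gives the 0.37 quoted in the text.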
• Feature 2
• Conclusions:
• The “Energy” feature gives a larger reduction (0.37) than the
  “Motivation” feature. We select the feature with the highest
  information gain and split the node on that feature.
• “Energy” will be our root node, and we do the same for the
  sub-nodes. Here we can see that when the energy is “high” the
  entropy is low, so we can say a person will most likely go to
  the gym if they have high energy.
• But what if the energy is low? We again split the node on a new
  feature, which is “Motivation”.
Pruning
• Pruning is another method that can help us avoid overfitting. It helps
  in improving the performance of the Decision tree by cutting the
  nodes or sub-nodes which are not significant. Additionally, it removes
  the branches which have very low importance.
• There are mainly 2 ways for pruning:
• Pre-pruning – we can stop growing the tree earlier, which means we
  can prune/remove/cut a node if it has low importance while
  growing the tree.
• Post-pruning – once our tree is built to its full depth, we can start
  pruning the nodes based on their significance.
Example 3
SVM (Support Vector Machine)
⦿ Concept
⦿ Types
⦿ Linear
⦿ Non-linear
⦿ Use of Dot products
⦿ Examples
⦿ Kernel in SVM
Concept
• SVM is a powerful supervised algorithm that works best on smaller
  but complex datasets.
• It can be used for both regression and classification tasks, but it
  generally works best on classification problems.
• It is a supervised machine learning algorithm in which we try to find a
  hyperplane that best separates the two classes.
• Don’t confuse SVM with logistic regression.
• Both algorithms try to find the best hyperplane, but the main
  difference is that logistic regression is a probabilistic approach,
  whereas the support vector machine is based on a statistical approach.
• SVM answers questions like:
- Which hyperplane does it select?
- There can be an infinite number of hyperplanes that classify the two
  classes perfectly.
- So, which one is the best?
• Depending on the number of features you have, you can choose either
  Logistic Regression or SVM.
• SVM works best when the dataset is small and complex.
• It is advisable to first use logistic regression and see how it performs;
  if it fails to give good accuracy, you can go for SVM without any
  kernel.
• Logistic regression and SVM without any kernel have similar
  performance, but depending on your features, one may be more
  efficient than the other.
Types of SVM
• Linear SVM: used when the data is perfectly linearly separable, which
  means that the data points can be classified into 2 classes by using a
  single straight line (in 2D).
• Non-Linear SVM: used when the data is not linearly separable, which
  means that the data points cannot be separated into 2 classes by a
  straight line (in 2D). We then use advanced techniques such as the
  kernel trick to classify them.
• In most real-world applications we do not find linearly separable
  data points, so we use the kernel trick to solve them.
Important Definitions
• Support Vectors: These are the points that are closest to the
  hyperplane. A separating line will be defined with the help of these
  data points.
• Margin: It is the distance between the hyperplane and the
  observations closest to the hyperplane (support vectors). In SVM, a
  large margin is considered a good margin. There are two types of
  margins: hard margin and soft margin.
Example – Linear SVM
• We want to classify the new data point as either blue or green.
• To classify these points, we can have many decision boundaries, but
  the question is which is the best and how do we find it?
• The best hyperplane is the plane that has the maximum distance
  from both classes, and finding it is the main aim of SVM.
• SVM considers the different hyperplanes that classify the labels
  correctly and chooses the one that is farthest from the data points,
  that is, the one with the maximum margin.
How does it work?
⦿ Identify Cat or Dog?
⦿ Support Vectors :
⦿ Linear SVM : Hyperplane
⦿ Non-linear SVM example
Example – Non-linear SVM
⦿ Finding equation for SV :
⦿ Final Classification result
Use of Dot Product in SVM
• The dot product can be defined as the projection of one vector along
  another, multiplied by the magnitude of the other vector.
• Consider a random point X and we want to know whether it lies on
  the right side of the plane or the left side of the plane (positive or
  negative).
• Treat this point as a vector (X), and take a vector (w) that is
  perpendicular to the hyperplane. Let’s say the distance from the origin
  to the decision boundary along w is ‘c’. Now we take the projection of
  the X vector on w.
• Criteria for classification based on the dot product:
- The projection of one vector on another is given by the dot product
  (with w taken as a unit vector), so we take the dot product of the x
  and w vectors.
• If the dot product is greater than ‘c’, the point lies on the right side.
• If the dot product is less than ‘c’, the point lies on the left side.
• If the dot product is equal to ‘c’, the point lies on the decision
  boundary.
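The three cases can be checked with a small sketch. It normalizes w so that the dot product really is the projection length. The function name `side_of_boundary`, the tolerance, and the example vectors are assumptions made for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def side_of_boundary(x, w, c, tol=1e-9):
    """Compare the projection of x onto the direction of w with the distance c."""
    norm_w = math.sqrt(dot(w, w))
    projection = dot(x, w) / norm_w  # length of x along the unit vector w/|w|
    if abs(projection - c) <= tol:
        return "on boundary"
    return "right (positive)" if projection > c else "left (negative)"

# Hypothetical setup: w points along the first axis, so the boundary is x1 = 2.
w = (1.0, 0.0)
c = 2.0

print(side_of_boundary((3.0, 1.0), w, c))
print(side_of_boundary((1.0, 5.0), w, c))
print(side_of_boundary((2.0, -4.0), w, c))
```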
Margin in Support Vector Machine
• To classify a point as negative or positive we need to define a decision
  rule.
• The equation of a hyperplane is w.x+b=0 where w is a vector normal
  to hyperplane and b is an offset.
• If the value of w.x+b>0 then we can say it is a positive point otherwise
  it is a negative point.
• We need (w,b) such that the margin is as wide as possible. Let’s
  say this distance is ‘d’.
• To calculate ‘d’ we need the equations of L1 and L2.
• For this, we make a few assumptions:
• the equation of L1 is w.x+b=1, and
• the equation of L2 is w.x+b=-1.
• Why are the magnitudes equal? Why didn’t we take 1 and -2?
• Why did we take only 1 and -1, and not other values like 24 and
  -100?
• Why did we assume these lines?
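With L1 and L2 written this way, the distance between them follows from a short derivation. The points x_+ and x_- are assumed names for any point on L1 and on L2 respectively.

```latex
\begin{aligned}
w \cdot x_{+} + b &= 1 \quad (x_{+} \text{ on } L_1) \\
w \cdot x_{-} + b &= -1 \quad (x_{-} \text{ on } L_2) \\
w \cdot (x_{+} - x_{-}) &= 2 \\
d = \frac{w}{\lVert w \rVert} \cdot (x_{+} - x_{-}) &= \frac{2}{\lVert w \rVert}
\end{aligned}
```

Maximizing d is therefore the same as minimizing the length of w, and choosing 1 and -1 only fixes the scale of (w,b) without changing the hyperplane.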
Example:
• Let’s say the equation of our hyperplane is 2x+y=2
• Create margin for this hyperplane,
Summary
• If you multiply these equations by 10,
- the parallel lines (red and green) get closer to our hyperplane.
• If we divide these equations by 10,
- these parallel lines move farther from the hyperplane.
• The parallel lines depend on (w,b) of our hyperplane.
• If we multiply the equation of the hyperplane by a factor greater than
  1, the parallel lines move closer together.
• If we multiply by a factor less than 1, they move farther apart.
• These lines move as we change (w,b), and this is how the margin
  gets optimized.
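This scaling effect can be checked numerically for the example hyperplane 2x+y=2, written as 2x + y - 2 = 0. The sketch assumes the margin lines are w.x+b=1 and w.x+b=-1, so each lies at distance 1/|w| from the hyperplane. The function name `margin_line_distance` is made up for the example.

```python
import math

def margin_line_distance(w, scale=1.0):
    """Distance from (scale*w).x + scale*b = 0 to the line (scale*w).x + scale*b = 1."""
    norm = math.sqrt(sum((scale * wi) ** 2 for wi in w))
    return 1.0 / norm

w = (2.0, 1.0)  # normal vector of the hyperplane 2x + y - 2 = 0

for scale in (1.0, 10.0, 0.1):
    # Scaling (w, b) leaves the hyperplane itself unchanged but moves the margin lines.
    print(scale, margin_line_distance(w, scale))
```

Multiplying by 10 shrinks the distance by a factor of 10, and dividing by 10 enlarges it by the same factor.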
SVM Error
• SVM Error = Margin Error + Classification Error.
• The larger the margin, the lower the margin error, and vice
  versa.
• The hyperparameter ‘c’ sets how much weight the classification error
  gets relative to the margin error (it is unrelated to the distance ‘c’
  used in the dot-product section).
• A high value of ‘c’ (for example, c = 1000) means that you don’t want to
  focus on margin error and just want a model which doesn’t misclassify
  any data point.
• Which is a better model?
- the one where the margin is maximum and has 2 misclassified points
  or
- the one where the margin is very small and all the points are correctly
  classified?
• Increase ‘c’ to decrease the classification error, but
• if you want the margin to be maximized, then the value of
  ‘c’ should be minimized.
• That’s why ‘c’ is a hyperparameter, and we search for the optimal value
  of ‘c’.
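One common way to write this trade-off is the soft-margin objective, 0.5*|w|^2 + c * (sum of hinge losses). The sketch below assumes that the slide's ‘c’ is this usual regularization constant and that labels are +1 or -1. The function name `svm_error` and the data points are hypothetical.

```python
def svm_error(w, b, X, y, c):
    """Soft-margin objective: margin error + c * classification (hinge) error."""
    # Margin error: small when |w| is small, i.e. when the margin is wide.
    margin_error = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wi * xij for wi, xij in zip(w, xi)) + b
        # Zero for points on the correct side and outside the margin.
        hinge += max(0.0, 1.0 - yi * score)
    return margin_error + c * hinge

# Hypothetical data: the third point sits on the wrong side of the boundary.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, -1]
w, b = (1.0, 0.0), 0.0

print(svm_error(w, b, X, y, c=1.0))
print(svm_error(w, b, X, y, c=1000.0))
```

With a large ‘c’ the single misclassified point dominates the total, so an optimizer would accept a narrower margin to fix it. With a small ‘c’ the margin term matters more.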
Kernels in SVM
• Need
• Solution:
• Convert the lower-dimensional space to a higher-dimensional space
  using some functions (for example, quadratic ones), which allows us to
  find a decision boundary that clearly divides the data points.
• The functions that help us do this are called Kernels.
• Which kernel to use is determined by hyperparameter tuning.
• Use of Kernel
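As a small illustration of the idea, the sketch below uses one particular quadratic kernel, K(x, z) = (x . z)^2, chosen here as an example and not taken from the slides. It shows that the kernel value computed in the original 2-D space equals a dot product in a 3-D space, without building the higher-dimensional vectors during training. The function names are illustrative.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def feature_map(x):
    """Explicit quadratic map from 2-D to 3-D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, z):
    """Kernel value computed directly in the original 2-D space: (x . z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)  # hypothetical points

# Both expressions give the same number (up to floating-point rounding).
print(quadratic_kernel(x, z))
print(dot(feature_map(x), feature_map(z)))
```

Points that cannot be separated by a straight line in 2-D (for example, an inner cluster surrounded by a ring) can become separable by a plane after such a mapping.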
Evaluation Metrics for Classification:
Confusion Matrix
• Machine learning models are increasingly used in various applications
  to classify data into different categories.
• However, evaluating the performance of these models is crucial to
  ensure their accuracy and reliability.
• One essential tool in this evaluation process is the confusion matrix.
• A confusion matrix is a matrix that summarizes the performance of a
  machine learning model on a set of test data.
• It is a means of displaying the number of accurate and inaccurate
  instances based on the model’s predictions.
• It is often used to measure the performance of classification models,
  which aim to predict a categorical label for each input instance.
• The matrix displays the number of true positives, true negatives,
  false positives, and false negatives produced by the model on the
  test data.
Metrics based on Confusion Matrix Data
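The slide's list of metrics is not available here. A minimal sketch of four widely used ones (accuracy, precision, recall, F1 score) computed from the four cells of a binary confusion matrix is given below. The function name `binary_metrics` and the counts are hypothetical and not taken from the slides.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a dog / not-dog classifier.
m = binary_metrics(tp=5, tn=3, fp=1, fn=1)
for name, value in m.items():
    print(name, round(value, 3))
```

Precision answers "of the images predicted as dog, how many were dogs?", while recall answers "of the actual dog images, how many were found?".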
Confusion Matrix For binary classification
• A 2X2 confusion matrix is shown below for an image recognition task
  that labels each image as Dog or Not Dog.
• Example: Confusion Matrix for Dog Image Recognition
  with Numbers
• Confusion Matrix
Confusion Matrix For Multi-class Classification
• In multi-class classification, you have more than two possible classes
  for your model to predict. The confusion matrix expands to
  accommodate these additional classes.
• Rows: Represent the actual classes (ground truth) in your dataset.
• Columns: Represent the predicted classes by your model.
• Each cell within the matrix shows the count of instances where the
  model predicted a particular class (column) when the actual class was
  the one given by the row. Cells on the diagonal are correct
  predictions; all other cells are misclassifications.
• A 3X3 confusion matrix is shown below for an image classifier with
  three classes.
• Example: Confusion Matrix for Image Classification (Cat, Dog, Horse)
• True Positive (TP): The image was of a particular animal (cat,
  dog, or horse), and the model correctly predicted that animal.
  For example, a picture of a cat correctly identified as a cat.
• False Negative (FN): The image was of a particular animal, but
  the model incorrectly predicted it as a different animal. For
  example, a picture of a dog mistakenly identified as a cat.
• In this scenario:
- Cats: 8 were correctly identified, 1 was misidentified as a dog, and 1
  was misidentified as a horse.
- Dogs: 10 were correctly identified, 2 were misidentified as cats.
- Horses: 8 were correctly identified, 2 were misidentified as dogs.
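The counts listed in this scenario can be arranged as a 3X3 matrix (rows are actual classes, columns are predicted classes) and used to compute per-class precision and recall. The sketch below is illustrative; the function name `per_class_metrics` is made up for the example.

```python
classes = ["cat", "dog", "horse"]

# Rows = actual class, columns = predicted class, using the counts listed in the scenario.
matrix = [
    [8, 1, 1],    # actual cats:   8 cat, 1 dog, 1 horse
    [2, 10, 0],   # actual dogs:   2 cat, 10 dog
    [0, 2, 8],    # actual horses: 2 dog, 8 horse
]

def per_class_metrics(matrix, index):
    """Return (precision, recall) for the class at `index`."""
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                 # rest of the row
    fp = sum(row[index] for row in matrix) - tp  # rest of the column
    return tp / (tp + fp), tp / (tp + fn)

# Overall accuracy: correct predictions (diagonal) over all predictions.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / sum(map(sum, matrix))

for i, name in enumerate(classes):
    precision, recall = per_class_metrics(matrix, i)
    print(name, round(precision, 3), round(recall, 3))
print("accuracy", accuracy)
```

Here 26 of the 32 images lie on the diagonal, so the overall accuracy is 26/32.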