MLT Unit 2 Notes
SUPERVISED LEARNING
1.Introduction
Supervised learning is a type of machine learning in which machines are trained using well "labelled"
training data, and on the basis of that data, machines predict the output. Labelled data means the input
data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor that teaches the
machine to predict the output correctly. It applies the same concept as a student learning under the
supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type
of data. Once the training process is completed, the model is tested on test data (labelled data held out
from training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and
Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the
shape on the basis of the number of sides, and predicts the output.
Steps Involved in Supervised Learning:
o First, determine the type of training dataset.
o Split the training dataset into training dataset, test dataset, and validation dataset.
o Determine the input features of the training dataset, which should have enough knowledge so that
the model can accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as control
parameters, which are a subset of the training dataset.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct
output, then our model is accurate (a minimal end-to-end sketch of these steps follows this list).
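As a concrete illustration of these steps, here is a minimal sketch in Python using scikit-learn. The toy dataset from make_classification, the 60/20/20 split ratios, and the choice of a decision tree are assumptions made only for illustration.

# Hypothetical end-to-end supervised learning workflow (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Obtain a labelled dataset (toy data here)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 2. Split into training, validation and test sets (60/20/20)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

# 3-4. Choose a suitable algorithm and execute it on the training data
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# 5. Check the model on the validation set, then evaluate accuracy on the test set
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))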
Types of Supervised Machine Learning Algorithms:
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two or
more classes such as Yes-No, Male-Female, True-False, etc. A typical example is spam filtering. Below are
some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
Advantages of Supervised learning:
o With the help of supervised learning, the model can predict the output on the basis of prior
experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud detection,
spam filtering, etc.
Disadvantages of Supervised learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training
dataset.
2. GENERATIVE AND DISCRIMINATIVE ALGORITHMS
Common examples of discriminative algorithms include:
o k-nearest neighbors (k-NN)
o Logistic regression
o Decision Trees
o Random Forest
Let’s assume our task is to determine the language of a text document. How do we solve this task with the
help of machine learning?
We can learn each language and then determine the language. This is how generative models work.
Alternatively, we can learn just the linguistic differences and common patterns of languages without
actually learning the language. This is the discriminative approach. In this case, we don’t speak any
language.
In other words, discriminative algorithms focus on how to distinguish cases. Hence, they focus on
learning a decision boundary. On the other hand, generative algorithms learn the fundamental properties
of the data and how to generate it from scratch:
The generative approach focuses on modeling, whereas the discriminative approach focuses on a
solution. So, we can use generative algorithms to generate new data points. Discriminative algorithms
don’t serve that purpose.
Still, discriminative algorithms generally perform better for classification tasks. That’s because they
focus on solving the actual problem directly instead of solving a more general problem first.
Yet, the real strength of generative algorithms lies in their ability to express complex relationships
between variables. In other words, they have explanatory power. As a result, they have successful use
cases in NLP and medicine.
On the other hand, discriminative algorithms feel like black boxes, without the ability to express their
decision boundaries in simple terms. The relationships between variables are not explicitly explainable, so
we cannot visualize them easily.
Besides, generative models are suited to solve unsupervised learning tasks, as well as supervised learning
tasks, since they have predictive ability. Discriminative models require labeled datasets and can’t deduce
from a context. Consequently, generative models have more comprehensive applications in anomaly
detection and monitoring areas.
Moreover, generative algorithms converge faster than discriminative algorithms. Thus, we prefer
generative models when we have a small training dataset.
Even though the generative models converge faster, they converge to a higher asymptotic error. On the
contrary, the discriminative models converge to a smaller asymptotic error. So, as the number of
training examples increases, the error rate decreases for the discriminative models.
To summarize, generative and discriminative algorithms each have their own strengths and weaknesses.
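To make the distinction concrete, the sketch below fits a generative model (Gaussian Naive Bayes, which models how each class generates its features) and a discriminative model (logistic regression, which models the decision boundary directly) on the same toy data. The dataset and the two model choices are assumptions for illustration only.

# Generative vs discriminative classifiers on the same data (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models P(x | class) and P(class)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(class | x) directly

X, y = make_classification(n_samples=500, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

generative = GaussianNB().fit(X_tr, y_tr)
discriminative = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("Naive Bayes accuracy:        ", generative.score(X_te, y_te))
print("Logistic regression accuracy:", discriminative.score(X_te, y_te))

# The generative model can also score how likely a point is under each class,
# which is the basis of its use in anomaly detection.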
3. LINEAR REGRESSION
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Since linear regression shows a linear
relationship, it finds how the value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:
y = a0 + a1x + ε
Here, y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the intercept of
the line, a1 is the linear regression coefficient (the slope of the line), and ε is the random error term.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression: a single independent variable is used to predict the value of a numerical
dependent variable.
o Multiple Linear Regression: more than one independent variable is used to predict the value of a
numerical dependent variable.
A straight line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship: the dependent variable increases as the independent variable
increases (the line slopes upward).
o Negative Linear Relationship: the dependent variable decreases as the independent variable
increases (the line slopes downward).
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line that means the error
between predicted values and actual values should be minimized. The best fit line will have the least
error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so
we need to calculate the best values for a0 and a1 to find the best fit line. To calculate this, we use a cost
function.
Cost function-
o Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, and the cost function is used to estimate the values of the coefficients for the best fit
line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as the Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the
squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) * Σ (yi - (a0 + a1xi))²
Where N is the total number of observations, yi is the actual value of the i-th observation, and
(a0 + a1xi) is the predicted value for the i-th observation.
Residuals: The distance between an actual value and the corresponding predicted value is called a residual.
If the observed points are far from the regression line, then the residuals will be large and so the cost
function will be high. If the scatter points are close to the regression line, then the residuals will be small
and hence the cost function will be small.
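The sketch below shows how the MSE cost for a candidate line y = a0 + a1x could be computed with NumPy; the small x and y arrays and the candidate coefficients are made-up sample values.

# Computing the MSE cost for a candidate regression line (illustrative values)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # actual values

a0, a1 = 0.0, 2.0                           # candidate intercept and slope
y_pred = a0 + a1 * x                        # predicted values
residuals = y - y_pred                      # distance between actual and predicted values
mse = np.mean(residuals ** 2)               # average of the squared errors
print("MSE:", mse)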
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o It is done by randomly selecting initial values of the coefficients and then iteratively updating
them to reach the minimum of the cost function.
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below method:
1. R-squared method:
o It measures the strength of the relationship between the dependent and independent variables on a
scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the actual
values and hence represents a good model (a small computation sketch follows this list).
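A minimal way to compute R-squared for a fitted line is sketched below; it uses made-up data and the standard definition R² = 1 - SS_res / SS_tot.

# R-squared for a simple fitted line (illustrative data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

a1, a0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
y_pred = a0 + a1 * x

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print("R^2:", r_squared)              # close to 1 means a good fit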
Below are some important assumptions of Linear Regression. These are formal checks to carry out while
building a Linear Regression model, which ensure that we get the best possible result from the given dataset.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of independent
variables. With homoscedasticity, there should be no clear pattern distribution of data in the
scatter plot.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is any
correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
4. LEAST SQUARES REGRESSION
Least squares regression is a statistical method commonly used in machine learning for analyzing and
modelling data. It involves finding the line of best fit that minimizes the sum of the squared residuals (the
differences between the actual values and the predicted values) between the independent variable(s) and
the dependent variable.
We can use least squares regression for simple linear regression, where there is only one independent
variable, and also for multiple linear regression, where there are several independent variables.
We widely use this method in a variety of fields, such as economics, engineering, and finance, to model
and predict relationships between variables. Before learning least square regression, let’s understand
linear regression.
Linear Regression
Linear regression is one of the basic statistical techniques in regression analysis. People use it for
investigating and modelling the relationship between variables (i.e. dependent variable and one or more
independent variables).
Before being promptly adopted into machine learning and data science, linear models were used as basic
statistical tools to assist prediction analysis and data mining. If the model involves only one regressor
variable (independent variable), it is called simple linear regression, and if the model has more than one
regressor variable, the process is called multiple linear regression.
Let’s consider a simple example of an engineer wanting to analyze vending machines' product delivery
and service operations. He/she wants to determine the relationship between the time required by a
deliveryman to load a machine and the volume of the products delivered. The engineer collected the
delivery time (in minutes) and the volume of the products (in a number of cases) of 25 randomly selected
retail outlets with vending machines. The scatter diagram is the observations plotted on a graph.
Now, if I consider Y as delivery time (dependent variable) and X as product volume delivered
(independent variable), then we can represent the linear relationship between these two variables as
Y = mX + c
Okay! Now that looks familiar. It is the equation of a straight line, where m is the slope and c is the y-
intercept. Our objective is to estimate these unknown parameters in the regression model such that they
give minimal error for the given dataset. Commonly referred to as parameter estimation or model
fitting. In machine learning, the most common method of estimation is the Least Squares method.
Least squares is a commonly used method in regression analysis for estimating the unknown parameters
by creating a model which will minimize the sum of squared errors between the observed data and the
predicted data.
Basically, it is one of the most widely used methods of fitting curves, and it works by making the sum of
squared errors as small as possible. It helps you draw a line of best fit depending on your data points.
Given any collection of a pair of numbers and the corresponding scatter graph, the line of best fit is the
straight line that you can draw through the scatter points to represent the relationship between them best.
So, back to our equation of the straight line, we have:
Y = mX + c
Where,
Y: Dependent Variable
m: Slope
X: Independent Variable
c: y-intercept
Our aim here is to calculate the values of the slope and the y-intercept and substitute them in the equation,
along with the values of the independent variable X, to determine the values of the dependent variable Y.
Let's assume that we have 'n' data points; then we can calculate the slope using the formula below:
m = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
Lastly, we substitute these values in the final equation Y = mX + c. Simple enough, right? Now let’s take
a real-life example and implement these formulas to find the line of best fit.
Let us take a simple dataset to demonstrate the least squares regression method.
Step 1: The first step is to calculate the slope 'm' using the formula above, m = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²).
Step 2: Next, calculate the y-intercept 'c' using the formula c = ymean - m * xmean. For the sample dataset,
the value of c is approximately c = 6.67.
Step 3: Now we have all the information needed for the equation, and by substituting the respective
values in Y = mX + c, we get the following table. Using this information, you can now plot the graph.
This way, the least squares regression method provides the closest relationship between the dependent
and independent variables by minimizing the residuals, i.e. the distances between the data points and the
trend line (or line of best fit). Therefore, the sum of the squared residuals (or errors) is minimal under this
approach.
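The two steps above can be reproduced directly in NumPy. The x and y values below are assumed sample data rather than the original table from the notes, so the resulting m and c will differ from the c ≈ 6.67 quoted above.

# Least squares slope and intercept from the closed-form formulas (illustrative data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([8.0, 10.0, 13.0, 15.0, 18.0])
n = len(x)

# Step 1: slope m = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)

# Step 2: intercept c = ymean - m * xmean
c = y.mean() - m * x.mean()

# Step 3: predictions from Y = mX + c
y_pred = m * x + c
print("m =", m, " c =", c)
print("predictions:", y_pred)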
5. UNDERFITTING AND OVERFITTING
Overfitting and Underfitting are the two main problems that occur in machine learning and degrade the
performance of the machine learning models.
The main goal of each machine learning model is to generalize well. Here, generalization defines the
ability of an ML model to provide a suitable output by adapting to a given set of unknown inputs. It means
that, after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting
and overfitting are the two terms that need to be checked to judge the performance of the model and whether
the model is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms that will help in
understanding this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine learning model to
learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying the machine
learning algorithms. Or it is the difference between the predicted values and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but does not
perform well with the test dataset, then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the
required data points, present in the given dataset. Because of this, the model starts capturing the noise and
inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the
model. The overfitted model has low bias and high variance.
The chances of overfitting increase the more we train our model: the longer we train, the greater the
chance of ending up with an overfitted model.
Example: The concept of the overfitting can be understood by the below graph of the linear regression
output:
As we can see from the above graph, the model tries to cover all the data points present in the scatter plot.
It may look efficient, but in reality it is not. The goal of the regression model is to find the best fit line,
but here we have not obtained a true best fit, so the model will generate prediction errors on new data.
How to avoid Overfitting in the Model
Both overfitting and underfitting cause degraded performance of the machine learning model, but
overfitting is the more common problem, so there are some ways by which we can reduce its occurrence
in our model (a short sketch of how overfitting shows up follows this list).
o Cross-Validation
o Removing features
o Regularization
o Ensembling
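As a rough illustration of how overfitting shows up and how one of the remedies above helps, the sketch below compares a deep, unconstrained decision tree with a depth-limited one (a simple form of regularization) on the same toy data; a large gap between training and test accuracy indicates overfitting. The dataset and depth values are assumptions.

# Detecting overfitting via the train/test accuracy gap (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

overfit_tree = DecisionTreeClassifier(random_state=2).fit(X_tr, y_tr)              # no depth limit
regularized_tree = DecisionTreeClassifier(max_depth=3, random_state=2).fit(X_tr, y_tr)

for name, model in [("unconstrained", overfit_tree), ("depth-limited", regularized_tree)]:
    print(name, "train:", round(model.score(X_tr, y_tr), 3),
          "test:", round(model.score(X_te, y_te), 3))
# A large train/test gap for the unconstrained tree signals low bias / high variance (overfitting).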
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the
data. To avoid overfitting in the model, the feeding of training data can be stopped at an early stage, due to
which the model may not learn enough from the training data. As a result, it may fail to find the best fit of
the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it
reduces the accuracy and produces unreliable predictions.
Example: We can understand the underfitting using below output of the linear regression model:
As we can see from the above diagram, the model is unable to capture the data points present in the plot.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning models to
achieve the goodness of fit. In statistics modeling, it defines how closely the result or predicted values
match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally, it makes
predictions with 0 errors, but in practice, it is difficult to achieve it.
As we train our model for some time, the errors in the training data go down, and the same happens
with the test data. But if we train the model for too long, the performance of the model may decrease
due to overfitting, as the model also learns the noise present in the dataset. There are two other
methods by which we can find a good stopping point for our model: the resampling method to
estimate model accuracy, and the use of a validation dataset.
6. CROSS VALIDATION
Cross-validation is a technique for validating the model efficiency by training it on the subset of input
data and testing on previously unseen subset of the input data. We can also say that it is a technique to
check how a statistical model generalizes to an independent dataset.
In machine learning, there is always the need to test the stability of the model. It means we cannot judge
the model by fitting it only on the training dataset. For this purpose, we reserve a particular sample of the
dataset which was not part of the training dataset. After that, we test our model on that sample before
deployment, and this complete process comes under cross-validation. This is something different from
the general train-test split.
Hence the basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.
o Train the model using the remaining part of the dataset.
o Now, evaluate model performance using the validation set. If the model performs well with the
validation set, perform the further steps, else check for issues.
There are some common methods that are used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
6. Holdout Method
Validation Set Approach
In the validation set approach, we divide our input dataset into a training set and a test or validation set.
Each of the two subsets is given 50% of the dataset.
But it has a big disadvantage: we are using only 50% of the dataset to train our model, so the model may
fail to capture important information in the data. It also tends to give an underfitted model.
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. It means that, if there are n data points in
total in the original input dataset, then n-p data points will be used as the training dataset and the p data
points as the validation set. This complete process is repeated for all possible combinations of p points,
and the average error is calculated to know the effectiveness of the model.
There is a disadvantage of this technique; that is, it can be computationally difficult for the large p.
Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p we take 1 data point out of the
training set. It means that, in this approach, for each learning set, only one data point is reserved, and the
remaining dataset is used to train the model. This process repeats for each data point. Hence for n samples,
we get n different training sets and n test sets. It has the following features:
o In this approach, the bias is minimum as all the data points are used.
o This approach leads to high variation in testing the effectiveness of the model as we iteratively
check against one data point.
K-Fold Cross-Validation
The K-fold cross-validation approach divides the input dataset into K groups of samples of equal size.
These samples are called folds. For each learning set, the prediction function uses k-1 folds for training,
and the remaining fold is used as the test set. This approach is a very popular CV approach because it is
easy to understand, and the output is less biased than other methods.
The steps for k-fold cross-validation are:
o Split the input dataset into K groups of equal size.
o For each group, take that group as the reserve or test data set and use the remaining groups as the
training dataset.
o Fit the model on the training set and evaluate the performance of the model using the test set.
Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. On the 1st iteration,
the first fold is reserved for testing the model, and the rest are used to train the model. On the 2nd iteration,
the second fold is used to test the model, and the rest are used to train the model. This process continues
until each fold has been used as the test fold.
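A minimal 5-fold cross-validation sketch using scikit-learn is given below; the toy dataset and the logistic regression model are placeholders chosen for illustration.

# 5-fold cross-validation (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=3)

kfold = KFold(n_splits=5, shuffle=True, random_state=3)   # dataset grouped into 5 folds
model = LogisticRegression(max_iter=1000)

# On each iteration one fold is held out for testing and the other four are used for training
scores = cross_val_score(model, X, y, cv=kfold)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())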
Stratified k-fold cross-validation
This technique is similar to k-fold cross-validation, with a few small changes. This approach works on the
concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a
good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.
It can be understood with an example of housing prices: the price of some houses can be much higher
than that of other houses. To handle such situations, a stratified k-fold cross-validation technique is useful.
Holdout Method
This method is the simplest cross-validation technique of all. In this method, we remove a subset of the
data and use the rest of the dataset to train the model; the removed subset is then used to get prediction
results. The error that occurs in this process tells how well our model will perform on an unknown dataset.
Although this approach is simple to perform, it still faces the issue of high variance, and it can also
produce misleading results.
o Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio
such as 70:30 or 80:20. It provides high variance, which is one of its biggest disadvantages
(a minimal sketch is given after this list).
o Training Data: The training data is used to train the model, and the dependent variable is
known.
o Test Data: The test data is used to make predictions from the model that is already trained
on the training data. It has the same features as the training data but is not part of it.
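A minimal sketch of the holdout (train/test split) method follows, assuming a toy dataset and an 80:20 ratio.

# Holdout method: a single train/test split (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=4)

# 80:20 split; the held-out 20% is never seen during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
# Repeating with different random_state values shows the high variance of this method.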
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides the optimum output, but for inconsistent data it may produce
drastically wrong results. This is one of the big disadvantages of cross-validation, as there is no
certainty about the kind of data that will be encountered in machine learning.
o In predictive modeling, the data evolves over time, which may cause differences between the
training and validation sets. For example, if we create a model for the prediction of stock market
values and the model is trained on the previous 5 years of stock values, the realistic future values
for the next 5 years may be drastically different, so it is difficult to expect correct output in such
situations.
Applications of Cross-Validation
o This technique can be used to compare the performance of different predictive modeling methods.
o It can also be used for the meta-analysis, as it is already being used by the data scientists in the
field of medical statistics.
7. LASSO REGRESSION
Lasso regression is like linear regression, but it uses a "shrinkage" technique in which the regression
coefficients are shrunk towards zero.
Linear regression gives you regression coefficients as observed in the dataset. The lasso regression allows
you to shrink or regularize these coefficients to avoid overfitting and make them work better on different
datasets.
This type of regression is used when the dataset shows high multicollinearity or when you want to
automate variable elimination and feature selection.
Choosing a model depends on the dataset and the problem statement you are dealing with. It is essential
to understand the dataset and how features interact with each other.
Lasso regression penalizes less important features of your dataset and makes their respective coefficients
zero, thereby eliminating them. Thus it provides you with the benefit of feature selection and simple
model creation.
So, if the dataset has high dimensionality and high correlation, lasso regression can be used.
Statistics of lasso regression
If d1, d2, d3, etc. denote the distances between the actual data points and the fitted model line, then the
least-squares criterion is the sum of the squares of these distances between the points and the fitted curve.
In linear regression, the best model is chosen so as to minimize this least-squares criterion.
While performing lasso regression, we add a penalizing factor to the least-squares criterion. That is, the
model is chosen so as to minimize the following loss function:
Loss = Σ (yi - ŷi)² + λ * Σ |βj|
where ŷi is the predicted value and the βj are the estimated coefficients. The lasso penalty consists of all
the estimated parameters. Lambda (λ) can be any value from zero to infinity; this value decides how
aggressively regularization is performed, and it is usually chosen using cross-validation.
Lasso penalizes the sum of absolute values of coefficients. As the lambda value increases, coefficients
decrease and eventually become zero. This way, lasso regression eliminates insignificant variables from
our model.
Our regularized model may have slightly higher bias than linear regression, but less variance for future
predictions.
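The sketch below shows lasso at work with scikit-learn: as lambda (called alpha in scikit-learn) grows, more coefficients are driven exactly to zero, which is the feature-selection effect described above. The data and alpha values are made up for illustration.

# Lasso regression: coefficients shrink to zero as the penalty grows (illustrative sketch)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Toy data with 10 features, only 3 of which are truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=5)

for alpha in [0.01, 1.0, 10.0]:            # alpha plays the role of lambda
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0)
    print(f"alpha={alpha:5}: {n_zero} of 10 coefficients are exactly zero")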
8. CLASSIFICATION
Supervised Machine Learning is where you have input variables (x) and an output variable (Y) and you
use an algorithm to learn the mapping function from the input to the output Y = f(X). The goal is to
approximate the mapping function so well that when you have new input data (x) you can predict the
output variables (Y) for that data.
Supervised learning problems can be further grouped into Regression and Classification problems.
Regression: Regression algorithms are used to predict a continuous numerical output. For
example, a regression algorithm could be used to predict the price of a house based on its size,
location, and other features.
Classification: Classification algorithms are used to predict a categorical output. For example, a
classification algorithm could be used to predict whether an email is spam or not.
Classification is a process of categorizing data or objects into predefined classes or categories based on
their features or attributes.
Machine Learning classification is a type of supervised learning technique where an algorithm is trained
on a labeled dataset to predict the class or category of new, unseen data.
The main objective of classification machine learning is to build a model that can accurately assign a
label or category to a new observation based on its features.
For example, a classification model might be trained on a dataset of images labeled as either dogs or cats
and then used to predict the class of new, unseen images of dogs or cats based on their features such as
color, texture, and shape.
Classification Types
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or categories. Example –
On the basis of the given health conditions of a person, we have to determine whether the person has a
certain disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several classes or categories. For
example, on the basis of data about different species of flowers, we have to determine which species our
observation belongs to.
Multi-Label Classification
In multi-label classification, the goal is to predict which of several labels a new data point belongs to,
where each data point may have more than one label. This is different from multiclass classification,
where each data point can only belong to one class. For example, a multi-label classification algorithm
could be used to classify images of animals as belonging to one or more of the categories cat, dog, bird,
or fish.
Imbalanced Classification
In imbalanced classification, the goal is to predict whether a new data point belongs to a minority class,
even though there are many more examples of the majority class. For example, a medical diagnosis
algorithm could be used to predict whether a patient has a rare disease, even though there are many more
patients with common diseases.
Classification Algorithms
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple and computationally
efficient. Some of the linear classification models are as follows:
Logistic Regression
Single-layer Perceptron
Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They can capture more
complex relationships between the input features and the target variable. Some of the non-
linear classification models are as follows:
K-Nearest Neighbours
Kernel SVM
Naive Bayes
Random Forests
AdaBoost
Bagging Classifier
Voting Classifier
ExtraTrees Classifier
In machine learning, classification learners can also be classified as either “lazy” or “eager” learners.
Lazy Learners: Lazy learners are also known as instance-based learners; they do not learn a model
during the training phase. Instead, they simply store the training data and use it to classify new
instances at prediction time. Training is therefore very fast, but prediction can be slow because the
required computations are deferred until prediction time, and lazy learners are less effective in
high-dimensional spaces or when the number of training instances is large. Examples of lazy learners
include k-nearest neighbors and case-based reasoning.
Eager Learners: Eager learners are also known as model-based learners; they learn a model from the
training data during the training phase and use this model to classify new instances at prediction time.
Training takes longer, but prediction is fast, and eager learners tend to be more effective in
high-dimensional spaces and with large training datasets. Examples of eager learners include decision
trees, random forests, and support vector machines.
Evaluating a classification model is an important step in machine learning, as it helps to assess the
performance and generalization ability of the model on new, unseen data. There are several metrics and
techniques that can be used to evaluate a classification model, depending on the specific problem and
requirements. Here are some commonly used evaluation metrics:
Classification Accuracy: The proportion of correctly classified instances over the total number of
instances in the test set. It is a simple and intuitive metric but can be misleading in imbalanced
datasets where the majority class dominates the accuracy score.
Confusion matrix: A table that shows the number of true positives, true negatives, false
positives, and false negatives for each class, which can be used to calculate various evaluation
metrics.
Precision and Recall: Precision measures the proportion of true positives over the total number
of predicted positives, while recall measures the proportion of true positives over the total number
of actual positives. These metrics are useful in scenarios where one class is more important than
the other, or when there is a trade-off between false positives and false negatives.
F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) /
(precision + recall). It is a useful metric for imbalanced datasets where both precision and recall
are important.
ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the true
positive rate (recall) against the false positive rate (1-specificity) for different threshold values of
the classifier’s decision function. The Area Under the Curve (AUC) measures the overall
performance of the classifier, with values ranging from 0.5 (random guessing) to 1 (perfect
classification).
Cross-validation: A technique that divides the data into multiple folds and trains the model on
each fold while testing on the others, to obtain a more robust estimate of the model’s performance.
It is important to choose the appropriate evaluation metric(s) based on the specific problem and
requirements, and to avoid overfitting by evaluating the model on independent test data.
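The evaluation metrics listed above can be computed directly with scikit-learn, as in the minimal sketch below; the imbalanced toy dataset and the logistic regression classifier are arbitrary stand-ins.

# Computing common classification metrics (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=6)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]     # probability of the positive class, for ROC/AUC

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1-score :", f1_score(y_te, y_pred))
print("AUC      :", roc_auc_score(y_te, y_prob))
print("confusion matrix:\n", confusion_matrix(y_te, y_pred))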
Characteristics of Classification
Here are the characteristics of the classification:
Categorical Target Variable: Classification deals with predicting categorical target variables
that represent discrete classes or labels. Examples include classifying emails as spam or not spam,
predicting whether a patient has a high risk of heart disease, or identifying image objects.
Accuracy and Error Rates: Classification models are evaluated based on their ability to
correctly classify data points. Common metrics include accuracy, precision, recall, and F1-score.
Model Complexity: Classification models range from simple linear classifiers to more complex
nonlinear models. The choice of model complexity depends on the complexity of the relationship
between the input features and the target variable.
The basic idea behind classification is to train a model on a labeled dataset, where the input data is
associated with their corresponding output labels, to learn the patterns and relationships between the input
data and output labels. Once the model is trained, it can be used to predict the output labels for new
unseen data.
Before getting started with classification, it is important to understand the problem you are trying to
solve. What are the class labels you are trying to predict? What is the relationship between the input data
and the class labels?
Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7 independent
variables, called features. This means there can be only two possible outcomes:
o The patient has the disease (class 1)
o The patient does not have the disease (class 0)
Data preparation
Once you have a good understanding of the problem, the next step is to prepare your data. This includes
collecting and preprocessing the data and splitting it into training, validation, and test sets. In this step, the
data is cleaned, preprocessed, and transformed into a format that can be used by the classification
algorithm.
X: the independent features, in the form of an N*M matrix, where N is the number of observations
and M is the number of features.
Feature Extraction
The relevant features or attributes are extracted from the data that can be used to differentiate between the
different classes.
Suppose our input X has 7 independent features, but only 5 of them influence the label or target values,
while the remaining 2 are negligibly correlated or uncorrelated; then we will use only these 5 features for
model training.
Model Selection
There are many different models that can be used for classification, including logistic regression,
decision trees, support vector machines (SVM), or neural networks. It is important to select a model
that is appropriate for your problem, taking into account the size and complexity of your data, and the
computational resources you have available.
Model Training
Once you have selected a model, the next step is to train it on your training data. This involves adjusting
the parameters of the model to minimize the error between the predicted class labels and the actual class
labels for the training data.
Model Evaluation
Evaluating the model: After training the model, it is important to evaluate its performance on a validation
set. This will give you a good idea of how well the model is likely to perform on new, unseen data.
Log Loss or Cross-Entropy Loss, Confusion Matrix, Precision, Recall, and AUC-ROC curve are the
quality metrics used for measuring the performance of the model.
If the model’s performance is not satisfactory, you can fine-tune it by adjusting the parameters, or trying a
different model.
Finally, once we are satisfied with the performance of the model, we can deploy it to make predictions on
new data, and it can then be used for real-world problems.
Classification algorithms are widely used in many real-world applications across various domains,
including:
Medical diagnosis
Image classification
Sentiment analysis.
Fraud detection
Quality control
Recommendation systems
9. LOGISTIC REGRESSION
o Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but
instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0
and 1.
o Logistic regression is very similar to linear regression, except in how it is used.
Linear regression is used for solving regression problems, whereas logistic regression is used
for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification. The below image is
showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling like regression; therefore, it is called
logistic regression. However, because it is used to classify samples, it falls under the classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit,
so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the
logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of
either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
The logistic regression equation can be obtained from the linear regression equation. The mathematical
steps to get the logistic regression equation are given below:
o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In logistic regression, y can be between 0 and 1 only, so let's divide the above equation by (1-y):
y / (1-y)   (which is 0 for y = 0 and infinity for y = 1)
o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it
becomes:
log[ y / (1-y) ] = b0 + b1x1 + b2x2 + ... + bnxn
A minimal sketch of the resulting sigmoid function and thresholding is given after these steps.
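The code below implements the sigmoid function and applies a 0.5 threshold to turn probabilities into class labels; the linear coefficients b0 and b1 are made-up values, not ones estimated from data.

# Sigmoid function and thresholding (illustrative values)
import numpy as np

def sigmoid(z):
    """Map any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Assume a fitted linear part b0 + b1*x with made-up coefficients
b0, b1 = -3.0, 1.5
x = np.array([0.5, 1.0, 2.0, 3.0, 4.0])

probabilities = sigmoid(b0 + b1 * x)               # values between 0 and 1
predictions = (probabilities >= 0.5).astype(int)   # threshold at 0.5
print("probabilities:", np.round(probabilities, 3))
print("predicted classes:", predictions)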
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types
of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
10. GRADIENT DESCENT
Gradient Descent is one of the most commonly used optimization algorithms for training machine
learning models, by minimizing the error between actual and expected results. Gradient descent is also
used to train neural networks.
The best way to define the local minimum or local maximum of a function using gradient descent is as
follows:
o If we move towards a negative gradient or away from the gradient of the function at the current
point, it will give the local minimum of that function.
o Whenever we move towards a positive gradient or towards the gradient of the function at the
current point, we will get the local maximum of that function.
Moving along the negative gradient towards a minimum is known as gradient descent (also called steepest
descent), while moving along the positive gradient is known as gradient ascent. The main
objective of using a gradient descent algorithm is to minimize the cost function using iteration. To
achieve this goal, it performs two steps iteratively:
o Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
o Moves in the direction opposite to the gradient, i.e. away from the direction in which the slope
increases, from the current point by alpha times the gradient, where alpha is defined as the
learning rate. The learning rate is a tuning parameter in the optimization process which helps to
decide the length of the steps (see the sketch after this list).
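A minimal sketch of these two iterative steps for a one-dimensional cost function J(w) = (w - 3)² is shown below; the cost function, learning rate and iteration count are arbitrary choices for illustration.

# Basic gradient descent loop (illustrative sketch)
def cost(w):
    return (w - 3.0) ** 2          # simple convex cost with minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)         # first-order derivative of the cost

w = 0.0                            # arbitrary starting point
alpha = 0.1                        # learning rate (step size)

for step in range(50):
    w = w - alpha * gradient(w)    # move against the gradient by alpha times its value

print("w after gradient descent:", round(w, 4), " cost:", round(cost(w), 6))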
What is Cost-function?
The cost function is defined as the measurement of the difference, or error, between actual values and
predicted values at the current position, expressed in the form of a single real number. It helps to improve
machine learning efficiency by providing feedback to the model so that it can minimize the error and
find the local or global minimum. Gradient descent continuously iterates along the direction of the
negative gradient until the cost function approaches its minimum; at that point, the model stops learning
further. Although the terms cost function and loss function are often used synonymously, there is a
minor difference between them: the loss function refers to the error of one training example, while the
cost function calculates the average error across the entire training set.
The cost function is calculated after making a hypothesis with initial parameters and modifying these
parameters using gradient descent algorithms over known data to reduce the cost function.
For a simple linear regression model with parameters a0 (intercept) and a1 (slope), this setup can be
summarized as:
Hypothesis: h(x) = a0 + a1x
Parameters: a0, a1
Cost function: J(a0, a1) = (1/N) * Σ (h(xi) - yi)²
Goal: minimize J(a0, a1)
Before starting the working principle of gradient descent, we should know some basic concepts to find
out the slope of a line from linear regression. The equation for simple linear regression is given as:
Y = mX + c
Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.
An arbitrary starting point is chosen to evaluate the initial performance. At this starting point, we compute
the first derivative (slope) and use a tangent line to measure the steepness of this slope. This slope informs
the updates to the parameters (weights and bias).
The slope is steep at the starting point, but whenever new parameters are generated, the steepness
gradually reduces; at the lowest point the slope approaches zero, and this point is called the point of
convergence.
The main objective of gradient descent is to minimize the cost function, or the error between expected and
actual values. To minimize the cost function, two factors are required:
o The direction of the step (the negative gradient)
o The learning rate
These two factors determine the parameter updates of future iterations and allow the algorithm to arrive at
the point of convergence, i.e. the local or global minimum. Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically a small value
that is evaluated and updated based on the behavior of the cost function. If the learning rate is high, it
results in larger steps, but it also risks overshooting the minimum. A low learning rate gives small step
sizes, which compromises overall efficiency but gives the advantage of more precision.
Based on the error in various training models, the Gradient Descent learning algorithm can be divided
into Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's
understand these different types of gradient descent:
Batch gradient descent (BGD) is used to find the error for each point in the training set and update the
model after evaluating all training examples. This procedure is known as the training epoch. In simple
words, it is a greedy approach where we have to sum over all examples for each update.
o It is computationally efficient, as all resources are used to process all training samples together.
Stochastic gradient descent (SGD) is a type of gradient descent that runs one training example per
iteration. In other words, it processes a training epoch example by example within the dataset and updates
the parameters after each training example. As it requires only one training example at a time, it is easier
to store in allocated memory. However, it loses some computational efficiency in comparison to batch
gradient descent because of its frequent updates. Due to these frequent updates, the gradient is also noisy;
however, this noise can sometimes be helpful in escaping local minima and finding the global minimum.
In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages
over other types of gradient descent.
Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent.
It divides the training dataset into small batches and then performs an update for each of those batches.
Splitting the training dataset into smaller batches strikes a balance between the computational efficiency
of batch gradient descent and the speed of stochastic gradient descent. Hence, we can achieve a special
type of gradient descent with higher computational efficiency and a less noisy gradient.
o It is computationally efficient (a small sketch contrasting the three variants follows below).
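The sketch below shows how the three variants differ only in how much data is used per parameter update, using a plain NumPy loop for simple linear regression; the batch sizes, learning rate and data are made up for illustration (batch_size = n gives batch gradient descent, 1 gives SGD, and anything in between gives mini-batch).

# Batch / stochastic / mini-batch gradient descent differ only in the batch size (sketch)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 4.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)   # y = 4x + 1 + noise

def run_gradient_descent(batch_size, alpha=0.05, epochs=200):
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):                       # one epoch = one pass over the data
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size] # current mini-batch
            xb, yb = X[idx, 0], y[idx]
            err = (w * xb + b) - yb
            w -= alpha * np.mean(err * xb)        # gradient of the squared error w.r.t. w on this batch
            b -= alpha * np.mean(err)             # gradient of the squared error w.r.t. b on this batch
    return w, b

for batch_size in [len(X), 1, 32]:                # batch, stochastic, mini-batch
    w, b = run_gradient_descent(batch_size)
    print(f"batch_size={batch_size:3}: w={w:.3f}, b={b:.3f}")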
Although we know Gradient Descent is one of the most popular methods for optimization problems, it
still also has some challenges. There are a few challenges as follows:
For convex problems, gradient descent can find the global minimum easily, while for non-convex
problems, it is sometimes difficult to find the global minimum, where the machine learning models
achieve the best results.
Whenever the slope of the cost function is at zero or just close to zero, the model stops learning further.
Apart from the global minimum, there are other points that can show this flat slope: saddle points and
local minima. A local minimum has a shape similar to the global minimum, where the slope of the cost
function increases on both sides of the current point.
In contrast, at a saddle point the negative gradient only occurs on one side of the point, which reaches a
local maximum on one side and a local minimum on the other side. The name saddle point is taken from
a horse's saddle.
The name local minimum is given because the value of the loss function is minimum at that point within a
local region. In contrast, the name global minimum is given because the value of the loss function is
minimum there globally, across the entire domain of the loss function.
In a deep neural network, if the model is trained with gradient descent and backpropagation, there can
occur two more issues other than local minima and saddle point.
Vanishing Gradients:
The vanishing gradient problem occurs when the gradient is smaller than expected. During backpropagation,
this gradient becomes progressively smaller, causing the earlier layers of the network to learn more slowly
than the later layers. When this happens, the weight updates become insignificant and the earlier layers
effectively stop learning.
Exploding Gradient:
Exploding gradient is just the opposite of the vanishing gradient: it occurs when the gradient is too large
and creates an unstable model. In this scenario, the model weights grow too large and may eventually be
represented as NaN. This problem can be addressed using the dimensionality reduction technique, which
helps to minimize complexity within the model.
11. SUPPORT VECTOR MACHINE (SVM)
Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, and it is
used for classification as well as regression problems. However, it is primarily used for classification
problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below
diagram, in which there are two different categories that are classified using a decision boundary or
hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a
strange cat that also has some features of dogs; if we want a model that can accurately identify whether it
is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model
with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then
we test it with this strange creature. Since the support vector machine creates a decision boundary between
these two classes (cat and dog) and chooses the extreme cases (support vectors), it will consider the
extreme cases of cats and dogs. On the basis of the support vectors, it will classify the new example as a
cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be
classified into two classes by using a single straight line, then such data is termed linearly
separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a
dataset cannot be classified by using a straight line, then such data is termed non-linear data,
and the classifier used is called the Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are
2 features, then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will
be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum distance
between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of the
hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence called a
Support vector.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:
Since it is a 2-d space, just by using a straight line we can easily separate these two classes. But there can
be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region
is called a hyperplane. The SVM algorithm finds the closest points to the line from both classes. These
points are called support vectors. The distance between the support vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.
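A minimal linear SVM sketch with scikit-learn is shown below; the linearly separable toy data from make_blobs and the C value are assumptions for illustration. The fitted support_vectors_ attribute exposes the extreme points that define the margin.

# Linear SVM on (roughly) linearly separable toy data (illustrative sketch)
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for the green/blue classes
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=7)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
# clf.support_vectors_ holds the points closest to the maximum-margin hyperplane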
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d space, the separating surface looks like a plane parallel to the x-y plane. If we convert it
back to 2-d space by taking z = 1, the boundary becomes a circle of radius 1 around the origin:
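The sketch below makes this idea concrete: points on two concentric circles are not linearly separable in (x, y), but after adding the third feature z = x² + y² a linear SVM separates them easily. The circle data from make_circles is an assumption for illustration.

# Making non-linear data separable by adding z = x^2 + y^2 (illustrative sketch)
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=8)

# A linear SVM in the original 2-d space struggles with the concentric circles
linear_2d = SVC(kernel="linear").fit(X, y)
print("accuracy in (x, y) space:   ", linear_2d.score(X, y))

# Add the third dimension z = x^2 + y^2 and fit a linear SVM in 3-d
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3d = np.hstack([X, z])
linear_3d = SVC(kernel="linear").fit(X3d, y)
print("accuracy in (x, y, z) space:", linear_3d.score(X3d, y))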
12. KERNEL METHODS
In the realm of machine learning, kernels hold a pivotal role, especially in algorithms designed for
classification and regression tasks like Support Vector Machines (SVMs). The kernel function is the heart
of these algorithms, adept at simplifying the complexity inherent in data. It transforms non-linear
relationships into a linear format, making them accessible for algorithms that traditionally only handle
linear data.
This transformation is important for allowing SVMs to unravel and make sense of complex patterns and
relationships. Kernels achieve this without the computational intensity of mapping data to higher
dimensions explicitly. Their efficiency and effectiveness in revealing hidden patterns make them a
cornerstone in modern machine learning.
As we explore kernels further, we uncover their significance in enhancing the performance and
applicability of SVMs in diverse scenarios.
The concept of a kernel in machine learning offers a compelling and intuitive way to understand this
powerful tool used in Support Vector Machines (SVMs). At its most fundamental level, a kernel is a
relatively straightforward function that operates on two vectors from the input space, commonly referred
to as the X space. The primary role of this function is to return a scalar value, but the fascinating aspect of
this process lies in what this scalar represents and how it is computed.
This scalar is, in essence, the dot product of the two input vectors. However, it's not computed in the
original space of these vectors. Instead, it's as if this dot product is calculated in a much higher-
dimensional space, known as the Z space. This is where the kernel's true power and elegance come into
play. It manages to convey how close or similar these two vectors are in the Z space without the
computational overhead of actually mapping the vectors to this higher-dimensional space and calculating
their dot product there.
The kernel thus serves as a kind of guardian of the Z space. It allows you to glean the necessary
information about the vectors in this more complex space without having to access the space directly.
This approach is particularly useful in SVMs, where understanding the relationship and position of
vectors in a higher-dimensional space is crucial for classification tasks.
The "Kernel Trick" is a clever technique in machine learning that allows algorithms, especially those used
in Support Vector Machines (SVMs), to operate in a high-dimensional space without directly computing
the coordinates in that space. The reason it's called a "trick" is because it cleverly circumvents the
computationally intensive task of mapping data points into a higher-dimensional space, which is often
necessary for making complex, non-linear classifications.
Firstly, a kernel takes the data from its original space and implicitly maps it to a higher-dimensional
space. This is crucial when dealing with data that is not linearly separable in its original form. Instead of
performing computationally expensive high-dimensional calculations, the kernel function calculates the
relationships or similarities between pairs of data points as if they were in this higher-dimensional space.
This calculation of similarities is fundamental to how kernels facilitate complex classifications. In the
context of SVMs, for instance, the kernel function computes the dot product of input data pairs in the
transformed space. This process effectively determines the relationships between data points, allowing
the SVM to find separating hyperplanes (boundaries) that can categorize data points into different classes,
even when such categorization isn't apparent in the original space.
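To make this idea concrete, here is a small NumPy sketch (an illustration added to these notes, not part of the original text) comparing an explicit degree-2 feature map, standing in for the Z space, with the equivalent polynomial kernel computed entirely in the X space; the function names phi and poly_kernel are invented.

import numpy as np

def phi(v):
    # Explicit map of a 2-D input into the higher-dimensional Z space.
    x1, x2 = v
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(a, b):
    # Kernel computed entirely in the original X space.
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

# Both numbers are identical: the kernel returns the Z-space dot product
# without ever constructing the Z-space vectors.
print(np.dot(phi(a), phi(b)))   # explicit mapping
print(poly_kernel(a, b))        # kernel trick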
Choosing the right kernel for a machine learning task, such as in Support Vector Machines (SVMs), is a
critical decision that can significantly impact the performance of the model. The selection process
involves understanding both the nature of the data and the specific requirements of the task at hand.
Firstly, it's important to consider the distribution and structure of the data. If the data is linearly separable,
a linear kernel may be sufficient. However, for more complex, non-linear data, a polynomial or radial
basis function (RBF) kernel might be more appropriate.
The polynomial kernel, for example, is effective for datasets where the relationship between variables is
not merely linear but involves higher-degree interactions. On the other hand, the RBF kernel, often a go-
to choice, is particularly useful for datasets where the decision boundary is not clear, and the data points
form more of a cloud-like structure.
Another crucial aspect is the tuning of kernel parameters, which can drastically influence the model's
accuracy. For instance, in the RBF kernel, the gamma parameter defines how far the influence of a single
training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The correct
setting of these parameters often requires iterative experimentation and cross-validation to avoid
overfitting and underfitting.
In practical scenarios, it's also advisable to consider computational efficiency. Some kernels might lead to
quicker convergence and less computational overhead, which is essential in large-scale applications or
when working with vast datasets.
Lastly, domain knowledge can play a significant role in kernel selection. Understanding the underlying
phenomena or patterns in the data can guide the choice of the kernel. For example, in text classification or
natural language processing, certain kernels might be more effective in capturing the linguistic structures
and nuances.
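As a hedged sketch of how this selection might be carried out in practice with scikit-learn, the example below compares linear, polynomial, and RBF kernels, together with a few C and gamma values, using cross-validation; the dataset and the parameter grid are arbitrary choices for illustration, not recommendations from the notes.

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Non-linear, cloud-like data where an RBF kernel is often a reasonable starting point.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Compare kernels, C, and gamma with cross-validation instead of guessing.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1, 10], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("Best kernel and parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))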
13.INSTANCE-BASED LEARNING
The Machine Learning systems which are categorized as instance-based learning are the systems that
learn the training examples by heart and then generalize to new instances based on some similarity
measure. It is called instance-based because it builds the hypotheses from the training instances. It is also
known as memory-based learning or lazy learning (because processing is delayed until a new instance
must be classified). The time complexity of this algorithm depends upon the size of the training data. Each
time a new query is encountered, the previously stored data is examined and a target function value is
assigned to the new instance.
The worst-case time complexity of this algorithm is O (n), where n is the number of training instances.
For example, If we were to create a spam filter with an instance-based learning algorithm, instead of just
flagging emails that are already marked as spam emails, our spam filter would be programmed to also
flag emails that are very similar to them. This requires a measure of resemblance between two emails. A
similarity measure between two emails could be the same sender or the repetitive use of the same
keywords or something else.
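A tiny sketch of what such a similarity measure could look like is given below; it is purely illustrative, uses made-up example emails, and measures resemblance as the overlap between the two sets of words (the Jaccard similarity).

def jaccard_similarity(email_a, email_b):
    # Resemblance as the overlap between the sets of words in the two emails.
    words_a, words_b = set(email_a.lower().split()), set(email_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

known_spam = "win a free prize claim your reward now"
new_email = "claim your free prize now"
print(jaccard_similarity(known_spam, new_email))  # a high value suggests the new email is spam-like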
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the target
function.
2. This algorithm can adapt easily to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high
2. A large amount of memory is required to store the data, and each query involves identifying a local
model from scratch.
Case-Based Reasoning is another example of an instance-based learning method, alongside the K-Nearest
Neighbour algorithm described in the next section.
14.K-NEAREST NEIGHBORS
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available cases and puts the
new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for
the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an action on
the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want
to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it
works on a similarity measure. Our KNN model will find the features of the new data set that are
similar to the cat and dog images, and based on the most similar features it will put it in either the
cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1, so
this data point will lie in which of these categories. To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular dataset.
Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and the stored data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is
the distance between two points, which we have already studied in geometry. Between points
(x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 - x1)² + (y2 - y1)²)
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must belong
to category A.
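The same steps can be reproduced with scikit-learn's KNeighborsClassifier; the two-feature training points below are invented so that, as in the example above, three of the five nearest neighbours fall in category A and two in category B.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features per point, labels 'A' and 'B'.
X_train = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

# Step-1: choose K = 5; Steps 2-5 (distances, neighbours, voting) happen inside predict().
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

new_point = np.array([[4, 4]])
# For this point the 5 nearest neighbours are 3 from 'A' and 2 from 'B', so 'A' wins the vote.
print(knn.predict(new_point))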
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the
model.
o Large values for K reduce the effect of noise, but very large values can include points from other
categories and make the class boundaries less distinct.
Advantages of the KNN algorithm:
o It is simple to implement.
Disadvantages of the KNN algorithm:
o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points for all
the training samples.
15.TREE-BASED METHODS
Tree-based machine learning methods are among the most commonly used supervised learning methods.
They are constructed from two entities: branches and nodes. Tree-based ML methods are built by
recursively splitting a training sample, using different features from a dataset at each node that splits the
data most effectively. The splitting is based on learning simple decision rules inferred from the training
data.
Generally, tree-based ML methods are simple and intuitive; to predict a class label or value, we start from
the top of the tree or the root and, using branches, go to the nodes by comparing features on the basis of
which will provide the best split.
Tree-based methods also use the mean for continuous variables or mode for categorical variables when
making predictions on training observations in the regions they belong to.
Since the set of rules used to segment the predictor space can be summarized in a visual representation
with branches that show all the possible outcomes, these approaches are commonly referred to as decision
tree methods.
The methods are flexible and can be applied to either classification or regression problems. Classification
and Regression Trees (CART) is a commonly used term by Leo Breiman, referring to the flexibility of
the methods in solving both linear and non-linear predictive modeling problems.
Decision trees can be classified based on the type of target or response variable.
i. Classification Trees
The default type of decision tree, used when the response variable is categorical, e.g. predicting whether
a team will win or lose a game.
ii. Regression Trees
Used when the target variable is continuous or numerical in nature, e.g. predicting house prices based on
year of construction, number of rooms, etc.
Advantages of tree-based methods:
1. Interpretability: Decision tree methods are easy to understand even for non-technical people.
2. The data type isn’t a constraint, as the methods can handle both categorical and numerical
variables.
3. Data exploration — Decision trees help us easily identify the most significant variables and their
correlation.
Disadvantages of tree-based methods:
1. Large decision trees are complex, time-consuming and less accurate in predicting outcomes.
2. Decision trees don’t fit well for continuous variables, as they lose important information when
segmenting the data into different regions.
Common Terminology
i) Root node — this represents the entire population or the sample, which gets divided into two or more
homogenous subsets.
ii) Splitting - the process of dividing a node into two or more sub-nodes according to given conditions.
iii) Decision node — this is when a sub-node is divided into further sub-nodes.
iv) Leaf/Terminal node — this is the final/last node that we consider for our model output. It cannot be
split further.
v) Pruning - the process of removing unwanted sub-nodes; it is the opposite of splitting.
vi) Branch/Sub-tree - a subsection of the entire tree.
vii) Parent and Child node — a node that's subdivided into a sub-node is a parent, while the sub-node is
the child node.
Algorithms in Tree-based Machine Learning Models
The decision of splitting a tree affects its accuracy. Tree-based machine learning models use multiple
algorithms to decide where to split a node into two or more sub-nodes. The creation of sub-nodes
increases the homogeneity of the resultant sub-nodes. Algorithm selection is based on the type of target
variable.
Suppose you’re the basketball coach of a grade school. The inter-school basketball competitions are
nearby and you want to do a survey to determine which students play basketball in their leisure time. The
sample selected is 40 students. The selection criterion is based on a number of factors such as gender,
height, and class.
As a coach, you’d want to select the students based on the most significant input variable among the three
variables.
Decision tree algorithms will help the coach identify the right sample of students using the variable,
which creates the best homogenous set of student players.
16.DECISION TREE
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree
into subtrees.
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of
the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves
further. It continues the process until it reaches the leaf node of the tree. The complete process can be
better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes; the
final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept
the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary attribute by
ASM). The root node splits further into the next decision node (distance from the office) and one leaf
node based on the corresponding labels. The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram:
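A minimal scikit-learn sketch of the same idea is shown below; the toy job-offer data and the numeric encoding of the features are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented candidates: [salary_in_lakhs, distance_km, has_cab_facility]
X = [[12, 5, 1], [12, 25, 1], [12, 25, 0], [6, 5, 1], [6, 25, 0], [15, 10, 1]]
y = ["Accept", "Accept", "Decline", "Decline", "Decline", "Accept"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned decision rules (the root attribute first, then the further splits).
print(export_text(tree, feature_names=["salary", "distance", "cab_facility"]))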
While implementing a Decision tree, the main issue that arises is how to select the best attribute for the root
node and for the sub-nodes. To solve such problems there is a technique which is called the Attribute
selection measure or ASM. By this measurement, we can easily select the best attribute for the nodes of
the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) - [(Weighted Average) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.
Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
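The two formulas can be checked with a short Python sketch; this is an added illustration, and the 9-yes/5-no example split below is invented.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum of P(class) * log2(P(class)) over the classes present.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # Entropy before the split minus the weighted average entropy after the split.
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# 9 "yes" and 5 "no" examples split by an invented attribute into two subsets.
parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(round(entropy(parent), 3))             # about 0.940
print(round(information_gain(parent, split), 3))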
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over an attribute with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o It can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)²
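A matching sketch for the Gini index, again an added illustration with made-up labels:

from collections import Counter

def gini_index(labels):
    # Gini = 1 - sum of squared class probabilities; 0 means a perfectly pure node.
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_index(["yes"] * 10))              # 0.0 -> pure node
print(gini_index(["yes"] * 5 + ["no"] * 5))  # 0.5 -> evenly mixed two-class node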
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision
tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learning tree without reducing
accuracy is known as Pruning. There are mainly two types of tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning
Advantages of the Decision Tree:
o It is simple to understand as it follows the same process which a human follows while making any
decision in real life.
Disadvantages of the Decision Tree:
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
16.ID3
In the realm of machine learning and data mining, decision trees stand as versatile tools for classification
and prediction tasks. The ID3 (Iterative Dichotomiser 3) algorithm serves as one of the foundational
pillars upon which decision tree learning is built. Developed by Ross Quinlan in the 1980s, ID3 remains a
fundamental algorithm, forming the basis for subsequent tree-based methods like C4.5
and CART (Classification and Regression Trees).
Machine learning models called decision trees divide the input data recursively according to features to
arrive at a decision. Every internal node symbolizes a feature, and every branch denotes a potential result
of that feature. It is simple to interpret and visualize thanks to the tree structure. Every leaf node makes a
judgment call or forecast. To maximize information gain or minimize impurity, the best feature is chosen
at each stage of tree construction. Decision trees are adaptable and can be used for both regression and
classification applications. Although they can overfit, this is frequently avoided by employing strategies
like pruning.
Decision Trees
Before delving into the intricacies of the ID3 algorithm, let’s grasp the essence of decision trees. Picture a
tree-like structure where each internal node represents a test on an attribute, each branch signifies an
outcome of that test, and each leaf node denotes a class label or a decision. Decision trees mimic human
decision-making processes by recursively splitting data based on different attributes to create a flowchart-
like structure for classification or regression.
ID3 Algorithm
A well-known decision tree approach for machine learning is the Iterative Dichotomiser 3 (ID3)
algorithm. By choosing the best characteristic at each node to partition the data depending on information
gain, it recursively constructs a tree. The goal is to make the final subsets as homogeneous as possible. By
choosing features that offer the greatest reduction in entropy or uncertainty, ID3 iteratively grows the
tree. The procedure keeps going until a halting requirement is satisfied, like a minimum subset size or a
maximum tree depth. Although ID3 is a fundamental method, other iterations such as C4.5 and CART
have addressed some of its limitations.
The ID3 algorithm is specifically designed for building decision trees from a given dataset. Its primary
objective is to construct a tree that best explains the relationship between attributes in the data and their
corresponding class labels.
1. Selecting the Best Attribute
ID3 employs the concept of entropy and information gain to determine the attribute that best
separates the data. Entropy measures the impurity or randomness in the dataset.
The algorithm calculates the entropy of each attribute and selects the one that results in the most
significant information gain when used for splitting the data.
2. Splitting the Dataset
The chosen attribute is used to split the dataset into subsets based on its distinct values.
For each subset, ID3 recurses to find the next best attribute to further partition the data, forming
branches and new nodes accordingly.
3. Stopping Criteria
The recursion continues until one of the stopping criteria is met, such as when all instances in a
branch belong to the same class or when all attributes have been used for splitting.
4. Handling Missing Values
ID3 can handle missing attribute values by employing various strategies like attribute mean/mode
substitution or using majority class values.
5. Tree Pruning
Pruning is a technique to prevent overfitting. While not directly included in ID3, post-processing
techniques or variations like C4.5 incorporate pruning to improve the tree’s generalization.
Now let’s examine the formulas linked to the main theoretical ideas in the ID3 algorithm:
1. Entropy
A measure of disorder or uncertainty in a set of data is called entropy. Entropy is a tool used in ID3 to
measure a dataset’s disorder or impurity. By dividing the data into as homogenous subsets as feasible, the
objective is to minimize entropy.
For a set S with classes {c1, c2, …, cn}, the entropy is calculated as:
Entropy(S) = - Σi p(ci) log2 p(ci)
where p(ci) is the proportion of examples in S that belong to class ci.
2. Information Gain
A measure of how well a certain attribute reduces uncertainty is called Information Gain. ID3 splits the
data at each stage, choosing the attribute that maximizes Information Gain. It is computed as the
difference between the entropy before and after the split:
Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) * Entropy(Sv)
where |Sv| is the size of the subset of S for which attribute A has value v.
3. Gain Ratio
Gain Ratio is an improvement on Information Gain that considers the inherent worth of characteristics
that have a wide range of possible values. It deals with the bias of Information Gain in favor of
characteristics with more pronounced values.
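Putting the entropy and information gain formulas together, the following compact sketch builds an ID3-style tree by recursion; it is an added illustration rather than the original algorithm's exact pseudocode, and the tiny weather-style dataset and helper names are invented.

from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def info_gain(rows, attr, target):
    # Entropy of the parent minus the weighted entropy of the subsets created by attr.
    total = len(rows)
    gain = entropy(rows, target)
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        gain -= len(subset) / total * entropy(subset, target)
    return gain

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all examples agree -> leaf node
        return labels[0]
    if not attributes:                   # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

data = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(id3(data, ["outlook", "windy"], target="play"))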
Advantages
Interpretability: Decision trees generated by ID3 are easily interpretable, making them suitable
for explaining decisions to non-technical stakeholders.
Handles Categorical Data: ID3 can effectively handle categorical attributes without requiring
explicit data preprocessing steps.
Limitations
Overfitting: ID3 tends to create complex trees that may overfit the training data, impacting
generalization to unseen instances.
Sensitive to Noise: Noise or outliers in the data can lead to the creation of non-optimal or
incorrect splits.
Categorical, Multi-way Splits: ID3 creates one branch for every value of the chosen categorical
attribute and does not directly handle continuous attributes, limiting its ability to represent more
complex relationships present in the data.
17.CART
CART (Classification And Regression Trees) is a variation of the decision tree algorithm. It can
handle both classification and regression tasks. Scikit-Learn uses the Classification And Regression Tree
(CART) algorithm to train Decision Trees (also called "growing" trees). CART was first introduced by
Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984.
CART(Classification And Regression Tree) for Decision Tree
CART is a predictive algorithm used in Machine learning and it explains how the target variable’s values
can be predicted based on other variables. It is a decision tree where each fork is a split on a predictor
variable and each terminal node holds a prediction for the target variable.
The term CART serves as a generic term for the following categories of decision trees:
Classification Trees: The tree is used to determine which "class" the target variable is most likely
to fall into when the target is categorical.
Regression Trees: The tree is used to predict the value of the target variable when it is continuous.
In the decision tree, nodes are split into sub-nodes based on a threshold value of an attribute. The root
node is taken as the training set and is split into two by considering the best attribute and threshold value.
Further, the subsets are also split using the same logic. This continues till the last pure sub-set is found in
the tree or the maximum number of leaves possible in that growing tree.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both
classification and regression tasks. It is a supervised learning algorithm that learns from labelled data to
predict unseen data.
Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target variable.
Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion. The lower the Gini
impurity, the more pure the subset is. For regression tasks, CART uses the reduction in residual (squared)
error as the splitting criterion: the split that reduces the residual error the most gives the best fit of the
model to the data.
Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes that
contribute little to the model accuracy. Cost complexity pruning and information gain pruning are
two popular pruning techniques. Cost complexity pruning involves calculating the cost of each
node and removing nodes that have a negative cost. Information gain pruning involves calculating
the information gain of each node and removing nodes that have a low information gain.
The best split point for each input variable is obtained.
Based on the best-split points of each input in Step 1, the new "best" split point is identified and the
node is split.
Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
CART algorithm uses Gini Impurity to split the dataset into a decision tree. It does that by searching for
the best homogeneity for the sub-nodes, with the help of the Gini index criterion.
The Gini index is a metric for the classification tasks in CART. It stores the sum of squared probabilities
of each class. It computes the degree of probability of a specific variable being wrongly classified when
chosen randomly, and it is a variation of the Gini coefficient. It works on categorical variables, provides
outcomes as either "success" or "failure", and hence conducts binary splitting only.
Gini Index = 1 - Σj (Pj)²
Where a value of 0 depicts that all the elements belong to a certain class, or only one class exists there.
The Gini index of value 1 signifies that all the elements are randomly distributed across various
classes, and
A value of 0.5 denotes the elements are uniformly distributed into some classes.
A classification tree is an algorithm where the target variable is categorical. The algorithm is then used to
identify the “Class” within which the target variable is most likely to fall. Classification trees are used
when the dataset needs to be split into classes that belong to the response variable(like yes or no)
CART for classification is a decision tree learning algorithm that creates a tree-like structure to predict class
labels. The tree consists of nodes, which represent different decision points, and branches, which
represent the possible result of those decisions. Predicted class labels are present at each leaf node of the
tree.
CART for classification works by recursively splitting the training data into smaller and smaller subsets
based on certain criteria. The goal is to split the data in a way that minimizes the impurity within each
subset. Impurity is a measure of how mixed up the data is in a particular subset. For classification tasks,
CART uses Gini impurity
Gini Impurity- Gini impurity measures the probability of misclassifying a random instance from
a subset labeled according to the majority class. Lower Gini impurity means more purity of the
subset.
Splitting Criteria- The CART algorithm evaluates all potential splits at every node and chooses
the one that best decreases the Gini impurity of the resultant subsets. This process continues until
a stopping criterion is reached, like a maximum tree depth or a minimum number of instances in a
leaf node.
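A minimal sketch of CART classification with scikit-learn, which uses Gini impurity by default, is shown below; the built-in Iris dataset is used only as a convenient example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" makes the tree choose, at every node, the split that
# most reduces the Gini impurity of the resulting subsets.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))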
A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict its
value. Regression trees are used when the response variable is continuous. For example, if the response
variable is the temperature of the day.
CART for regression is a decision tree learning method that creates a tree-like structure to predict
continuous target variables. The tree consists of nodes that represent different decision points and
branches that represent the possible outcomes of those decisions. Predicted values for the target variable
are stored in each leaf node of the tree.
Regression CART works by splitting the training data recursively into smaller subsets based on specific
criteria. The objective is to split the data in a way that minimizes the residual reduction in each subset.
Residual Reduction- Residual reduction is a measure of how much the average squared
difference between the predicted values and the actual values for the target variable is reduced by
splitting the subset. The greater this reduction, the better the split fits the data.
Splitting Criteria- CART evaluates every possible split at each node and selects the one that
results in the greatest reduction of residual error in the resulting subsets. This process is repeated
until a stopping criterion is met, such as reaching the maximum tree depth or having too few
instances in a leaf node.
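A matching regression sketch, assuming scikit-learn; the noisy sine-shaped data is invented simply to have a continuous target to predict.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# The default splitting criterion reduces the squared residuals; each leaf
# predicts the mean target value of the training points that fall into it.
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5]]))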
CART models are formed by picking input variables and evaluating split points on those variables until
an appropriate tree is produced.
Greedy algorithm: The input space is divided using a greedy method known as recursive binary
splitting. This is a numerical procedure in which all of the values are lined up and several candidate
split points are tried and assessed using a cost function.
Stopping Criterion: As it works its way down the tree with the training data, the recursive binary
splitting method described above must know when to stop splitting. The most frequent halting
method is to utilize a minimum amount of training data allocated to every leaf node. If the count is
smaller than the specified threshold, the split is rejected and also the node is considered the last
leaf node.
Tree pruning: A decision tree's complexity is defined as the number of splits in the tree. Trees with
fewer branches are recommended as they are simple to grasp and less prone to overfit the data.
Working through each leaf node in the tree and evaluating the effect of deleting it using a hold-out
test set is the quickest and simplest pruning approach.
Data preparation for the CART: No special data preparation is required for the CART
algorithm.
Advantages of CART
Results are simple and easy to interpret.
It can handle both numerical and categorical variables.
It is non-parametric and requires minimal data preparation.
Limitations of CART
Overfitting.
High variance.
Low bias.
The tree structure may be unstable if the training data changes slightly.
18.ENSEMBLE METHODS
Ensemble learning helps improve machine learning results by combining several models. This approach
allows the production of better predictive performance compared to a single model. Basic idea is to learn
a set of classifiers (experts) and to allow them to vote.
Why do ensembles work?
Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the amount of available
data. Hence, there are many hypotheses with the same accuracy on the data and the learning
algorithm chooses only one of them! There is a risk that the accuracy of the chosen hypothesis is
low on unseen data!
Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best
hypothesis.
Representational Problem –
The Representational Problem arises when the hypothesis space does not contain any good
approximation of the target class(es).
The main challenge is not to obtain highly accurate base models, but rather to obtain base models which
make different kinds of errors. For example, if ensembles are used for classification, high accuracies can
be accomplished if different base models misclassify different training examples, even if the base
classifier accuracy is low.
Majority Vote
Randomness Injection
Feature-Selection Ensembles
Boosting
Stacking
Reliable Classification: Meta-Classifier Approach
Co-Training and Self-Training
Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Suppose we have a set D of d
tuples; at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a
bootstrap sample). Then a classifier model Mi is learned for each training set Di. Each classifier Mi returns its
class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X
(an unknown sample).
1. Multiple subsets are created from the original data set with equal tuples, selecting observations
with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models.
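These steps can be sketched with scikit-learn's BaggingClassifier, whose default base model is a decision tree; the synthetic dataset and the number of estimators are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 25 base trees is trained on a bootstrap sample (drawn with replacement),
# and the bagged classifier combines their votes for the final prediction.
bagging = BaggingClassifier(n_estimators=25, bootstrap=True, random_state=0)
bagging.fit(X, y)
print("Training accuracy:", bagging.score(X, y))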
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at each node to determine the split. During
classification, each tree votes and the most popular class is returned.
1. Multiple subsets are created from the original data set, selecting observations with
replacement.
2. A subset of features is selected randomly and whichever feature gives the best split is used
to split the node iteratively.
3. The tree is grown to the largest extent possible.
4. Repeat the above steps; the final prediction is given based on the aggregation of predictions
from n number of trees.
19.RANDOM FOREST
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
It can be used for both Classification and Regression problems in ML. It is based on the concept
of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem
and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and based on the majority votes of predictions, and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest Algorithm, you should have knowledge of the Decision
Tree Algorithm.
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some
decision trees may predict the correct output, while others may not. But together, all the trees predict the
correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier can
predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
o It predicts output with high accuracy, even for the large dataset it runs efficiently.
Random Forest works in two-phase first is to create the random forest by combining N decision tree, and
second is to make predictions for each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to
the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier predicts the final decision. Consider the
below image:
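The same workflow can be sketched with scikit-learn's RandomForestClassifier; the synthetic dataset below merely stands in for the fruit images, which would first have to be converted into feature vectors.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number N of decision trees; each tree sees a bootstrap
# sample and random subsets of features, and the forest takes a majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
print("Predicted class of one new sample:", forest.predict(X_test[:1]))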
There are mainly four sectors where the Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify areas of similar land use with this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest:
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest:
o Although random forest can be used for both classification and regression tasks, it is not more
suitable for Regression tasks.
Once our model is completed, it is necessary to evaluate its performance; either it is a Classification or
Regression model. So for evaluating a Classification model, we have the following ways:
1. Log Loss or Cross-Entropy Loss:
o It is used for evaluating the performance of a classifier whose output is a probability value
between 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
Log Loss = -(y log(p) + (1 - y) log(1 - p))
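A small pure-Python sketch of this formula, with made-up predicted probabilities, is shown below.

from math import log

def log_loss_single(y_true, p):
    # -(y*log(p) + (1 - y)*log(1 - p)) for one example with true label y (0 or 1).
    return -(y_true * log(p) + (1 - y_true) * log(1 - p))

print(round(log_loss_single(1, 0.9), 4))   # confident and correct -> small loss
print(round(log_loss_single(1, 0.2), 4))   # confident but wrong   -> large loss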
2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
o The matrix consists of the prediction results in a summarized form, showing the total number of
correct predictions and incorrect predictions. The matrix looks like the below table:
                       Actual: Positive     Actual: Negative
Predicted: Positive    True Positive        False Positive
Predicted: Negative    False Negative       True Negative
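With scikit-learn the matrix can be produced directly; the actual and predicted labels below are invented for illustration.

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows correspond to the actual classes, columns to the predicted classes.
print(confusion_matrix(y_actual, y_predicted))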
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-ROC
Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis and
FPR(False Positive Rate) on X-axis.
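A short scikit-learn sketch that computes the ROC curve points and the AUC from invented scores is given below.

from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve (FPR vs TPR)
print("FPR:", fpr, "TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))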
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
o Speech Recognition
o Drugs Classification