Regression Analysis

Unit II covers regression and classification in machine learning, focusing on various algorithms such as linear regression, decision trees, and classification methods like support vector machines and Naïve Bayes. It explains the importance of regression analysis, its types, assumptions, and applications across different fields. The document also highlights the advantages and disadvantages of linear regression and decision trees, emphasizing their roles in predictive modeling and decision-making.

Unit II

Regression and classification

Regression: Linear Regression - Simple - Multiple - Decision Tree - Pruning: Introduction - Representation - Algorithm - Issues. Classification: Support Vector Machine - Naïve Bayes - Applications.


 Machine Learning is a branch of Artificial intelligence that focuses on the
development of algorithms and statistical models that can learn from and make
predictions on data.

 Regression is a type of machine-learning algorithm, more specifically a supervised machine-learning algorithm, that learns from labelled datasets and maps the data points to an optimized function, which can then be used for prediction on new datasets.

 First, we should know what a supervised machine learning algorithm is. It is a type of machine learning where the algorithm learns from labelled data. Labelled data means a dataset whose respective target values are already known.

Supervised learning has two types:

 Classification: It predicts the class of the dataset based on the independent input variables. Classes are categorical or discrete values, such as whether the image of an animal shows a cat or a dog.

 Regression: It predicts continuous output variables based on the independent input variables, such as the prediction of house prices from parameters like house age, distance from the main road, location, area, etc.

What is Regression Analysis?


Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables.

It can be utilized to assess the strength of the relationship between variables and for
modeling the future relationship between them.

Regression analysis includes several variations, such as linear, multiple linear, and nonlinear.
The most common models are simple linear and multiple linear. Nonlinear regression
analysis is commonly used for more complicated data sets in which the dependent and
independent variables show a nonlinear relationship.

Regression analysis offers numerous applications in various disciplines, including finance.

What is Linear Regression?

 Linear regression is a type of supervised machine learning algorithm that


computes the linear relationship between the dependent variable and one or
more independent features by fitting a linear equation to observed data.

 When there is only one independent feature, it is known as Simple Linear


Regression, and when there are more than one feature, it is known
as Multiple Linear Regression.

 Similarly, when there is only one dependent variable, it is


considered Univariate Linear Regression, while when there are more than
one dependent variables, it is known as Multivariate Regression.
Linear Regression
 Linear regression fits a linear equation to observed data and is used to predict continuous output variables based on the input features.

 There are different types of linear regression, such as simple linear regression, multiple linear regression, and polynomial regression. The main difference is the number and degree of the features used to fit the linear model.

Why Linear Regression is Important?

 The interpretability of linear regression is a notable strength. The model’s equation


provides clear coefficients that elucidate the impact of each independent variable on
the dependent variable, facilitating a deeper understanding of the underlying
dynamics. Its simplicity is a virtue, as linear regression is transparent, easy to
implement, and serves as a foundational concept for more complex algorithms.

 Linear regression is not merely a predictive tool; it forms the basis for various
advanced models. Techniques like regularization and support vector machines draw
inspiration from linear regression, expanding its utility. Additionally, linear
regression is a cornerstone in assumption testing, enabling researchers to validate
key assumptions about the data.

Linear Model Assumptions

Linear regression analysis is based on six fundamental assumptions:

1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The mean value of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations.
5. The residual (error) values are not correlated across observations.
6. The residual (error) values follow the normal distribution.
Types of Linear Regression

There are two main types of linear regression:

Simple Linear Regression

Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:

Y = a + bX + ϵ

Where:

 Y – Dependent variable
 X – Independent (explanatory) variable
 a – Intercept
 b – Slope
 ϵ – Residual (error)

 The goal of the algorithm is to find the best Fit Line equation that can predict the
values based on the independent variables.

 In regression, a set of records is present with X and Y values, and these values are used to learn a function; if you want to predict Y from an unknown X, this learned function can be used. In regression we have to find the value of Y, so a function is required that predicts a continuous Y given X as the independent feature.
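As an illustration (not part of the original unit material), here is a minimal sketch of fitting a simple linear regression with scikit-learn; the small salary-versus-experience arrays are made-up example data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up example data: years of experience (X) and salary (Y)
X = np.array([[1], [2], [3], [4], [5]])              # independent variable (2-D for sklearn)
Y = np.array([30000, 35000, 41000, 46000, 52000])    # dependent variable

model = LinearRegression()
model.fit(X, Y)

print("Intercept (a):", model.intercept_)    # estimated intercept
print("Slope (b):", model.coef_[0])          # estimated slope
print("Prediction for 6 years:", model.predict([[6]])[0])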

Assumptions of Simple Linear Regression

Linear regression is a powerful tool for understanding and predicting the behavior of a variable; however, it needs to meet a few conditions in order to give accurate and dependable results.
1. Linearity: The independent and dependent variables have a linear relationship
with one another. This implies that changes in the dependent variable follow
those in the independent variable(s) in a linear fashion. This means that there
should be a straight line that can be drawn through the data points. If the
relationship is not linear, then linear regression will not be an accurate model.

2. Independence: The observations in the dataset are independent of each other.


This means that the value of the dependent variable for one observation does not
depend on the value of the dependent variable for another observation. If the
observations are not independent, then linear regression will not be an accurate
model.

3. Homoscedasticity: Across all levels of the independent variable(s), the variance


of the errors is constant. This indicates that the amount of the independent
variable(s) has no impact on the variance of the errors. If the variance of the
residuals is not constant, then linear regression will not be an accurate model.


4. Normality: The residuals should be normally distributed. This means that the
residuals should follow a bell-shaped curve. If the residuals are not normally
distributed, then linear regression will not be an accurate model.

Multiple linear regression


Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:

Y = a + bX1 + cX2 + dX3 + ϵ

Where:

 Y – Dependent variable
 X1, X2, X3 – Independent (explanatory) variables
 a – Intercept
 b, c, d – Slopes
 ϵ – Residual (error)

Multiple linear regression follows the same conditions as the simple linear model. However,
since there are several independent variables in multiple linear analysis, there is another
mandatory condition for the model:

 Non-collinearity: Independent variables should show a minimum correlation with


each other. If the independent variables are highly correlated with each other, it will
be difficult to assess the true relationships between the dependent and independent
variables.
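As a rough sketch (with made-up data, not from the source), the same fitting procedure extends to multiple predictors: each column of X receives its own slope coefficient.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up example data with three independent variables X1, X2, X3
X = np.array([[1, 5, 2],
              [2, 4, 3],
              [3, 6, 1],
              [4, 7, 4],
              [5, 5, 2]])
Y = np.array([12, 15, 18, 24, 25])

model = LinearRegression()
model.fit(X, Y)

print("Intercept (a):", model.intercept_)
print("Slopes (b, c, d):", model.coef_)   # one coefficient per independent variable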
Assumptions of Multiple Linear Regression
For Multiple Linear Regression, all four of the assumptions from Simple Linear Regression
apply. In addition to these, a few more are listed below:
1. No multicollinearity: There is no high correlation between the independent
variables. This indicates that there is little or no correlation between the
independent variables. Multicollinearity occurs when two or more independent
variables are highly correlated with each other, which can make it difficult to
determine the individual effect of each variable on the dependent variable. If
there is multicollinearity, then multiple linear regression will not be an accurate
model.
2. Additivity: The model assumes that the effect of changes in a predictor variable
on the response variable is consistent regardless of the values of the other
variables. This assumption implies that there is no interaction between variables
in their effects on the dependent variable.
3. Feature Selection: In multiple linear regression, it is essential to carefully select
the independent variables that will be included in the model. Including irrelevant
or redundant variables may lead to overfitting and complicate the interpretation
of the model.
4. Overfitting: Overfitting occurs when the model fits the training data too
closely, capturing noise or random fluctuations that do not represent the true
underlying relationship between variables. This can lead to poor generalization
performance on new, unseen data.

Multicollinearity
Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables in a multiple regression model are highly correlated, making it difficult to assess
the individual effects of each variable on the dependent variable.
Two common techniques for detecting multicollinearity are:
 Correlation Matrix: Examining the correlation matrix among the independent
variables is a common way to detect multicollinearity. High correlations (close
to 1 or -1) indicate potential multicollinearity.
 VIF (Variance Inflation Factor): VIF is a measure that quantifies how much
the variance of an estimated regression coefficient increases if your predictors
are correlated. A high VIF (typically above 10) suggests multicollinearity.
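A minimal sketch of computing VIF with statsmodels is given below. The DataFrame df and its values are made-up example data; variance_inflation_factor and add_constant are existing statsmodels helpers.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# df is assumed to hold the independent variables (made-up example values)
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                   'x2': [2, 4, 6, 8, 11],   # strongly correlated with x1
                   'x3': [5, 3, 6, 2, 4]})

X = add_constant(df)  # add an intercept column so VIFs are computed correctly
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)  # a VIF above roughly 10 for a column suggests multicollinearity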

What is the best Fit Line?


 Our primary objective while using linear regression is to locate the best-fit line,
which implies that the error between the predicted and actual values should be kept
to a minimum. There will be the least error in the best-fit line.

 The best Fit Line equation provides a straight line that represents the relationship
between the dependent and independent variables. The slope of the line indicates
how much the dependent variable changes for a unit change in the independent
variable(s).

Linear Regression


Here Y is called a dependent or target variable and X is called an independent
variable also known as the predictor of Y. There are many types of functions or
modules that can be used for regression. A linear function is the simplest type of
function. Here, X may be a single feature or multiple features representing the
problem.

 Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name Linear Regression. In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.
 We utilize the cost function to compute the best values of the weights (the coefficients of the line) in order to get the best-fit line, since different values for the weights result in different regression lines.
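To make the idea of a cost function concrete, here is a small sketch (with made-up experience/salary data) that computes the Mean Squared Error cost for a candidate line and compares it with the least-squares line found by np.polyfit; a lower cost means a better fit.

import numpy as np

# Made-up (X, Y) records: work experience vs. salary (in thousands)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([30, 35, 41, 46, 52])

def mse_cost(slope, intercept):
    """Mean Squared Error between the line's predictions and the actual values."""
    predictions = slope * X + intercept
    return np.mean((Y - predictions) ** 2)

print("Cost of a poor guess (slope=1, intercept=30):", mse_cost(1, 30))

# The least-squares best-fit line minimizes this cost
best_slope, best_intercept = np.polyfit(X, Y, 1)
print("Best-fit slope:", best_slope, "intercept:", best_intercept)
print("Cost of the best-fit line:", mse_cost(best_slope, best_intercept))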
Applications of Linear Regression

 Linear regression is used in many different fields, including finance, economics, and
psychology, to understand and predict the behavior of a particular variable.

 For example, in finance, linear regression might be used to understand the


relationship between a company’s stock price and its earnings or to predict the
future value of a currency based on its past performance.

Advantages of Linear Regression


 Linear regression is a relatively simple algorithm, making it easy to understand
and implement. The coefficients of the linear regression model can be
interpreted as the change in the dependent variable for a one-unit change in the
independent variable, providing insights into the relationships between
variables.

 Linear regression is computationally efficient and can handle large datasets


effectively. It can be trained quickly on large datasets, making it suitable for
real-time applications.
 Linear regression is relatively robust to outliers compared to other machine
learning algorithms. Outliers may have a smaller impact on the overall model
performance.

 Linear regression often serves as a good baseline model for comparison with
more complex machine learning algorithms.

 Linear regression is a well-established algorithm with a rich history and is


widely available in various machine learning libraries and software packages.

Disadvantages of Linear Regression


 Linear regression assumes a linear relationship between the dependent and
independent variables. If the relationship is not linear, the model may not
perform well.

 Linear regression is sensitive to multicollinearity, which occurs when there is a


high correlation between independent variables. Multicollinearity can inflate the
variance of the coefficients and lead to unstable model predictions.

 Linear regression assumes that the features are already in a suitable form for the
model. Feature engineering may be required to transform features into a format
that can be effectively used by the model.

 Linear regression is susceptible to both overfitting and underfitting. Overfitting


occurs when the model learns the training data too well and fails to generalize to
unseen data. Underfitting occurs when the model is too simple to capture the
underlying relationships in the data.

 Linear regression provides limited explanatory power for complex relationships


between variables. More advanced machine learning techniques may be
necessary for deeper insights.
Decision tree

 Decision trees are a popular and powerful tool used in various fields
such as machine learning, data mining, and statistics. They provide a
clear and intuitive way to make decisions based on data by modeling the
relationships between different variables.

 A decision tree is a flowchart-like structure used to make decisions or


predictions. It consists of nodes representing decisions or tests on
attributes, branches representing the outcome of these decisions, and leaf
nodes representing final outcomes or predictions.

 Each internal node corresponds to a test on an attribute, each branch


corresponds to the result of the test, and each leaf node corresponds to a
class label or a continuous value.

Structure of a Decision Tree

1. Root Node: Represents the entire dataset and the initial decision to be made.
2. Internal Nodes: Represent decisions or tests on attributes. Each internal node
has one or more branches.
3. Branches: Represent the outcome of a decision or test, leading to another node.
4. Leaf Nodes: Represent the final decision or prediction. No further splits occur at
these nodes.
How Decision Trees Work?

The process of creating a decision tree involves:


1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or
information gain, the best attribute to split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the selected
attribute.
3. Repeating the Process: The process is repeated recursively for each subset,
creating a new internal node or leaf node until a stopping criterion is met (e.g.,
all instances in a node belong to the same class or a predefined depth is
reached).

Metrics for Splitting


 Gini Impurity: Measures the likelihood of an incorrect classification of a new
instance if it was randomly classified according to the distribution of classes in
the dataset.
 Gini = 1 − Σ (pi)², where pi is the probability of an instance being classified
into a particular class.

 Entropy: Measures the amount of uncertainty or impurity in the dataset.
 Entropy = − Σ pi log2(pi), where pi is the probability of an instance being
classified into a particular class.

 Information Gain: Measures the reduction in entropy or Gini impurity after a
dataset is split on an attribute.
 Information Gain = Entropy(parent) − Σ (|Di| / |D|) × Entropy(Di), where Di is
the subset of D obtained after splitting by the attribute.
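A small illustrative sketch (not from the source) that computes Gini impurity, entropy, and information gain for a toy split, following the formulas above; the label lists are made-up.

import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropy of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Toy example: 6 'yes' and 4 'no' labels, split by an attribute into two subsets
parent = ['yes'] * 6 + ['no'] * 4
left = ['yes'] * 5 + ['no'] * 1
right = ['yes'] * 1 + ['no'] * 3

print("Gini(parent):", gini(parent))                          # 1 - (0.6^2 + 0.4^2) = 0.48
print("Entropy(parent):", entropy(parent))                    # about 0.971
print("Information gain of the split:", information_gain(parent, [left, right]))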

Applications of Decision Trees

 Business Decision Making: Used in strategic planning and resource allocation.


 Healthcare: Assists in diagnosing diseases and suggesting treatment plans.
 Finance: Helps in credit scoring and risk assessment.
 Marketing: Used to segment customers and predict customer behavior.

Advantages of Decision Trees

 Simplicity and Interpretability: Decision trees are easy to understand and


interpret. The visual representation closely mirrors human decision-making
processes.
 Versatility: Can be used for both classification and regression tasks.
 No Need for Feature Scaling: Decision trees do not require normalization or
scaling of the data.
 Handles Non-linear Relationships: Capable of capturing non-linear
relationships between features and target variables.

Disadvantages of Decision Trees


 Overfitting: Decision trees can easily overfit the training data, especially if they
are deep with many nodes.
 Instability: Small variations in the data can result in a completely different tree
being generated.
 Bias towards Features with More Levels: Features with more levels can
dominate the tree structure.
Pruning
Introduction:
To overcome overfitting, pruning techniques are used. Pruning reduces the size of the tree
by removing nodes that provide little power in classifying instances. There are two main
types of pruning:
 Pre-pruning (Early Stopping): Stops the tree from growing once it meets
certain criteria (e.g., maximum depth, minimum number of samples per leaf).
 Post-pruning: Removes branches from a fully grown tree that do not provide
significant power.

Pruning decision trees


Decision tree pruning is a critical technique in machine learning used to optimize decision
tree models by reducing overfitting and improving generalization to new data. In this guide,
we’ll explore the importance of decision tree pruning, its types, implementation, and its
significance in machine learning model optimization.
What is Decision Tree Pruning?

Decision tree pruning is a technique used to prevent decision trees from overfitting the
training data. Pruning aims to simplify the decision tree by removing parts of it that do not
provide significant predictive power, thus improving its ability to generalize to new data.
Decision Tree Pruning removes unwanted nodes from the overfitted decision tree to make it smaller in size, which results in faster, more accurate, and more effective predictions.
Types Of Decision Tree Pruning
There are two main types of decision tree pruning: Pre-Pruning and Post-Pruning.

Pre-Pruning (Early Stopping)

Sometimes, the growth of the decision tree can be stopped before it gets too complex; this is called pre-pruning. It is important for preventing overfitting of the training data, which would otherwise result in poor performance when the model is exposed to new data.
Some common pre-pruning techniques include:
 Maximum Depth: It limits the maximum level of depth in a decision tree.
 Minimum Samples per Leaf: Set a minimum threshold for the number of
samples in each leaf node.
 Minimum Samples per Split: Specify the minimal number of samples needed
to break up a node.
 Maximum Features: Restrict the quantity of features considered for splitting.
By pruning early, we end up with a simpler tree that is less likely to overfit the training data.
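As a rough sketch, these pre-pruning limits map directly onto constructor parameters of scikit-learn's DecisionTreeClassifier; the specific limit values below are illustrative choices, and the data loading follows the breast-cancer example used later in this unit.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: limit depth, leaf size, split size, and features before the tree grows
pre_pruned = DecisionTreeClassifier(max_depth=4,           # Maximum Depth
                                    min_samples_leaf=5,    # Minimum Samples per Leaf
                                    min_samples_split=10,  # Minimum Samples per Split
                                    max_features='sqrt',   # Maximum Features
                                    random_state=42)
pre_pruned.fit(X_train, y_train)
print("Test accuracy of the pre-pruned tree:", pre_pruned.score(X_test, y_test))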

Post-Pruning (Reducing Nodes)

After the tree is fully grown, post-pruning involves removing branches or nodes to improve
the model’s ability to generalize.
Some common post-pruning techniques include:

 Cost-Complexity Pruning (CCP): This method assigns a cost to each subtree based on its accuracy and complexity, then selects the subtree with the lowest cost.
 Reduced Error Pruning: Removes branches that do not significantly affect the
overall accuracy.
 Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini
impurity or entropy) is beneath a certain threshold.
 Minimum Leaf Size: Removes leaf nodes with fewer samples than a specified
threshold.

Post-pruning simplifies the tree while preserving its accuracy. Decision tree pruning helps
to improve the performance and interpretability of decision trees by reducing their
complexity and avoiding overfitting. Proper pruning can lead to simpler and more robust
models that generalize better to unseen data.

Decision Tree Implementation in Python

Here we are going to create a decision tree using preloaded dataset breast_cancer in
sklearn library.
The baseline Decision Tree model below uses the default approach of scikit-learn's
DecisionTreeClassifier, which employs the Gini impurity criterion for
making splits. This is evident from the parameter criterion="gini" passed to
the DecisionTreeClassifier() constructor. Gini impurity is a measure of how often a
randomly chosen element from the set would be incorrectly labeled if it were randomly
labeled according to the distribution of labels in the set.
Python3
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Load breast cancer dataset


X, y = load_breast_cancer(return_X_y=True)

# Separating Training and Testing data


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)

# Train decision tree model


model = DecisionTreeClassifier(criterion="gini")
model.fit(X_train, y_train)

# Plot original tree


plt.figure(figsize=(15, 10))
plot_tree(model, filled=True)
plt.title("Original Decision Tree")
plt.show()
# Model Accuracy before pruning
accuracy_before_pruning = model.score(X_test, y_test)
print("Accuracy before pruning:", accuracy_before_pruning)

Output:
Accuracy before pruning: 0.8793859649122807

Decision Tree Pre-Pruning Implementation


In this implementation, the pruning technique is hyperparameter tuning through cross-
validation using GridSearchCV. Hyperparameter tuning involves searching for the optimal
hyperparameters for a machine learning model to improve its performance. It does not
directly prune the decision tree, but it helps in finding the best combination of
hyperparameters, such as max_depth, max_features, criterion, and splitter, which indirectly
controls the complexity of the decision tree and prevents overfitting. Because these limits
are applied while the tree is grown, this acts as a form of pre-pruning.
Python3
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid ('auto' is dropped from max_features because it is no
# longer supported in recent scikit-learn versions)
parameter = {
    'criterion': ['entropy', 'gini', 'log_loss'],
    'splitter': ['best', 'random'],
    'max_depth': [1, 2, 3, 4, 5],
    'max_features': ['sqrt', 'log2']
}
model = DecisionTreeClassifier()
cv = GridSearchCV(model, param_grid=parameter, cv=5)
cv.fit(X_train, y_train)
Visualizing
Python3
from sklearn.datasets import load_breast_cancer
from sklearn.tree import export_graphviz
import graphviz

best_estimator = cv.best_estimator_
# Feature names come from the dataset object (X, y were loaded with return_X_y=True)
feature_names = load_breast_cancer().feature_names

dot_data = export_graphviz(best_estimator, out_file=None, filled=True, rounded=True,
                           feature_names=feature_names,
                           class_names=['malignant', 'benign'])
graph = graphviz.Source(dot_data)
graph.render("decision_tree", format='png', cleanup=True)
graph
Output:

Best Parameters
Python3
cv.score(X_test, y_test)
cv.best_params_
Output:
0.9736842105263158
{'criterion': 'gini',
'max_depth': 4,
'max_features': 'sqrt',
'splitter': 'best'}
Decision Tree Post-Pruning Implementation
Python3
# Cost-complexity pruning (Post-pruning)
path = model.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Train a series of decision trees with different alpha values
pruned_models = []
for ccp_alpha in ccp_alphas:
    pruned_model = DecisionTreeClassifier(criterion="gini", ccp_alpha=ccp_alpha)
    pruned_model.fit(X_train, y_train)
    pruned_models.append(pruned_model)

# Find the model with the best accuracy on test data
best_accuracy = 0
best_pruned_model = None
for pruned_model in pruned_models:
    accuracy = pruned_model.score(X_test, y_test)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_pruned_model = pruned_model

# Model Accuracy after pruning
accuracy_after_pruning = best_pruned_model.score(X_test, y_test)
print("Accuracy after pruning:", accuracy_after_pruning)
Output:
Accuracy after pruning: 0.918859649122807
Python3
# Plot pruned tree
plt.figure(figsize=(15, 10))
plot_tree(best_pruned_model, filled=True)
plt.title("Pruned Decision Tree")
plt.show()
Output:

Why Pruning decision trees is Important?


Decision Tree Pruning has an important role in optimizing the decision tree model. It
involves the removal of certain parts of the tree which can potentially reduce its
performance. Here is why decision tree pruning is important:
 Prevents Overfitting: Decision trees are prone to overfitting, where the model
memorizes the training data rather than learning generalizable patterns. Pruning
helps prevent overfitting by simplifying the tree structure, removing branches
that capture noise or outliers in the training data.
 Improves Generalization: By reducing the complexity of the decision tree,
pruning enhances the model’s ability to generalize to unseen data. A pruned
decision tree is more likely to capture underlying patterns in the data rather than
memorizing specific instances, leading to better performance on new data.
 Reduces Model Complexity: Pruning results in a simpler decision tree with
fewer branches and nodes. This simplicity not only makes the model easier to
interpret but also reduces computational requirements during both training and
inference. A simpler model is also less prone to overfitting and more robust to
changes in the data.
 Enhances Interpretability: Pruning produces decision trees with fewer
branches and nodes, which are easier to interpret and understand. This is
particularly important in applications where human insight into the decision-
making process is valuable, such as in medical diagnosis or financial decision-
making.
 Speeds Up Training and Inference: Pruned decision trees require less
computational resources during both training and inference phases. With fewer
branches and nodes, the decision-making process becomes more efficient,
resulting in faster predictions without sacrificing accuracy.
 Facilitates Model Maintenance: Pruning helps maintain decision tree models
over time by keeping them lean and relevant. As new data becomes available or
the problem domain evolves, pruned decision trees are easier to update and
adapt compared to overly complex, unpruned trees.

Practical issues in learning decision trees include

 Determining how deeply to grow the decision tree,
 Handling continuous attributes,
 Choosing an appropriate attribute selection measure,
 Handling training data with missing attribute values,
 Handling attributes with differing costs, and
 Improving computational efficiency.

Classification

As we know, the Supervised Machine Learning algorithm can be broadly classified into
Regression and Classification Algorithms. In Regression algorithms, we have predicted the
output for continuous values, but to predict the categorical values, we need Classification
algorithms.

What is the Classification Algorithm?

 The Classification algorithm is a Supervised Learning technique that is used to


identify the category of new observations on the basis of training data.
 In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with the
corresponding output.

In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):

y = f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are
similar to each other and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a classifier.


There are two types of Classifications:

1. Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
2. Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction.

Example: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:

Classification algorithms can be mainly divided into two categories:

o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification

Once our model is completed, it is necessary to evaluate its performance, whether it is a
Classification or Regression model. For evaluating a Classification model, we have the
following ways:

1. Log Loss or Cross-Entropy Loss:


o It is used for evaluating the performance of a classifier, whose output is a probability
value between the 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For Binary classification, cross-entropy can be calculated as:

Log Loss = −(y log(p) + (1 − y) log(1 − p))

Where y = actual output, p = predicted probability.
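A minimal sketch (with made-up predictions) of computing binary cross-entropy, both by hand with the formula above and with scikit-learn's log_loss:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])              # actual outputs
p_pred = np.array([0.9, 0.2, 0.8, 0.6, 0.1])    # predicted probabilities of class 1

# Manual binary cross-entropy, averaged over the samples
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print("Manual log loss:", manual)
print("sklearn log loss:", log_loss(y_true, p_pred))  # should match the manual value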

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the


performance of the model.
o It is also known as the error matrix.
o The matrix consists of the prediction results in a summarized form, showing the total
number of correct predictions and incorrect predictions. The matrix looks like the table
below:

                        Actual Positive     Actual Negative

Predicted Positive      True Positive       False Positive

Predicted Negative      False Negative      True Negative
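From the four cells of the confusion matrix, common metrics such as accuracy, precision, and recall can be derived; a small sketch with illustrative made-up counts:

# Made-up confusion-matrix counts
TP, FP, FN, TN = 66, 8, 2, 24

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                   # how many predicted positives are truly positive
recall = TP / (TP + FN)                      # how many actual positives are found

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)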

3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-
ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on the Y-axis and FPR (False Positive Rate) is on the X-axis.
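A brief sketch (with made-up labels and scores) of plotting the ROC curve and computing AUC using scikit-learn's roc_curve and roc_auc_score:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # actual labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])    # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)  # FPR and TPR at each threshold
auc = roc_auc_score(y_true, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle='--')   # chance-level diagonal for reference
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()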

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases
of Classification Algorithms:

o Email Spam Detection


o Speech Recognition
o Identification of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cat and dog. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-


dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset, which means that if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two classes.
But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point of the
lines from both the classes. These points are called support vectors. The distance between the
vectors and the hyperplane is called as margin. And the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal hyperplane.

Non-Linear SVM:

If data is linearly separable, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space becomes as in the below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-y plane. If we convert it back to 2-D space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
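A small illustrative sketch (with made-up ring-shaped data, not the user_data example below) of the same idea: data that no straight line can separate in 2-D becomes separable once the extra dimension z = x² + y² is added, and a non-linear kernel such as RBF achieves a similar effect implicitly.

import numpy as np
from sklearn.svm import SVC

# Made-up ring-shaped data: class 1 lies inside the unit circle, class 0 outside it
rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # not linearly separable in 2-D

# Explicitly adding the third dimension z = x^2 + y^2 makes it linearly separable
Z = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]
print("Linear SVM on (x, y, z):", SVC(kernel='linear').fit(Z, y).score(Z, y))

# An RBF kernel performs a comparable non-linear mapping implicitly
print("RBF SVM on (x, y):", SVC(kernel='rbf').fit(X, y).score(X, y))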

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set = pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give the dataset
as:

The scaled output for the test set will be:


Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we
will import SVC class from Sklearn.svm library. Below is the code for it:

from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data. However, we can change it for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).

Output:

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value of C(Regularization factor),
gamma, and kernel.
o Predicting the test set result:
Now, we will predict the output for test set. For this, we will create a new vector
y_pred. Below is the code for it:

#Predicting the test set result
y_pred = classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test to check the
difference between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

o Creating the confusion matrix:


Now we will see how many incorrect predictions the SVM classifier makes as compared to the Logistic regression classifier. To create the confusion matrix, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it using a new variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred (the predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:
As we can see in the above output image, there are 66+24 = 90 correct predictions and 8+2 = 10 incorrect predictions. Therefore we can say that our SVM model improved as compared to the Logistic regression model.

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:


As we can see, the above output appears similar to the Logistic regression output. In the output, we get a straight line as the hyperplane because we have used a linear kernel in the classifier. And as discussed above, for 2-D space the hyperplane in SVM is a straight line.

o Visualizing the test set result:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased or Not purchased). Users who purchased the SUV are in the red region with the red scatter points, and users who did not purchase the SUV are in the green region with green scatter points. The hyperplane has divided the two classes into the Purchased and Not purchased variable.

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, helping to build fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be
described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.
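A tiny sketch of Bayes' theorem as a function, applied to made-up numbers (these values are illustrative and unrelated to the weather example that follows):

def bayes(prior_A, likelihood_B_given_A, marginal_B):
    """Posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_B_given_A * prior_A / marginal_B

# Made-up example: P(A) = 0.4, P(B|A) = 0.5, P(B) = 0.25
print("P(A|B) =", bayes(0.4, 0.5, 0.25))   # 0.8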

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below
steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:


Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather     Yes   No

Overcast     5     0

Rainy        2     2

Sunny        3     2

Total       10     4

Likelihood table of weather conditions:

Weather      No             Yes

Overcast     0              5              5/14 = 0.35

Rainy        2              2              4/14 = 0.29

Sunny        2              3              5/14 = 0.35

All          4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 0.35

P(Yes) = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 0.29

P(Sunny) = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)

Hence on a Sunny day, Player can play the game.
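The hand calculation above can be checked with a short sketch that recomputes the same conditional probabilities directly from the 14-row Outlook/Play table:

import pandas as pd

# The weather dataset from the table above
outlook = ['Rainy', 'Sunny', 'Overcast', 'Overcast', 'Sunny', 'Rainy', 'Sunny',
           'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Rainy', 'Overcast', 'Overcast']
play = ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes',
        'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes']
df = pd.DataFrame({'Outlook': outlook, 'Play': play})

p_yes = (df['Play'] == 'Yes').mean()          # P(Yes) = 10/14
p_sunny = (df['Outlook'] == 'Sunny').mean()   # P(Sunny) = 5/14
p_sunny_given_yes = (((df['Outlook'] == 'Sunny') & (df['Play'] == 'Yes')).sum()
                     / (df['Play'] == 'Yes').sum())   # P(Sunny|Yes) = 3/10

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print("P(Yes|Sunny) =", round(p_yes_given_sunny, 2))  # about 0.6, matching the hand calculation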

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fastest and easiest ML algorithms to predict the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.

Python Implementation of the Naïve Bayes algorithm:

Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can
easily compare the Naive Bayes model with the other models.

Steps to implement:

o Data Pre-processing step


o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:

In this step, we will pre-process/prepare the data so that we can use it efficiently in our code.
It is similar as we did in data-pre-processing. The code for this is given below:

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using dataset = pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then we have scaled the feature variables.

The output for the dataset is given as:


2) Fitting Naive Bayes to the Training Set:

After the pre-processing step, now we will fit the Naive Bayes model to the Training set.
Below is the code for it:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to the training dataset.
We can also use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)

3) Prediction of the test set result:

Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:

The above output shows the result for the prediction vector y_pred and the real vector y_test. We can see that some predictions differ from the real values; these are the incorrect predictions.

4) Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions,
and 65+25=90 correct predictions.

5) Visualizing the training set result:

Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code
for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

In the above output we can see that the Naïve Bayes classifier has segregated the data points with a fine boundary. The boundary is a Gaussian curve because we have used the GaussianNB classifier in our code.

6) Visualizing the Test set result:

# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above output is the final output for the test set data. As we can see, the classifier has created a Gaussian curve to divide the "purchased" and "not purchased" variables. There are some wrong predictions, which we have counted in the confusion matrix, but it is still a pretty good classifier.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
