FPA Unit 2

Classification is the process of organizing data into groups based on shared characteristics, commonly used in machine learning to categorize data into predefined labels. The document discusses various classification models, including binary, multi-class, and multi-label classification, along with popular algorithms like K-Nearest Neighbours and Naïve Bayes. It also explains the workings of these algorithms, their advantages and disadvantages, and how to choose parameters like the K value in K-NN.

Uploaded by

Ghazi

2. CLASSIFICATION
Classification is the process of organizing or sorting items, data, or ideas into specific groups based on
shared characteristics. By classifying things, we make it easier to analyse, understand, and make decisions
based on structured data. In machine learning, classification is a supervised technique that assigns data to
predefined labels. In short, classification is a form of “pattern recognition”: a classification algorithm
learns patterns (similar words or sentiments, number sequences, etc.) from the training data and then looks
for the same patterns in future sets of data.

For instance, an email spam filter classifies incoming emails as either "Spam" or "Not Spam" based on their
content. Similarly, image recognition models can classify pictures as "Dog," "Cat," or "Other" based on
patterns detected in the images. Classification helps automate decision-making processes and improves
efficiency in various applications.

There are two steps in the construction of a classification model.

1. Training the Model – The model learns by studying past data with known categories. It looks for
patterns and rules using algorithms like Decision Trees or Neural Networks. This helps it understand
how to classify new data.

2. Testing and Evaluation – The model is tested on new data to check how well it works. We measure
its accuracy using different methods like precision and recall. If it's not performing well, we make
improvements before using it.
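
The evaluation step can be illustrated with a minimal sketch. The labels and predictions below are hypothetical, and the helper name `evaluate` is an assumption made for this example; it computes accuracy, precision, and recall for one class treated as "positive".

```python
def evaluate(y_true, y_pred, positive="Spam"):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical held-out labels and the model's predictions for them.
y_true = ["Spam", "Spam", "Not Spam", "Not Spam", "Spam", "Not Spam"]
y_pred = ["Spam", "Not Spam", "Not Spam", "Spam", "Spam", "Not Spam"]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

Precision asks how many of the predicted "Spam" emails really were spam, while recall asks how many of the real spam emails were caught.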

Types of Classification Models:

1. Binary Classification – This type of classifier sorts data into only two categories. Examples include
classifying emails as "Spam" or "Not Spam" or detecting whether a patient has a disease ("Yes" or
"No"). It is the simplest form of classification. Algorithms like Logistic Regression and Support
Vector Machines are commonly used. The model predicts one of the two possible outcomes for each
input.

2. Multi-Class Classification – This classifier sorts data into more than two categories, where each
input belongs to only one class. Examples include classifying images as "Dog," "Cat," or "Horse" or
sorting news articles into "Sports," "Politics," or "Technology." It is used when there are multiple but
distinct categories. Algorithms like Decision Trees and Neural Networks handle such tasks. The
model chooses one category out of many for each input.

3. Multi-Label Classification – This type allows one input to belong to multiple categories at the same
time. Examples include tagging a movie as both "Action" and "Thriller" or marking an email as
"Important" and "Work-related." Unlike multi-class classification, an item can have more than one
label. Algorithms like Neural Networks and Random Forests are often used. The model predicts
multiple relevant labels for each input.
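
The difference between the model types shows up in how a prediction is read off the model's scores. The sketch below assumes a model that outputs one score per label; the scores, the 0.5 threshold, and the function names are hypothetical.

```python
def predict_multiclass(scores):
    """Multi-class (and binary): return the single label with the highest score."""
    return max(scores, key=scores.get)


def predict_multilabel(scores, threshold=0.5):
    """Multi-label: return every label whose score reaches the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)


# Hypothetical per-label scores produced by some trained model.
animal_scores = {"Dog": 0.7, "Cat": 0.2, "Horse": 0.1}
genre_scores = {"Action": 0.8, "Thriller": 0.6, "Comedy": 0.1}

animal = predict_multiclass(animal_scores)   # exactly one label
genres = predict_multilabel(genre_scores)    # zero, one, or several labels
print(animal, genres)
```

Binary classification is the special case of `predict_multiclass` with only two labels.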

Popular Classification algorithms:

• K-Nearest Neighbours
• Naïve Bayes Classifier
• Support Vector Machine
• Decision trees

K-Nearest Neighbours
K-Nearest Neighbours (K-NN) is a simple and widely used Machine Learning algorithm based on the
Supervised Learning technique. It works by storing all available data and classifying new data points based
on their similarity to existing categories. Unlike other algorithms that learn patterns during training, K-NN is
a "lazy learner," meaning it does not build a model in advance but instead memorizes the dataset and
performs calculations only when needed.

The accuracy of the classification depends on the value of "K," which determines how many neighbours are
considered for comparison.

K-NN can be used for both classification and regression problems. However, in practice it is more
widely used for classification.

For example, suppose we have an image of an animal that looks like both a cat and a dog. The K-NN
algorithm will compare its features, such as ear shape, fur colour, and body structure, with stored images of
cats and dogs. Based on the most similar category, it will classify the image as either a cat or a dog.

Advantages of K-NN Algorithm:

• It is simple to understand and easy to implement.

• It works well with small datasets, where comparing against every stored point is cheap.
• With a sufficiently large K, it is fairly robust to noisy data, because a few mislabelled training points are outvoted.

• Its accuracy tends to improve as the training data grows and covers the feature space more evenly.

Disadvantages of K-NN Algorithm:


• Choosing the right value of K can be difficult and affect accuracy.

• It is slow for large datasets because it compares each new data point with all stored data.
• It keeps the entire training set in memory instead of building a compact model, making it memory-intensive.

• It may not work well if irrelevant or too many features are present in the dataset.

How Does K-NN Work?

The K-Nearest Neighbours (K-NN) algorithm works by comparing new data points with existing data and
classifying them based on similarity. It follows a step-by-step approach to determine the category of a given
data point.
1. Choose the Number of Neighbours (K):
Start by deciding how many neighbours (data points from your dataset) you want to consider when
making predictions. This is your ‘K’ value. A small K may result in noisy predictions, while a large
K can smooth out the classification.
2. Calculate the Distance:
Measure the distance between the new data point and all other data points in the dataset. The most
commonly used method is Euclidean distance, which calculates the straight-line distance between
two points.
3. Identify the Nearest Neighbours:
Select the K data points that have the shortest distance to the new data point. These are considered
the closest neighbours.

4. Count the Categories of the Neighbours:
Among the K neighbours, count how many belong to each category. For example, if K=5 and three
neighbours belong to Category A while two belong to Category B, Category A holds the majority.
5. Assign the New Data Point to the Majority Category:
The new data point is classified into the category that appears most frequently among its K
neighbours. If most of the neighbours belong to Category A, the new point is also assigned to
Category A.
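
The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name `knn_classify` and the 2-D points are hypothetical, and ties between categories are broken by whichever category was seen first.

```python
import math
from collections import Counter


def knn_classify(train, query, k=5):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs with numeric feature tuples.
    """
    # Step 2: Euclidean distance from the query to every stored point.
    distances = [(math.dist(features, query), label) for features, label in train]
    # Step 3: keep the k closest points.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 4-5: count the labels of the neighbours and return the most common.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points for two categories.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"), ((2.5, 2.5), "A"),
    ((5.0, 5.0), "B"), ((3.5, 3.0), "B"), ((3.0, 3.5), "B"), ((6.0, 5.5), "B"),
]
label = knn_classify(train, query=(2.5, 2.0), k=5)
print(label)  # three of the five nearest points are "A", two are "B"
```

Note that nothing happens at "training" time: all the work is done inside `knn_classify` when a query arrives, which is why K-NN is called a lazy learner.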

Example:
Problem Statement:

We have a dataset with two categories: Category A (Red Circles) and Category B (Blue Squares). A new data
point (Green Triangle) appears, and we need to classify it into Category A or Category B using K-NN with K
= 5 neighbours.

Solution:

We start by selecting the number of neighbours, choosing K = 5, meaning the new data point will be
compared with the five closest points in the dataset.

Next, we calculate the Euclidean distance between the new data point and every existing point. For two
points (x1, y1) and (x2, y2), the formula is:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

After computing the distances, we identify the five nearest neighbours from the dataset, which are the
closest points to the new data.

Among these five neighbours, we
count how many belong to Category A and how many belong to Category B. Suppose three of the nearest
neighbours belong to Category A, and two belong to Category B.

Since the majority of the closest points are from Category A, the new data point is assigned to Category A.

How to choose K value?

• There is no specific method to determine the best value for K in the K-NN algorithm. It is usually
selected by trying different values and observing which one provides the most accurate results. K = 5 is
a common starting point, but the best value depends on the dataset; for two classes, an odd K avoids tied votes.
• A very low K value, such as K = 1 or K = 2, can make the model highly sensitive to noise and
outliers.
• A large K value helps smooth out the decision boundary and reduces the impact of outliers. However,
if K is too large, the model may struggle to capture small patterns in the data, leading to an overly
generalized classification that reduces accuracy.

• The choice of K affects predictions. If K = 3, the majority class among three neighbours decides the
classification, but if K = 7, the result may change due to a different majority. Picking the right K is
important for accuracy.
• K-NN is a lazy learning algorithm, meaning it does no computation during training. Instead, it stores
all training data and calculates distances only when a new data point needs classification. A good K
value helps balance performance and accuracy.
• Cross-validation helps find the best K by testing different values on parts of the dataset. This ensures
the chosen K provides the most accurate results while avoiding mistakes like overfitting or
underfitting.
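
One simple form of cross-validation is leave-one-out: each point is predicted from all the others, and the fraction of correct predictions is recorded for each candidate K. The sketch below assumes a small hypothetical dataset with two clusters and one deliberately mislabelled point; the names `knn_classify` and `loo_accuracy` are illustrative.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the other points."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_classify(rest, features, k) == label
    return hits / len(data)


# Hypothetical data: two clusters plus one noisy "B" point inside cluster A.
data = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((2.0, 2.0), "A"), ((1.5, 1.5), "A"),
    ((8.0, 8.0), "B"), ((8.0, 9.0), "B"), ((9.0, 8.0), "B"),
    ((9.0, 9.0), "B"), ((8.5, 8.5), "B"),
    ((1.2, 1.2), "B"),  # noise
]
scores = {k: loo_accuracy(data, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)  # first K with the highest accuracy
print(scores, best_k)
```

On this data K = 1 is hurt by the noisy point, while K = 9 is so large that the other cluster starts to outvote the local neighbours, which matches the trade-off described above.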

Naïve Bayes Classification


Naïve Bayes classification is a machine learning algorithm based on Bayes' theorem, which gives a
mathematical rule for inverting conditional probabilities, allowing one to find the probability of a cause
given its effect. It is called "naïve" because it assumes that all features of a data point are independent of
each other given the class, which simplifies calculations. Naïve Bayes is also known as simple Bayes or
independence Bayes.

A simple example of Naïve Bayes classification is movie genre prediction. Suppose we have a system that
predicts a movie's genre based on keywords in its description. If a movie description contains words like
"spaceship," "alien," and "planet," the model calculates the probability that the movie belongs to the sci-fi
genre. Using probabilities estimated from past data, it assigns each new movie to its most probable genre.

Bayes' theorem is a fundamental concept in probability theory that describes the probability of an event,
based on prior knowledge of conditions that might be related to the event. Bayes' theorem is also known as
the “Probability of Causes”.

Bayes' Theorem Formula:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where:

• P(A|B) = Probability of event A occurring given that event B has occurred (Posterior Probability).

• P(B|A) = Probability of event B occurring given that event A has occurred (Likelihood).

• P(A) = Probability of event A occurring (Prior Probability).

• P(B) = Probability of event B occurring (Marginal Probability).

Advantages of Naïve Bayes Classifier

• Naïve Bayes is a simple and fast algorithm, making it easy to implement for classification tasks.

• It works well for both binary and multi-class classification, handling multiple categories efficiently.

• Despite its simplicity, it is often competitive with more complex models on multi-class problems.

• It is highly effective for text classification tasks, such as spam filtering and sentiment analysis.

• Its parameters can be estimated from a relatively small amount of training data, and both training and
prediction are computationally cheap, which makes it useful for real-time predictions.

Disadvantages of Naïve Bayes Classifier

• The assumption of independence between features can be unrealistic, leading to incorrect
predictions.

• It cannot capture relationships between variables, making it less effective when features are
correlated.
• If a feature value never appears with a class in the training data, the model assigns zero probability to that
class whenever the value occurs (the zero-frequency problem). Smoothing, such as Laplace (add-one)
smoothing, is commonly used to avoid this.
• Naïve Bayes does not perform well with continuous data unless assumptions about distribution (e.g.,
Gaussian) are made.
• It is sensitive to imbalanced datasets, meaning it may favour majority classes and give biased
predictions.

How to Identify When Naïve Bayes is Applicable


• When features are independent: Naïve Bayes works well when the features in the dataset are mostly
independent, as it assumes no correlation between them. If feature dependence is minimal, the model
performs efficiently.

• When working with text classification: It is highly effective for text-based tasks like spam detection,
sentiment analysis, and document classification, where word occurrences are treated as independent
features.
• When fast and simple classification is needed: Naïve Bayes is a great choice when you need a quick,
computationally efficient model that works well with small datasets and can make real-time
predictions.

How Does Naïve Bayes Classification Work?

Naïve Bayes classification works by applying Bayes' Theorem to predict the category of a given data point
based on probabilities. The classification process is divided into two main phases: Training Phase and
Prediction Phase.

Training Phase:
1. Calculate Prior Probabilities: The algorithm first determines the probability of each class occurring
in the dataset. This is called the prior probability.
2. Calculate Likelihood: It then calculates how likely each feature appears in a given class. This is
known as the likelihood. If the features are continuous, the likelihood is modelled using distributions
such as Gaussian (normal distribution).

3. Assume Feature Independence: Since the Naïve Bayes classifier assumes that features are
independent, it calculates the probability of each feature separately and then multiplies them.

Prediction Phase:
4. Calculate Posterior Probability: When a new data point is given, the algorithm uses Bayes' theorem
to compute the posterior probability for each class. This is the probability of a class occurring, given
the observed features.

5. Assign to the Most Probable Class: The model then assigns the new data point to the class with the
highest posterior probability, meaning the category where the data point is most likely to belong.
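
The training and prediction phases can be sketched for the keyword-based movie genre example. This is a minimal illustration under stated assumptions: the four training descriptions are hypothetical, the function names are invented for this sketch, log-probabilities are summed instead of multiplying raw probabilities, and add-one (Laplace) smoothing is included so that an unseen word does not force a zero probability.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (words, label). Returns (priors, word_counts, vocab)."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)          # likelihood counts per class
    vocab = {w for words, _ in docs for w in words}
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab


def predict_nb(model, words):
    """Return the class with the highest (log) posterior for the given words."""
    priors, word_counts, vocab = model
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in words:
            # Independence assumption: one factor per word.
            # Add-one smoothing is an assumption of this sketch.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical training descriptions, already split into keywords.
docs = [
    (["spaceship", "alien", "planet"], "sci-fi"),
    (["alien", "invasion", "planet"], "sci-fi"),
    (["love", "wedding", "paris"], "romance"),
    (["love", "letter", "summer"], "romance"),
]
model = train_nb(docs)
genre = predict_nb(model, ["alien", "planet", "love"])
print(genre)
```

The evidence term P(B) is omitted because it is the same for every class and does not change which class scores highest.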

Example

Problem Statement:
If we want to find out whether a customer will default on a loan given that they have a low credit score, we
can use Bayes' theorem:
Step-by-Step Calculation:

Step 1: Define Events


• A = Customer defaults on the loan

• B = Customer has a low credit score

We want to calculate P(A|B) → Probability that a customer defaults, given that they have a low credit score.

Step 2: Use Bayes’ Theorem

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where:

• P(B|A) = Probability of having a low credit score, given that the person defaulted.
• P(A) = Probability that a random customer defaults (default rate).
• P(B) = Probability that a random customer has a low credit score.

Step 3: Assign Sample Values


Assume the following probabilities based on historical data:

• P(B|A) = 0.85 (85% of defaulters had low credit scores)


• P(A) = 0.30 (30% of all customers default)

• P(B) = 0.40 (40% of all customers have low credit scores)

Step 4: Calculate P(A|B)

$$P(A \mid B) = \frac{0.85 \times 0.30}{0.40} = \frac{0.255}{0.40} = 0.6375$$

Step 5: Interpret the Result


The probability that a customer will default given that they have a low credit score is 63.75%. This means
that based on historical data, a customer with a low credit score has a high chance of defaulting.
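
The same calculation can be checked with a few lines of Python. The function name `posterior` is an assumption made for this sketch; the three input values are the sample probabilities from Step 3.

```python
def posterior(likelihood, prior, marginal):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal


# P(B|A) = 0.85, P(A) = 0.30, P(B) = 0.40, as assumed in Step 3.
p_default_given_low_score = posterior(likelihood=0.85, prior=0.30, marginal=0.40)
print(f"{p_default_given_low_score:.4f}")
```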

Support Vector Machines


Support Vector Machine (SVM) is a machine learning algorithm used to classify data into different
categories. It works by finding the best possible boundary, called a hyperplane, that separates the data points
into different groups. The data points that are closest to this boundary are called support vectors, and they
help determine the best way to divide the data. SVM is mainly used for classification problems, though it
can also be applied to regression.

For example, suppose a company wants to classify customers as either "loyal" or "one-time buyers" based
on their shopping habits. SVM looks at factors like purchase frequency and spending habits to draw a
boundary between the two types of customers. In a plot of such data, the two categories appear as two groups
of points separated by a decision boundary, or hyperplane.

SVM can be of two types:

• Linear SVM: Used for linearly separable data, that is, a dataset whose two classes can be separated by a
single straight line (or, in higher dimensions, a flat hyperplane). The classifier used for such data is
called a linear SVM classifier.
• Non-linear SVM: Used for data that is not linearly separable, that is, a dataset whose classes cannot be
separated by a straight line. The classifier used for such data is called a non-linear SVM classifier.

Key Terminologies in SVM

• Hyperplane
The hyperplane is the decision boundary that separates different classes in a dataset. In a 2D space,
it is a straight line, while in a 3D space, it is a plane. For higher dimensions, it is difficult to visualize
but still serves as a boundary. SVM aims to find the optimal hyperplane that maximizes the
separation between the classes, ensuring the best classification performance.

• Support Vectors
Support vectors are the closest data points to the hyperplane from both classes. These points
influence the position and orientation of the hyperplane. They are essential because even if other
data points are removed, the decision boundary remains unchanged as long as the support vectors
are present.

• Margin
The margin is the distance between the hyperplane and the nearest support vectors from each
class. A larger margin helps the model generalize better to new data and reduces the risk of
overfitting. SVM selects the hyperplane with the maximum margin, ensuring a better distinction
between classes.

• Hard Margin
A hard margin SVM strictly separates classes with no misclassification, meaning every data point
must be perfectly classified. This method only works when the data is perfectly separable and does
not have noise. However, in real-world scenarios, data is often messy, making hard margin SVM less
practical.

• Soft Margin
A soft margin SVM allows some misclassification, meaning a few data points may be on the wrong
side of the hyperplane. This approach helps SVM handle noisy and overlapping data, improving
flexibility and generalization. Soft margin SVM is preferred in most real-world applications.

• Maximum Margin
The maximum margin refers to the largest possible distance between the hyperplane and the
nearest support vectors from each class. A wider margin helps improve the generalization ability of
the model, making it perform well on new, unseen data while reducing errors.

• Kernel Trick
When data is not linearly separable, the kernel trick lets SVM work as if the data had been mapped into a
higher-dimensional space where it becomes separable. Instead of adding the extra dimensions explicitly, a
kernel function computes the similarity between pairs of points as it would be in that space. Common kernel
functions include linear, polynomial, and radial basis function (RBF), each useful for different types of data.

• C (Regularization Parameter)
The C parameter in SVM controls the balance between achieving a large margin and minimizing
classification errors. A higher C value penalizes misclassified training points more heavily, which can lead to
overfitting if the data has noise. A lower C value tolerates some misclassification in exchange for a wider
margin, which often improves generalization. The choice of C depends on the dataset and the desired
trade-off between accuracy and flexibility.

• Dual Problem in SVM
The dual problem in SVM is an alternative mathematical formulation that makes it easier to solve
high-dimensional problems. Instead of working with feature space directly, it transforms the
optimization problem using Lagrange multipliers. This approach is computationally efficient,
especially when using kernel tricks to handle non-linearly separable data. The dual problem helps
SVM find the optimal hyperplane without explicitly mapping data to a higher dimension.

• Hinge Loss
Hinge loss is the loss function used in SVM to penalize misclassification and margin violations. If a data
point is correctly classified and lies outside the margin, the loss is zero. However, if it is incorrectly
classified or falls within the margin, a penalty is applied that grows with its distance from the correct side
of the margin. Minimizing hinge loss therefore pushes the model toward correct classification with
maximum separation from the decision boundary.
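
Several of the terms above (margin, soft margin, C, and hinge loss) meet in one objective function. The sketch below uses a common soft-margin formulation, 0.5·||w||² + C·Σ hinge loss, with labels encoded as +1 and -1. The weights, bias, and sample points are hypothetical and are not the result of any training run.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x + b)) for one sample with y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, samples, C=1.0):
    """0.5 * ||w||^2 (favours a wide margin) plus C times the total hinge loss."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + C * sum(hinge_loss(w, b, x, y) for x, y in samples)


w, b = (1.0, 1.0), -3.0      # hypothetical hyperplane: x1 + x2 - 3 = 0
samples = [
    ((3.0, 3.0), +1),        # correct side, outside the margin
    ((2.0, 1.5), +1),        # correct side, but inside the margin
    ((1.0, 1.0), +1),        # wrong side of the hyperplane
]
losses = [hinge_loss(w, b, x, y) for x, y in samples]
print(losses)
print(soft_margin_objective(w, b, samples, C=1.0))
print(soft_margin_objective(w, b, samples, C=10.0))
```

Raising C makes the same margin violations cost more, which is why a high C pushes the optimizer toward classifying every training point correctly.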

Advantages of SVM:

• Works well for high-dimensional data.

• Effective when the number of dimensions is greater than the number of samples.

• Finds the optimal hyperplane for better classification.

• Performs well with a clear margin of separation.

• Can handle non-linearly separable data using kernel tricks.

Disadvantages of SVM:

• Computationally expensive to train, so it does not scale well to very large datasets.

• Struggles with overlapping classes in noisy data.

• Choosing the right kernel can be complex.

• Difficult to interpret compared to simpler models.

How does Linear SVM work?

Problem Statement:
We have a dataset with two classes, green and blue, and two features, x1 and x2. Our objective is to
develop a classifier that accurately categorizes a new data point as either green or blue using the best
possible decision boundary.

Solution:

1. Identifying Possible Decision Boundaries

In a two-dimensional space where data points belong to two different classes (green and blue), we can
separate them using a straight line. However, there are multiple possible lines that can divide the data.

2. Finding the Optimal Hyperplane

Among the many possible lines, SVM aims to find the optimal hyperplane, which is the decision
boundary that maximizes the separation between the two classes. With two features, as here, the
hyperplane is simply a straight line.
3. Selecting Support Vectors

Not all data points are equally important for defining the decision boundary. SVM identifies the data points
that are closest to the hyperplane—these are called support vectors. These support vectors are crucial
because they directly influence the position and orientation of the hyperplane. The hyperplane is chosen in
a way that maximizes the distance between it and these support vectors.

4. Maximizing the Margin

The margin is the distance between the hyperplane and the nearest support vectors from each class. SVM's
objective is to maximize this margin so that the classification boundary is as far away from both classes as
possible. A larger margin reduces the risk of misclassification and improves the model's ability to generalize
well to unseen data. Among all possible hyperplanes, SVM selects the one that achieves the widest margin
while still correctly classifying most training data points.

5. Classifying New Data Points

Once the optimal hyperplane is established, any new data point can be classified based on which side of
the hyperplane it falls on.

• If the point falls on one side, it belongs to Class A (green).

• If it falls on the other side, it belongs to Class B (blue).


By ensuring a clear and maximized margin between the two classes, SVM improves classification
accuracy and robustness.
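
Once a hyperplane w·x + b = 0 is known, classification reduces to checking the sign of w·x + b, and the margin width follows from the length of w. The sketch below assumes a hypothetical, already-chosen hyperplane; it does not perform the optimization that finds w and b, and the mapping of the positive side to "green" is an arbitrary choice for illustration.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Positive side -> green, negative side -> blue (an arbitrary assignment)."""
    return "green" if decision_value(w, b, x) >= 0 else "blue"


def margin_width(w):
    """Distance between the margin lines w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


w, b = (1.0, -1.0), 0.0      # hypothetical hyperplane: x1 - x2 = 0
label_1 = classify(w, b, (4.0, 1.0))
label_2 = classify(w, b, (1.0, 3.0))
width = margin_width(w)
print(label_1, label_2, width)
```

Because the width is 2 / ||w||, maximizing the margin is equivalent to making w as short as possible while keeping the training points on the correct sides.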

How does Non-Linear SVM work?

Problem Statement

We have a dataset with two classes, green and blue, and two features, x1 and x2. Our objective is to
develop a classifier that accurately categorizes a new data point as either green or blue using the best
possible decision boundary.

Solution

1. Identifying the Problem in 2D Space


We have green and blue points in a 2D space (x, y), corresponding to the two features x1 and x2. The two
classes are arranged so that no straight line can separate them, so a different approach is required to
classify them properly.

2. Introducing a Third Dimension


To separate these data points, we add one more dimension. For linear data we used the two dimensions x
and y; for non-linear data we add a third dimension z, calculated as:

$$z = x^2 + y^2$$

With this third dimension added, each point is lifted to a height equal to its squared distance from the origin.

3. Finding the Best Hyperplane
In the newly transformed 3D space, SVM finds the best hyperplane that separates the data points. Here the
hyperplane is a flat plane at a constant value of z (parallel to the x-y plane), which divides the data into
two distinct classes.

4. Projecting Back to 2D Space


When projected back to 2D, the decision boundary appears as a circle instead of a line, because a constant
value of x^2 + y^2 describes a circle centred on the origin. This allows SVM to handle complex, non-linear
separations.

5. Classifying New Data Points


New points inside the circular boundary belong to the green class, while points outside belong to the blue
class.

In summary, SVM transforms non-linearly separable data into linearly separable data using higher
dimensions, so that a flat boundary in the new space acts as a non-linear boundary in the original one.
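
The lifting step can be made concrete with a small sketch. It assumes the separating plane sits at a hypothetical height z = 4, which corresponds to a circle of radius 2 in the original 2D space; in a real SVM this threshold would be learned from the data rather than fixed by hand.

```python
def lift(x, y):
    """Map a 2-D point to 3-D by adding z = x^2 + y^2."""
    return (x, y, x * x + y * y)


def classify(x, y, z_threshold=4.0):
    """Points below the plane z = z_threshold are green, the rest are blue.

    In 2-D this is the same as testing whether (x, y) lies inside a circle
    of radius sqrt(z_threshold) centred on the origin.
    """
    _, _, z = lift(x, y)
    return "green" if z < z_threshold else "blue"


inner = classify(1.0, 1.0)    # z = 2, below the plane
outer = classify(3.0, 0.0)    # z = 9, above the plane
print(inner, outer)
```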

Decision Trees
A Decision Tree is a type of machine learning model used to make predictions. It works like a flowchart,
where each question (or decision) splits the data into smaller parts until a final answer is reached. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome. A decision tree has two types of node: the
decision node and the leaf node. Decision nodes test a feature and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further branches.

It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions. It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure. The model keeps splitting data until it reaches a clear
conclusion. It decides the best way to split using methods like Gini index or information gain, which
measure how well a split separates the data.

Key Terminologies in Decision trees

• Root Node

The root node is the starting point of a decision tree that represents the entire dataset. It is the first node
where data is split based on a selected feature. The root node serves as the foundation for the tree’s
branching structure.

• Leaf/Terminal Node

A leaf node (or terminal node) is the final output of a decision tree where no further splits occur. It
represents a classification or a predicted value in regression problems. Once a data point reaches a leaf node,
a final decision is made.

• Splitting

Splitting is the process of dividing a node into two or more sub-nodes based on specific conditions. It helps
in segregating the dataset into smaller, more homogeneous groups. The goal is to create meaningful
distinctions for better classification or prediction.

• Parent and Child Node

A parent node is a node that splits into two or more child nodes during the decision tree formation. The child
nodes are the result of applying decision rules to the parent node. Each child node may further split into
additional child nodes, forming a hierarchy.

• Branch/Sub-tree

A branch (or sub-tree) is a section of a decision tree that extends from a parent node. It represents a path of
decisions taken from the root to a specific outcome. A sub-tree can function independently as a smaller
decision tree.

• Decision Node

A decision node is an internal node that represents a condition or decision rule in the tree. It has two or more
branches leading to child nodes based on different possible outcomes. Decision nodes help guide data points
to their final classification or prediction.

• Pruning

Pruning is the process of removing unnecessary branches from a decision tree to prevent overfitting. It helps
simplify the tree by cutting parts that do not contribute much to accuracy. There are two types: pre-pruning,
which stops the tree from growing too deep, and post-pruning, which trims the tree after it is fully grown.
Pruning improves model generalization by reducing complexity and making predictions more reliable. A
well-pruned tree balances accuracy and efficiency for better performance on new data.

• Gini index

Gini Index is a measure of impurity used in decision trees to determine the best feature for splitting the data.
It calculates how often a randomly chosen element from the set would be incorrectly labelled if randomly
classified. The formula for Gini Index is:

$$\text{Gini} = 1 - \sum_{i} p_i^2$$

where p_i is the probability of each class in the dataset. A Gini Index of 0 means perfect purity (all instances
belong to one class), while a higher value indicates more impurity. The feature whose split gives the lowest
weighted Gini Index across the resulting groups is chosen, since it creates the most homogeneous groups.

• Information gain

Information Gain (IG) measures the reduction in entropy after splitting a dataset based on a specific feature.
It helps determine which attribute provides the most useful information for classification. A higher IG value
indicates a better feature for splitting, leading to a more efficient decision tree. The decision tree algorithm
selects the attribute with the highest information gain first. It is calculated using the formula:

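$$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v)$$

where S is the dataset before the split, A is the feature being tested, and S_v is the subset of S in which A takes the value v.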

• Entropy

Entropy measures the impurity or randomness in a dataset, helping determine how mixed the classes are. If a
dataset is perfectly pure (all data points belong to one class), entropy is 0, meaning no uncertainty. If a
two-class dataset is evenly split, entropy reaches its maximum of 1, indicating maximum disorder (with k
classes the maximum is log2 k). Entropy helps in deciding how to split a decision tree by evaluating how
much uncertainty a feature removes. A lower entropy after splitting means a better feature for
classification. It is calculated using the formula:

$$\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$

where p_i is the probability of each class in the dataset.

• Random forest

Random Forest is a machine learning algorithm that uses many decision trees to make better predictions. It
builds each tree on a random sample of the data (and typically a random subset of the features), then combines
their results, by majority vote for classification or by averaging for regression, to give the final answer.
Combining many trees usually makes the model more accurate and less prone to overfitting than a single tree.
It works with both numerical and categorical data and handles large and complex data well.
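
The three impurity measures defined above (Gini index, entropy, and information gain) can be computed directly from lists of class labels. The sketch below is a minimal illustration; the function names and the hypothetical split of ten "Yes"/"No" labels into two groups are assumptions made for this example.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Fraction of the labels that belongs to each class."""
    return [n / len(labels) for n in Counter(labels).values()]


def gini(labels):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted


# Hypothetical split of ten labels into two groups of five.
parent = ["Yes"] * 5 + ["No"] * 5
left = ["Yes"] * 4 + ["No"] * 1
right = ["Yes"] * 1 + ["No"] * 4

print(gini(parent), entropy(parent))
print(gini(left), information_gain(parent, [left, right]))
```

The evenly mixed parent has the highest impurity for two classes; the split produces purer groups, so the information gain is positive.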

Steps to Build a Decision Tree

1. Select the Best Feature (Root Node)


The first step is choosing the most important feature to split the dataset. This is done using criteria
like Gini Index or Entropy (Information Gain), which measure how well a feature separates the data.
The feature that provides the best separation becomes the root node. A well-chosen root node helps
create an effective decision tree.

2. Splitting the Dataset


Once the root node is selected, the dataset is divided into smaller groups based on feature values.
Each subset contains similar data points, making classification easier. This step helps in breaking
down complex problems into simpler parts. The goal is to make each group as pure as possible.

3. Create Decision Nodes and Leaf Nodes


After splitting, decision nodes are created where further divisions are needed, while leaf nodes
represent final outcomes. Decision nodes help refine classification by considering additional
features. If no further splitting is needed, the node becomes a leaf. Leaf nodes represent the final
predicted class or value.

4. Repeat Splitting Until a Stopping Criterion Is Met


The process continues until the tree reaches its maximum depth or further splitting does not improve
accuracy. Stopping conditions prevent overfitting and unnecessary complexity. This ensures that the
model remains generalizable to new data. A well-pruned tree balances accuracy and simplicity.

5. Pruning the Tree (Optional Step)


After constructing the tree, pruning removes unnecessary branches to prevent overfitting. Pruning
simplifies the tree by cutting branches that do not significantly improve accuracy. This step enhances
the model’s ability to work with new data. A pruned tree is more efficient and interpretable.
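
Step 1 can be illustrated with the Gini Index, which the notes mention but do not define. The sketch below assumes the standard definition, Gini impurity = 1 minus the sum of squared class proportions, and picks as root the feature whose split leaves the lowest weighted impurity. The loan data and the names `gini`, `weighted_gini`, and `best_feature` are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, labels, feature):
    """Average impurity of the subsets produced by splitting on `feature`,
    weighted by subset size (steps 1 and 2 combined)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())

def best_feature(rows, labels, features):
    # The root node is the feature whose split leaves the least impurity.
    return min(features, key=lambda f: weighted_gini(rows, labels, f))

# Hypothetical loan data, invented for illustration.
rows = [
    {"income": "high", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "low", "owns_home": "no"},
]
labels = ["approve", "approve", "reject", "reject"]

print(gini(labels))                               # 0.5 before any split
print(weighted_gini(rows, labels, "income"))      # 0.0, both subsets pure
print(weighted_gini(rows, labels, "owns_home"))   # 0.5, no improvement
print(best_feature(rows, labels, ["income", "owns_home"]))  # income
```

Here `income` would become the root node because splitting on it produces perfectly pure groups, while `owns_home` leaves the classes as mixed as before.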

Steps to Build a Decision Tree (Simple Example: Deciding Whether to Play Outside)

Step 1: Collect Data

First, gather relevant information like weather, temperature, and wind speed to decide if playing outside is a
good idea.

Step 2: Choose the Best Attribute

Find the most important factor that affects the decision. If weather is the key factor (since rain might stop
outdoor play), we select it first.

Step 3: Split the Data Based on the Chosen Attribute

We separate days into three categories: sunny, rainy, and cloudy. Each category may have different
conditions for playing outside.

Step 4: Create Decision Nodes and Continue Splitting

For sunny days, we check wind speed.


• Low wind → Play outside (Yes)
• High wind → Don’t play (No)

For rainy days, we consider temperature.


• Cold weather → Don’t play (No)

• Mild weather → Play outside (Yes)


Once we can confidently decide whether to play or not for all conditions, we stop splitting. The final points
are called leaf nodes, which give the decision.

Weather
   Cloudy → Yes
   Sunny → Wind speed
      Low wind → Yes
      High wind → No
   Rainy → Temperature
      Cold weather → No
      Mild weather → Yes
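
The play-outside example can also be built automatically. The sketch below is an ID3-style builder that uses information gain to choose each split and stops when a node is pure, no features remain, the depth limit is reached, or a split no longer helps. The ten days of data are hypothetical, chosen only to be consistent with the example above, and the names `build_tree` and `predict` are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def split(rows, labels, feature):
    """Group rows and labels by the value of one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        rs, ys = groups.setdefault(row[feature], ([], []))
        rs.append(row)
        ys.append(label)
    return groups

def gain(rows, labels, feature):
    weighted = sum(len(ys) / len(labels) * entropy(ys)
                   for _, ys in split(rows, labels, feature).values())
    return entropy(labels) - weighted

def build_tree(rows, labels, features, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority                               # leaf node
    best = max(features, key=lambda f: gain(rows, labels, f))
    if gain(rows, labels, best) <= 0:
        return majority                               # splitting no longer helps
    remaining = [f for f in features if f != best]
    return {"feature": best, "default": majority,     # decision node
            "branches": {v: build_tree(rs, ys, remaining, max_depth - 1)
                         for v, (rs, ys) in split(rows, labels, best).items()}}

def predict(tree, row):
    # Walk down decision nodes until a leaf (a plain label) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree

# Hypothetical observations: (weather, wind, temp, played outside?).
data = [("sunny", "low", "mild", "Yes"), ("sunny", "high", "mild", "No"),
        ("sunny", "low", "cold", "Yes"), ("sunny", "high", "cold", "No"),
        ("rainy", "low", "cold", "No"), ("rainy", "high", "mild", "Yes"),
        ("rainy", "low", "mild", "Yes"), ("rainy", "high", "cold", "No"),
        ("cloudy", "low", "mild", "Yes"), ("cloudy", "high", "cold", "Yes")]
rows = [{"weather": w, "wind": wi, "temp": t} for w, wi, t, _ in data]
labels = [y for *_, y in data]

tree = build_tree(rows, labels, ["weather", "wind", "temp"])
print(tree["feature"])  # weather
print(predict(tree, {"weather": "rainy", "wind": "low", "temp": "cold"}))  # No
```

On this data the builder selects weather as the root, then wind speed for sunny days and temperature for rainy days, matching the hand-built tree.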

Advantages of decision trees:

• Easy to understand and interpret with visual representation.
• Handles both numerical and categorical data efficiently.
• No need for feature scaling, making preprocessing simpler.
• Automatically selects important features for decision-making.
• Non-parametric: makes no assumptions about the underlying data distribution.

Disadvantages of decision trees:

• Prone to overfitting, especially with deep trees.
• Sensitive to noisy data, leading to high variance.
• Biased towards dominant classes in imbalanced datasets.
• Computationally expensive for deep trees with large datasets.
• Small changes in data can significantly alter tree structure.

Applications:

• Credit Scoring
Banks and financial institutions assess loan applicants’ risk by analysing their credit history, income, and
past transactions. This helps in approving or rejecting loan applications based on predefined risk
thresholds.
• Medical Diagnosis
Classification models help predict diseases by analysing patient symptoms, lab reports, and historical
medical records. Doctors use these models to assist in early diagnosis and treatment planning.
• Fraud Detection
Financial institutions use classification to identify fraudulent transactions based on spending patterns,
transaction locations, and anomaly detection. This helps in preventing credit card fraud and online
payment scams.
• Spam Filtering
Email services classify incoming messages as spam or not spam based on keywords, sender history, and
probability analysis. This prevents users from receiving unnecessary or harmful emails.
• Sentiment Analysis
Businesses analyse customer feedback, social media posts, and product reviews to classify opinions as
positive, negative, or neutral. This helps in understanding public perception and improving services.
• Customer Segmentation
Companies group customers based on their purchasing behaviour, preferences, and demographics. This
helps in personalized marketing campaigns and product recommendations.
• Document Categorization
Text classification techniques categorize documents into topics like news, sports, finance, or
entertainment. This is useful for organizing large datasets and improving search accuracy.
• Face Recognition
AI models classify facial images based on unique features to identify individuals in security systems and
social media applications. This enhances authentication and surveillance technologies.
• Speech Recognition
Voice assistants and speech-to-text applications classify spoken words into text using machine learning
models. This helps in enabling hands-free control and accessibility features.
• Product Recommendation
E-commerce platforms classify user preferences and browsing history to suggest relevant products. This
enhances the shopping experience and boosts sales through targeted advertising.
