DWM Unit 3

By Asst. Prof. Archana Kalia

UNIT 3 Classification

Basic Concepts, Decision Tree Induction, Naïve Bayesian Classification, Accuracy and
Error measures, Evaluating the Accuracy of a Classifier: Holdout & Random
Subsampling, Cross Validation, Bootstrap.

Classification is a task in data mining that involves assigning a class label to each instance in
a dataset based on its features. The goal of classification is to build a model that accurately
predicts the class labels of new instances based on their features.

There are two main types of classification: binary classification and multi-class classification.
Binary classification involves classifying instances into two classes, such as "spam" or "not
spam", while multi-class classification involves classifying instances into more than two
classes.

The process of building a classification model typically involves the following steps:

Data Collection:
The first step in building a classification model is data collection. In this step, the data
relevant to the problem at hand is collected. The data should be representative of the problem
and should contain all the necessary attributes and labels needed for classification. The data
can be collected from various sources, such as surveys, questionnaires, websites, and
databases.

Data Preprocessing:
The second step in building a classification model is data preprocessing. The collected data
needs to be preprocessed to ensure its quality. This involves handling missing values, dealing
with outliers, and transforming the data into a format suitable for analysis. Data
preprocessing also involves converting the data into numerical form, as most classification
algorithms require numerical input.

Handling Missing Values: Missing values in the dataset can be handled by replacing them
with the mean, median, or mode of the corresponding feature or by removing the entire
record.

Dealing with Outliers: Outliers in the dataset can be detected using various statistical
techniques such as z-score analysis, boxplots, and scatterplots. Outliers can be removed from
the dataset or replaced with the mean, median, or mode of the corresponding feature.

Data Transformation: Data transformation involves scaling or normalizing the data to bring it
into a common scale. This is done to ensure that all features have the same level of
importance in the analysis.
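
As an illustration of these preprocessing steps, here is a minimal Python sketch (not part of the original notes; the tiny dataset and the column names "age" and "income" are invented for the example):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# A small, made-up dataset with one missing value and one obvious outlier.
df = pd.DataFrame({
    "age":    [25, 32, None, 45, 38, 29],
    "income": [40000, 52000, 61000, 58000, 950000, 47000],
})

# Handling missing values: replace them with the mean of the column (median/mode also work).
df["age"] = df["age"].fillna(df["age"].mean())

# Dealing with outliers: a simple z-score rule; extreme values are replaced with the median.
z = (df["income"] - df["income"].mean()) / df["income"].std()
df.loc[z.abs() > 2, "income"] = df["income"].median()

# Data transformation: scale all features to a common scale.
scaled = StandardScaler().fit_transform(df)
print(scaled)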

Feature Selection:
The third step in building a classification model is feature selection. Feature selection
involves identifying the most relevant attributes in the dataset for classification. This can be
done using various techniques, such as correlation analysis, information gain, and principal component analysis.
Correlation Analysis: Correlation analysis involves identifying the correlation between the features in the dataset. When two features are highly correlated with each other, one of them can be removed, as it provides little additional information for classification.

Information Gain: Information gain is a measure of the amount of information that a feature
provides for classification. Features with high information gain are selected for classification.

Principal Component Analysis:


Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of the
dataset. PCA identifies the most important features in the dataset and removes the redundant
ones.
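
A small Python sketch of these feature-selection ideas (illustrative only, not from the original notes); it uses scikit-learn's built-in Iris data, and mutual information is used here as a stand-in for information gain:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Correlation analysis: features that are highly correlated with each other are candidates for removal.
corr = np.corrcoef(X, rowvar=False)
print("feature-feature correlations:\n", corr.round(2))

# Information gain (approximated here by mutual information between each feature and the class).
print("mutual information:", mutual_info_classif(X, y).round(2))

# PCA: project the four features onto the two components that explain most of the variance.
X_reduced = PCA(n_components=2).fit_transform(X)
print("reduced shape:", X_reduced.shape)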

Model Selection:
The fourth step in building a classification model is model selection. Model selection
involves selecting the appropriate classification algorithm for the problem at hand. There are
several algorithms available, such as decision trees, support vector machines, and neural
networks.
Decision Trees: Decision trees are a simple yet powerful classification algorithm. They
divide the dataset into smaller subsets based on the values of the features and construct a tree-like model that can be used for classification.

Support Vector Machines: Support Vector Machines (SVMs) are a popular classification
algorithm used for both linear and nonlinear classification problems. SVMs are based on the
concept of maximum margin, which involves finding the hyperplane that maximizes the
distance between the two classes.

Neural Networks:

Neural Networks are a powerful classification algorithm that can learn complex patterns in
the data. They are inspired by the structure of the human brain and consist of multiple layers
of interconnected nodes.

Model Training:
The fifth step in building a classification model is model training. Model training involves
using the selected classification algorithm to learn the patterns in the data. The data is divided
into a training set and a validation set. The model is trained using the training set, and its
performance is evaluated on the validation set.
Model Evaluation:
The sixth step in building a classification model is model evaluation. Model evaluation involves assessing the performance of the trained model on a test set. This is done to ensure that the model generalizes well to unseen data.
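
A short end-to-end sketch of the model training and evaluation steps (not part of the original notes; the Iris dataset and the 60/20/20 train/validation/test split are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Split into training, validation and test sets (60/20/20 here).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

# Model training: fit the chosen classifier on the training set.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Tune/choose the model on the validation set, then report final accuracy on the test set.
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))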

Classification is a widely used technique in data mining and is applied in a variety of domains, such as email filtering, sentiment analysis, and medical diagnosis.


Classification: It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.

Example: Before starting any project, we need to check its feasibility. In this case, a
classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the Project
and to further approve it. It is a two-step process:

1. Learning Step (Training Phase): Construction of Classification Model


Different Algorithms are used to build a classifier by making the model learn using
the training set available. The model has to be trained for the prediction of accurate
results.

2. Classification Step: Model used to predict class labels and testing the constructed
model on test data and hence estimate the accuracy of the classification rules.


Test data are used to estimate the accuracy of the classification rules.

Training and Testing:

Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside so as not to get hurt. Learning to move away is his training part. During testing, if the person sees a heavy object coming towards him or falling on him and moves aside, the system is tested positively; if he does not move aside, the system is tested negatively.
The same is the case with data: a model must first be trained in order to give accurate and reliable results.

There are certain data types associated with data mining that tell us the format of the data (whether it is in text format or in numerical format).

Attributes – Represent different features of an object. Different types of attributes are:

1. Binary: Possesses only two values i.e. True or False


Example: Suppose there is a survey evaluating some products. We need to check
whether it’s useful or not. So, the Customer has to answer it in Yes or No.
Product usefulness: Yes / No

• Symmetric: Both values are equally important in all aspects

• Asymmetric: The two values are not equally important (for example, a positive result of a rare medical test carries more weight than a negative one).

2. Nominal: When more than two outcomes are possible; the values are names or categories rather than integers.
Example: One needs to choose some material but of different colors. So, the color
might be Yellow, Green, Black, Red.
Different Colors: Red, Green, Black, Yellow

• Ordinal: Values that must have some meaningful order.


Example: Suppose there are grade sheets of few students which might contain
different grades as per their performance such as A, B, C, D
Grades: A, B, C, D

• Continuous: May take an infinite number of values; typically stored as a float.

Example: Measuring the weight of a few students in an ordered manner, i.e. 50, 51, 52, 53
Weight: 50, 51, 52, 53
• Discrete: Finite number of values.
Example: Marks of a Student in a few subjects: 65, 70, 75, 80, 90
Marks: 65, 70, 75, 80, 90


Syntax:

• Mathematical Notation: Classification is based on building a function that takes an input feature vector “X” and predicts its outcome “Y” (a qualitative response taking values in a set C).

• Here a classifier (or model) is used, which is a supervised function; it can also be designed manually based on expert knowledge. It is constructed to predict class labels (example: the label “Yes” or “No” for the approval of some event).

Classifiers can be categorized into two major types:

1. Discriminative: It is a very basic type of classifier and determines just one class for each row of data. It models the decision directly from the observed data and therefore depends heavily on the quality of the data rather than on class distributions.
Example: Logistic Regression

2. Generative: It models the distribution of the individual classes and tries to learn the process that generates the data behind the scenes by estimating the distributions of the model. It can then be used to classify unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data: suppose there are 100 emails, of which 25% are spam (Class A) and 75% are non-spam (Class B). A user wants to check whether an email containing the word "cheap" should be classified as spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word "cheap"; the rest do not.
In Class B (the 75 non-spam emails), 70 out of 75 do not contain the word "cheap"; the remaining 5 do.
So, if an email contains the word "cheap", what is the probability of it being spam? (= 80%; see the short calculation below)
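
As a check on the 80% figure, here is a minimal Python sketch (not from the original notes; it assumes the counts above refer to emails containing the word "cheap"):

# Counts from the example above: 25 spam and 75 non-spam emails.
p_spam, p_not_spam = 0.25, 0.75

# Likelihoods: P("cheap" | spam) = 20/25, P("cheap" | not spam) = 5/75.
p_cheap_given_spam = 20 / 25
p_cheap_given_not_spam = 5 / 75

# Bayes' theorem: P(spam | "cheap") = P("cheap" | spam) * P(spam) / P("cheap").
p_cheap = p_cheap_given_spam * p_spam + p_cheap_given_not_spam * p_not_spam
print((p_cheap_given_spam * p_spam) / p_cheap)   # 0.8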

Classifiers Of Machine Learning:

1. Decision Trees
2. Bayesian Classifiers

3. Neural Networks

4. K-Nearest Neighbour

5. Support Vector Machines

6. Linear Regression
7. Logistic Regression


Associated Tools and Languages: Used to mine/ extract useful information from raw data.

• Main Languages used: R, SAS, Python, SQL

• Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka

• Libraries used: Jupyter, NumPy, Matplotlib, Pandas, Scikit-learn, NLTK, TensorFlow, Seaborn, Basemap, etc.

Real-Life Examples:
• Market Basket Analysis:
It is a modeling technique based on analysing frequent transactions in which combinations of items are bought together.
Example: Amazon and many other retailers use this technique. While viewing some products, suggestions are shown for commodities that other people have bought together in the past.
• Weather Forecasting:
Changing patterns in weather conditions need to be observed based on parameters such as temperature, humidity and wind direction. This observation also requires the use of previous records in order to make accurate predictions.

Advantages:
• Mining Based Methods are cost-effective and efficient

• Helps in identifying criminal suspects

• Helps in predicting the risk of diseases

• Helps banks and financial institutions identify likely defaulters, so that they can decide whether to approve cards, loans, etc.

Disadvantages:
Privacy: There is a chance that a company may give information about its customers to other vendors or use that information for its own profit.
Accuracy Problem: An accurate model must be selected in order to get the best accuracy and results.

A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes. It works like a flowchart, helping to make decisions step by step, where:

• Internal nodes represent attribute tests


• Branches represent attribute values


• Leaf nodes represent final decisions or predictions.

Decision trees are widely used due to their interpretability, flexibility and low preprocessing needs.

How Does a Decision Tree Work?

A decision tree splits the dataset based on feature values to create pure subsets: ideally, all items in a group belong to the same class. Each leaf node of the tree corresponds to a class label, and the internal nodes are feature-based decision points.

[Figure: Decision Tree]

Let's consider a decision tree for predicting whether a customer will buy a product based on age, income and previous purchases. Here's how the decision tree works (a small code sketch follows the steps below):

1. Root Node (Income)

First Question: "Is the person’s income greater than $50,000?"

• If Yes, proceed to the next question.

• If No, predict "No Purchase" (leaf node).

2. Internal Node (Age):

If the person’s income is greater than $50,000, ask: "Is the person’s age above 30?"

• If Yes, proceed to the next question.

• If No, predict "No Purchase" (leaf node).

3. Internal Node (Previous Purchases):


• If the person is above 30 and has made previous purchases, predict "Purchase" (leaf
node).
• If the person is above 30 and has not made previous purchases, predict "No Purchase"
(leaf node).
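
The tree described above can be written as a chain of nested conditions. This is an illustrative Python sketch (not the notes' own code); the function name and threshold values simply follow the example:

def predict_purchase(income, age, previous_purchases):
    # Root node: income test.
    if income <= 50000:
        return "No Purchase"
    # Internal node: age test.
    if age <= 30:
        return "No Purchase"
    # Internal node / leaves: previous-purchases test.
    return "Purchase" if previous_purchases > 0 else "No Purchase"

print(predict_purchase(income=60000, age=35, previous_purchases=2))  # Purchase
print(predict_purchase(income=40000, age=35, previous_purchases=2))  # No Purchase
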
Decision Making with 2 Decision Trees

Example: Predicting Whether a Customer Will Buy a Product Using Two Decision Trees

Tree 1: Customer Demographics

First tree asks two questions:

1. "Income > $50,000?"

• If Yes, Proceed to the next question.

• If No, "No Purchase"


2. "Age > 30?"

• Yes: "Purchase"

• No: "No Purchase"

Tree 2: Previous Purchases

"Previous Purchases > 0?"

• Yes: "Purchase"

• No: "No Purchase"


Once we have predictions from both trees, we can combine the results to make a final
prediction. If Tree 1 predicts "Purchase" and Tree 2 predicts "No Purchase", the final
prediction might be "Purchase" or "No Purchase" depending on the weight or confidence
assigned to each tree. This can be decided based on the problem context.

Understanding Decision Tree with Real life use case:

So far, we have looked at the attributes and components of a decision tree. Now let's walk through a real-life use case showing how a decision tree works step by step.

Step 1. Start with the Whole Dataset

We begin with all the data which is treated as the root node of the decision tree.

Step 2. Choose the Best Question (Attribute)

Pick the best question to divide the dataset. For example, ask: "What is the outlook?"

Possible answers: Sunny, Cloudy or Rainy.


Step 3. Split the Data into Subsets
Divide the dataset into groups based on the question:

• If Sunny, go to one subset.

• If Cloudy, go to another subset.

• If Rainy, go to the last subset.

Step 4. Split Further if Needed (Recursive Splitting)

For each subset, ask another question to refine the groups. For example, if the Sunny subset is mixed, ask: "Is the humidity high or normal?"

• High humidity → "Swimming".

• Normal humidity → "Hiking".

Step 5. Assign Final Decisions (Leaf Nodes)

When a subset contains only one activity, stop splitting and assign it a label:


• Cloudy → "Hiking".

• Rainy → "Stay Inside".

• Sunny + High Humidity → "Swimming".

• Sunny + Normal Humidity → "Hiking".

Step 6. Use the Tree for Predictions

To predict an activity, follow the branches of the tree. Example: if the outlook is Sunny and the humidity is High, follow the tree:
• Start at Outlook.

• Take the branch for Sunny.

• Then go to Humidity and take the branch for High Humidity.

• Result: "Swimming".
A decision tree works by breaking the data down step by step, asking the best possible question at each point, and stopping once it reaches a clear decision. It is an easy and understandable way to make choices. Because of their simple and clear structure, decision trees are very helpful in machine learning for tasks like sorting data into categories or making predictions.

Splitting Criteria in Decision Trees

In a Decision Tree, the process of splitting the data at each node is important. The splitting criterion finds the best feature to split the data on. Common splitting criteria include Gini Impurity and Entropy.
• Gini Impurity: This criterion measures how "impure" a node is. The lower the Gini
Impurity the better the feature splits the data into distinct categories.
• Entropy: This measures the amount of uncertainty or disorder in the data. The tree
tries to reduce the entropy by splitting the data on features that provide the most
information about the target variable.

These criteria help decide which features are useful for making the best split at each decision
point in the tree.
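
For reference, Gini Impurity is 1 - sum(p_i^2) and Entropy is -sum(p_i * log2(p_i)), where p_i is the proportion of class i in the node. A small Python sketch (illustrative only, not from the original notes):

from collections import Counter
import math

def gini_impurity(labels):
    # 1 - sum of squared class proportions; 0 means a perfectly pure node.
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    # -sum(p * log2(p)); 0 means no uncertainty in the node.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

node = ["yes", "yes", "yes", "no", "no"]
print(gini_impurity(node))  # 0.48
print(entropy(node))        # ~0.971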

Pruning in Decision Trees

• Pruning is an important technique used to prevent overfitting in Decision Trees.


Overfitting occurs when a tree becomes too deep and starts to memorize the training
data rather than learning general patterns. This leads to poor performance on new,
unseen data.
• This technique reduces the complexity of the tree by removing branches that have
little predictive power. It improves model performance by helping the tree generalize
better to new data. It also makes the model simpler and faster to deploy.


• It is useful when a Decision Tree is too deep and starts to capture noise in the data.
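
A hedged sketch of how pruning can be done with scikit-learn (not from the original notes): max_depth acts as pre-pruning and ccp_alpha applies cost-complexity (post-)pruning; the built-in breast-cancer dataset is just a convenient example.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree tends to overfit the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: limit depth (pre-pruning) and apply cost-complexity pruning (post-pruning).
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("full tree   - train:", full.score(X_train, y_train), "test:", full.score(X_test, y_test))
print("pruned tree - train:", pruned.score(X_train, y_train), "test:", pruned.score(X_test, y_test))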

Naive Bayes Classification

Naive Bayes is a classification algorithm that uses probability to predict which category a data point belongs to, assuming that all features are independent of each other. This section gives an overview of Naive Bayes and its use in machine learning.

Types of Naive Bayes Model


There are three types of Naive Bayes model:

1. Gaussian Naive Bayes

In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a normal distribution; when plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.

2. Multinomial Naive Bayes

Multinomial Naive Bayes is used when features represent the frequency of terms (such as
word counts) in a document. It is commonly applied in text classification, where term
frequencies are important.

3. Bernoulli Naive Bayes

Bernoulli Naive Bayes deals with binary features, where each feature indicates whether a
word appears or not in a document. It is suited for scenarios where the presence or absence of
terms is more relevant than their frequency. Both models are widely used in document classification tasks.
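
A small sketch showing the three variants in scikit-learn (illustrative only, not from the original notes; the tiny height/weight and email examples are made up):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Gaussian NB: continuous features assumed to be normally distributed within each class.
X_cont = [[6.0, 180], [5.9, 190], [5.5, 120], [5.4, 110]]   # e.g. height, weight
y = ["male", "male", "female", "female"]
print(GaussianNB().fit(X_cont, y).predict([[5.8, 150]]))

# Multinomial NB: word counts; Bernoulli NB: word presence/absence.
docs = ["cheap cheap offer", "meeting at noon", "cheap loan offer", "project status update"]
labels = ["spam", "ham", "spam", "ham"]
counts = CountVectorizer().fit_transform(docs)
print(MultinomialNB().fit(counts, labels).predict(counts[:1]))
print(BernoulliNB().fit(counts, labels).predict(counts[:1]))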

Naive Bayes Classifier Example

Bayes' theorem is an extension of conditional probability: it uses one conditional probability to calculate another.

To calculate P(A|B), we first have to calculate P(B|A).

Example:

Suppose we want to predict whether a person has diabetes, given a set of conditions; we need P(A|B), where:
Diabetes → Class → A
Conditions → Independent attributes → B

To calculate this using Naive Bayes,


1. First, calculate P(B|A) → from the dataset, find out how many of the diabetic patients (A) have these conditions (B). This is called the likelihood P(B|A).


2. Then multiply by P(A) → the prior probability → the probability of a diabetic patient in the dataset.

3. Then divide by P(B) → the evidence. This is the event that has been observed; given that this event has occurred, we calculate the probability that the other event also occurs.

This concept is known as the Naive Bayes algorithm.

P(B|A) → Likelihood

P(A) → Prior Probability
P(A|B) → Posterior Probability
P(B) → Evidence

Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)

Dataset

Consider the problem of playing golf. Here in this dataset, Play is the target variable.
Whether we can play golf on a particular day or not is decided by independent
variables Outlook, Temperature, Humidity, Windy.


Mathematical Explanation of Naive Bayes

Let's predict whether he/she can play golf given the conditions Sunny, Mild, Normal, False.

Simplified Bayes theorem

P(A|B) and P(!A|B) are decided only by the numerator, because the denominator P(B) is the same in both equations.

So, to predict the class yes or no, we can use the simplified formula P(A|B) = P(B|A) * P(A).

1. Calculate Prior Probability


Out of 14 records, 9 are yes and 5 are no. So P(yes) = 9/14 and P(no) = 5/14.


2. Calculate the Likelihoods

Outlook

Out of 14 records, 5 are Sunny, 4 are Overcast and 5 are Rainy.

What is the probability of the day being Sunny given that he/she can play golf?

From the dataset, the number of Sunny days on which we can play is 2. The total number of days on which we can play is 9.

So P(Sunny | yes) = 2/9

Similarly, we have to calculate the likelihoods for all the variables.

[Likelihood tables for Temperature, Humidity and Windy are computed in the same way.]

Let's now predict whether he/she can play golf given the conditions Sunny, Mild, Normal, False.

A = yes
B = (Sunny, Mild, Normal, False)

P(A|B) = P(yes | (Sunny, Mild, Normal, False))

P(A|B) = P(B|A) * P(A)
P(yes | (Sunny, Mild, Normal, False)) = P((Sunny, Mild, Normal, False) | yes) * P(yes)


[The probability of independent events is calculated by multiplying the probabilities of all the events; the Naive Bayes algorithm treats all the variables as independent.]

= P(Sunny | yes) * P(Mild | yes) * P(Normal | yes) * P(False | yes) * P(yes)

= 2/9 * 4/9 * 6/9 * 6/9 * 9/14

P(yes | (Sunny, Mild, Normal, False)) = 0.0282

Let's now calculate P(no | (Sunny, Mild, Normal, False)):

P(no | (Sunny, Mild, Normal, False)) = P((Sunny, Mild, Normal, False) | no) * P(no)

= P(Sunny | no) * P(Mild | no) * P(Normal | no) * P(False | no) * P(no)

= 3/5 * 2/5 * 1/5 * 2/5 * 5/14

P(no | (Sunny, Mild, Normal, False)) = 0.0069

Since 0.0282 > 0.0069, i.e. P(yes | conditions) > P(no | conditions), for the given conditions Sunny, Mild, Normal, False, play is predicted as yes.
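
The same calculation, written as a tiny Python sketch (not from the original notes; the numbers are taken directly from the worked example above):

# Priors: 9 "yes" and 5 "no" out of 14 records.
p_yes, p_no = 9/14, 5/14

likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)   # Sunny, Mild, Normal, False given yes
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # Sunny, Mild, Normal, False given no

score_yes = likelihood_yes * p_yes
score_no  = likelihood_no * p_no
print(round(score_yes, 4), round(score_no, 4))   # 0.0282 0.0069

# The predicted class is the one with the larger score.
print("play =", "yes" if score_yes > score_no else "no")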

Advantages of Naive Bayes Classifier

• Easy to implement and computationally efficient.

• Effective in cases with a large number of features.

• Performs well even with limited training data.

• It performs well in the presence of categorical features.

• For numerical features, the data is assumed to come from a normal distribution (as in Gaussian Naive Bayes).

Disadvantages of Naive Bayes Classifier


• Assumes that features are independent, which may not always hold in real-world data.

• Can be influenced by irrelevant attributes.

• May assign zero probability to unseen events, leading to poor generalization.

Applications of Naive Bayes Classifier

• Spam Email Filtering: Classifies emails as spam or non-spam based on features.

• Text Classification: Used in sentiment analysis, document categorization, and topic classification.

• Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.

• Credit Scoring: Evaluates creditworthiness of individuals for loan approval.

• Weather Prediction: Classifies weather conditions based on various factors.


Data Mining can be referred to as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging.

Techniques to evaluate the accuracy of classifiers.

Holdout

In the holdout method, the given dataset is randomly divided into three subsets:

• A training set is the subset of the dataset used to build predictive models.
• The validation set is the subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.

• Test sets, or unseen examples, are the subset of the dataset used to assess the likely future performance of the model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.

Basically, two-thirds of the data is allocated to the training set and the remaining one-third is allocated to the test set.
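
A minimal holdout sketch with scikit-learn (illustrative only, not from the original notes; the built-in Iris dataset stands in for "the given dataset", with roughly two-thirds used for training and one-third held out for testing):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: roughly two-thirds for training, one-third held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))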


Random Subsampling

• Random subsampling is a variation of the holdout method: the holdout method is repeated K times.

• Each repetition involves randomly splitting the data into a training set and a test set.

• The model is trained on the training set, and the mean squared error (MSE) is obtained from its predictions on the test set.

• Because the MSE depends on the particular split, a single split is not recommended: a new split can give a new MSE. Repeating the split and averaging reduces this dependence.

• The overall error is calculated as E = \frac{1}{K}\sum_{i=1}^{K} E_{i}
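
A sketch of random subsampling (not from the original notes): the holdout split is repeated K times and the per-split errors E_i are averaged. Classification error (1 - accuracy) is used here instead of MSE, since the Iris example is a classification problem:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Repeat the holdout split K times with different random splits and average the errors E_i.
K = 10
errors = []
for i in range(K):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=i)
    acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
    errors.append(1 - acc)              # E_i = error on the i-th split

print("overall error E =", np.mean(errors))   # E = (1/K) * sum of E_i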

Cross-Validation

• K-fold cross-validation is used when there is only a limited amount of data available, in order to achieve an unbiased estimate of the model's performance.

• Here, we divide the data into K subsets of equal size.

• We build the model K times, each time leaving one of the subsets out of training and using it as the test set.

• If K equals the sample size, this is called "Leave-One-Out" cross-validation.
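
A K-fold cross-validation sketch with scikit-learn (illustrative only, not from the original notes; the dataset, classifier and K = 5 are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: each fold is used exactly once as the test set.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold accuracies:", scores, "mean:", scores.mean())

# Leave-One-Out: K equals the number of samples.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOO mean accuracy:", loo_scores.mean())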


Bootstrapping

• Bootstrapping is a technique used to make estimates from the data by taking the average of estimates obtained from smaller data samples.

• The bootstrapping method involves iteratively resampling a dataset with replacement.

• With resampling, instead of estimating the statistic only once on the complete data, we can estimate it many times.

• Repeating this multiple times helps to obtain a vector of estimates.

• Bootstrapping can then compute the variance, expected value, and other relevant statistics of these estimates.
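
A bootstrap sketch with NumPy (not from the original notes; the data is synthetic and the statistic being estimated is the mean):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # made-up sample

# Resample with replacement many times and collect an estimate from each resample.
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]

# Expected value and variance of the statistic, estimated from the bootstrap distribution.
print("bootstrap mean:", np.mean(boot_means))
print("bootstrap variance of the mean:", np.var(boot_means))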
