Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Prediction
Linear Regression
Nonlinear regression
Classification vs. Prediction
Classification:
predicts categorical class labels (discrete or nominal)
constructs a model based on the training set and the
values (class labels) of a classifying attribute, then
uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification—A Two-Step
Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of each test sample is compared with the
model's classification result
Accuracy rate is the percentage of test set samples that
are correctly classified by the model
The test set is independent of the training set; otherwise
overfitting will occur
If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
Classification Process (1):
Model Construction
Training data is fed into a classification algorithm, which
produces the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
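The learned rule can be sketched as a simple function (the function name and the lowercase string encoding of the attribute values are my own choices, not from the slides):

```python
# Sketch of the classification rule learned from the training data above.
def predict_tenured(rank: str, years: int) -> str:
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank == "professor" or years > 6:
        return "yes"
    return "no"

print(predict_tenured("professor", 2))       # Bill -> yes
print(predict_tenured("assistant prof", 3))  # Mike -> no
```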
Classification Process (2): Use
the Model in Prediction
The classifier is applied to testing data to estimate its
accuracy, then to unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations,
etc. with the aim of establishing the existence
of classes or clusters in the data
August 5, 2025 Data Mining: Concepts and Techniques 6
Issues Regarding Classification and
Prediction (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues regarding classification and
prediction (2): Evaluating Classification
Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules
Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities
for hypotheses; among the most practical
approaches to certain types of learning problems
Incremental: Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct. Prior knowledge can be
combined with observed data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against which
other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample whose class label is
unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H|X): the
probability that the hypothesis holds given the
observed data sample X
P(H): prior probability of hypothesis H (i.e. the
initial probability before we observe any data;
it reflects the background knowledge)
P(X): probability that sample data is observed
P(X|H) : probability of observing the sample X,
given that the hypothesis holds
Bayesian
Theorem
Given training data X, the posterior probability of a
hypothesis H, P(H|X), follows the Bayes theorem:

    P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

    posterior = likelihood x prior / evidence

MAP (maximum a posteriori) hypothesis:

    h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
Practical difficulty: require initial knowledge of
many probabilities, significant computational cost
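The theorem is a one-line computation. A quick numeric sketch (the probability values below are illustrative, not from the slides):

```python
# Bayes' theorem as code: posterior = likelihood x prior / evidence.
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    return likelihood * prior / evidence

# Illustrative values: P(X|H) = 0.8, P(H) = 0.3, P(X) = 0.5
print(round(posterior(0.8, 0.3, 0.5), 2))  # 0.48
```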
Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally
independent given the class:

    P(X|Ci) = P(x_1|Ci) x P(x_2|Ci) x ... x P(x_n|Ci)

The probability of observing, say, two elements y1 and y2
together, given the current class C, is the product of the
probabilities of each element taken separately, given the
same class: P([y1,y2]|C) = P(y1|C) x P(y2|C)
No dependence relation between attributes
Greatly reduces the computation cost: only count the
class distribution
Once P(X|Ci) is known, assign X to the class with
maximum P(X|Ci) x P(Ci)
Training dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age<=30, income=medium, student=yes, credit_rating=fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Naïve Bayesian Classifier:
Example
Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4

X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) x P(Ci):
P(X|buys_computer="yes") x P(buys_computer="yes") = 0.044 x 9/14 = 0.028
P(X|buys_computer="no") x P(buys_computer="no") = 0.019 x 5/14 = 0.007

Therefore, X belongs to class "buys_computer=yes"
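The worked example can be reproduced with a short script. A hedged sketch: the variable names and the ASCII spelling "31...40" of the age range are my own choices.

```python
# Naive Bayes on the training dataset from the slides.
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def score(x, label):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)              # prior P(Ci), e.g. 9/14 for "yes"
    for k, value in enumerate(x):          # product of the P(x_k|Ci)
        p *= sum(1 for r in rows if r[k] == value) / len(rows)
    return p

x = ("<=30", "medium", "yes", "fair")      # the data sample X
scores = {c: score(x, c) for c in ("yes", "no")}
print({c: round(s, 3) for c, s in scores.items()})  # {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))                  # yes
```

Counting per-class frequencies is all the "training" there is, which is the computational saving the slide mentions.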
Naïve Bayesian Classifier:
Comments
Advantages:
Easy to implement
Good results obtained in most of the cases
Disadvantages:
Assumption of class conditional independence, hence
loss of accuracy
In practice, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer,
diabetes, etc.)
Dependencies among these cannot be modeled by the
Naïve Bayesian classifier
How to deal with these dependencies?
Bayesian Belief Networks
Avoid Overfitting in
Classification
Overfitting: An induced tree may overfit the training
data
Too many branches, some may reflect anomalies
due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not
split a node if this would result in the goodness
measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
Use a set of data different from the training data
to decide which is the “best pruned tree”
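The postpruning selection step can be sketched as follows: given candidate trees from progressive pruning and their accuracies on held-out data, pick the best one. The tree names and accuracy values below are made up for illustration.

```python
# Pick the "best pruned tree": the candidate with the highest accuracy
# on data held out from training. Candidates and accuracies are illustrative.
def best_pruned(candidates, heldout_accuracy):
    return max(zip(candidates, heldout_accuracy), key=lambda pair: pair[1])[0]

trees = ["full tree", "pruned once", "pruned twice"]
accs = [0.82, 0.88, 0.85]  # held-out accuracy of each candidate
print(best_pruned(trees, accs))  # pruned once
```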
The k-Nearest Neighbor
Algorithm
All instances correspond to points in the n-D
space.
The nearest neighbors are defined in terms of
Euclidean distance.
The target function could be discrete- or real-
valued.
For a discrete-valued target, k-NN returns the most
common value among the k training examples
nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN
for a typical set of training examples
[Figure: query point xq among ‘+’ and ‘−’ training examples]
Discussion on the k-NN
Algorithm
The k-NN algorithm for continuous-valued target
functions
Calculate the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors
according to their distance to the query point xq,
giving greater weight to closer neighbors:

    w = 1 / d(xq, xi)^2
Similarly, for real-valued target functions
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors
could be dominated by irrelevant attributes.
To overcome it, stretch the axes or eliminate the least
relevant attributes.
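A minimal sketch of distance-weighted k-NN for a discrete-valued target, using the weight w = 1/d(xq, xi)^2 from above. The toy 2-D points and labels are my own, chosen only to exercise the vote.

```python
import math

# Distance-weighted k-NN: each of the k nearest neighbors votes for its
# label with weight 1/d^2, so closer neighbors count for more.
def knn_predict(train, xq, k=3):
    """train: list of ((features...), label); returns the weighted-vote winner."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], xq))[:k]
    votes = {}
    for feats, label in nearest:
        d = math.dist(feats, xq)
        w = float("inf") if d == 0 else 1 / d**2   # exact match dominates
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

train = [((0, 0), "-"), ((1, 0), "-"), ((4, 4), "+"), ((5, 4), "+"), ((5, 5), "+")]
print(knn_predict(train, (4.5, 4.2)))  # +
print(knn_predict(train, (0.5, 0.1)))  # -
```

For a real-valued target, the vote would be replaced by a (weighted) mean of the neighbors' values, as the slide notes.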
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
Example of Confusion Matrix:

Actual class\Predicted class  buys_computer=yes  buys_computer=no  Total
buys_computer=yes             6954               46                7000
buys_computer=no              412                2588              3000
Total                         7366               2634              10000
Given m classes, an entry CM_i,j in a confusion matrix
indicates the number of tuples of class i that were labeled
by the classifier as class j
May have extra rows/columns to provide totals
Accuracy, Error Rate, Sensitivity
and Specificity
A\P  C   ¬C
C    TP  FN  P
¬C   FP  TN  N
     P’  N’  All

Classifier Accuracy, or recognition rate: percentage of
test set tuples that are correctly classified
    Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
    Error rate = (FP + FN)/All

Class Imbalance Problem:
One class may be rare, e.g. fraud, or HIV-positive
Significant majority of the negative class and minority
of the positive class
Sensitivity: True Positive recognition rate
    Sensitivity = TP/P
Specificity: True Negative recognition rate
    Specificity = TN/N
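Plugging the confusion matrix from the buys_computer example into these formulas (a small sketch; the variable names are mine):

```python
# Metrics from the confusion matrix above:
# TP = 6954, FN = 46, FP = 412, TN = 2588.
TP, FN, FP, TN = 6954, 46, 412, 2588
P, N = TP + FN, FP + TN          # actual positives and negatives
ALL = P + N

accuracy = (TP + TN) / ALL       # recognition rate
error_rate = (FP + FN) / ALL     # 1 - accuracy
sensitivity = TP / P             # true positive recognition rate
specificity = TN / N             # true negative recognition rate

print(round(accuracy, 4))        # 0.9542
print(round(sensitivity, 4))     # 0.9934
print(round(specificity, 4))     # 0.8627
```

Note how the high overall accuracy hides the weaker specificity, which is exactly the class-imbalance concern raised above.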