Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Prediction
Linear Regression
Nonlinear regression
Classification vs. Prediction
Classification:
predicts categorical class labels (discrete or nominal)
constructs a model based on the training set and the
values (class labels) of a classifying attribute, then
uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification—A Two-Step
Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of each test sample is compared with the
model's classification result
Accuracy rate is the percentage of test set samples that
are correctly classified by the model
The test set is independent of the training set; otherwise
overfitting will occur
If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
Classification Process (1):
Model Construction
Training data is fed into a classification algorithm, which
produces the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
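The learned rule can be sketched as a simple function (the function name and the lowercase string encoding of the attribute values are my own choices, not from the slides):

```python
# Sketch of the classification rule learned from the training data above.
def predict_tenured(rank: str, years: int) -> str:
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank == "professor" or years > 6:
        return "yes"
    return "no"

print(predict_tenured("professor", 2))       # Bill -> yes
print(predict_tenured("assistant prof", 3))  # Mike -> no
```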
Classification Process (2): Use
the Model in Prediction
The classifier is applied to testing data to estimate its
accuracy, then to unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations,
etc. with the aim of establishing the existence
of classes or clusters in the data
August 5, 2025 Data Mining: Concepts and Techniques 6
Issues Regarding Classification and
Prediction (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues regarding classification and
prediction (2): Evaluating Classification
Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules
Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities
for hypotheses; among the most practical
approaches to certain types of learning problems
Incremental: Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct. Prior knowledge can be
combined with observed data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against which
other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample whose class label is
unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H|X): the
probability that the hypothesis holds given the
observed data sample X
P(H): prior probability of hypothesis H (i.e. the
initial probability before we observe any data;
it reflects the background knowledge)
P(X): probability that sample data is observed
P(X|H) : probability of observing the sample X,
given that the hypothesis holds
Bayesian
Theorem
Given training data X, the posterior probability of a
hypothesis H, P(H|X), follows the Bayes theorem:

    P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

    posterior = likelihood x prior / evidence

MAP (maximum a posteriori) hypothesis:

    h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
Practical difficulty: require initial knowledge of
many probabilities, significant computational cost
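The theorem is a one-line computation. A quick numeric sketch (the probability values below are illustrative, not from the slides):

```python
# Bayes' theorem as code: posterior = likelihood x prior / evidence.
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    return likelihood * prior / evidence

# Illustrative values: P(X|H) = 0.8, P(H) = 0.3, P(X) = 0.5
print(round(posterior(0.8, 0.3, 0.5), 2))  # 0.48
```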
Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally
independent given the class:

    P(X|Ci) = P(x_1|Ci) x P(x_2|Ci) x ... x P(x_n|Ci)

The probability of observing, say, two elements y1 and y2
together, given the current class C, is the product of the
probabilities of each element taken separately, given the
same class: P([y1,y2]|C) = P(y1|C) x P(y2|C)
No dependence relation between attributes
Greatly reduces the computation cost: only count the
class distribution
Once P(X|Ci) is known, assign X to the class with
maximum P(X|Ci) x P(Ci)
Training dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age<=30, income=medium, student=yes, credit_rating=fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Naïve Bayesian Classifier:
Example
Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4

X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) x P(Ci):
P(X|buys_computer="yes") x P(buys_computer="yes") = 0.044 x 9/14 = 0.028
P(X|buys_computer="no") x P(buys_computer="no") = 0.019 x 5/14 = 0.007

Therefore, X belongs to class "buys_computer=yes"
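The worked example can be reproduced with a short script. A hedged sketch: the variable names and the ASCII spelling "31...40" of the age range are my own choices.

```python
# Naive Bayes on the training dataset from the slides.
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def score(x, label):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)              # prior P(Ci), e.g. 9/14 for "yes"
    for k, value in enumerate(x):          # product of the P(x_k|Ci)
        p *= sum(1 for r in rows if r[k] == value) / len(rows)
    return p

x = ("<=30", "medium", "yes", "fair")      # the data sample X
scores = {c: score(x, c) for c in ("yes", "no")}
print({c: round(s, 3) for c, s in scores.items()})  # {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))                  # yes
```

Counting per-class frequencies is all the "training" there is, which is the computational saving the slide mentions.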
Naïve Bayesian Classifier:
Comments
Advantages:
Easy to implement
Good results obtained in most of the cases
Disadvantages:
Assumption of class conditional independence, hence
loss of accuracy
In practice, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer,
diabetes, etc.)
Dependencies among these cannot be modeled by the
Naïve Bayesian classifier
How to deal with these dependencies?
Bayesian Belief Networks
Avoid Overfitting in
Classification
Overfitting: An induced tree may overfit the training
data
Too many branches, some may reflect anomalies
due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not
split a node if this would result in the goodness
measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
Use a set of data different from the training data
to decide which is the “best pruned tree”
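The postpruning selection step can be sketched as follows: given candidate trees from progressive pruning and their accuracies on held-out data, pick the best one. The tree names and accuracy values below are made up for illustration.

```python
# Pick the "best pruned tree": the candidate with the highest accuracy
# on data held out from training. Candidates and accuracies are illustrative.
def best_pruned(candidates, heldout_accuracy):
    return max(zip(candidates, heldout_accuracy), key=lambda pair: pair[1])[0]

trees = ["full tree", "pruned once", "pruned twice"]
accs = [0.82, 0.88, 0.85]  # held-out accuracy of each candidate
print(best_pruned(trees, accs))  # pruned once
```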
The k-Nearest Neighbor
Algorithm
All instances correspond to points in the n-D
space.
The nearest neighbors are defined in terms of
Euclidean distance.
The target function could be discrete- or real-
valued.
For a discrete-valued target, k-NN returns the most
common value among the k training examples
nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN
for a typical set of training examples
[Figure: query point xq among ‘+’ and ‘−’ training examples]
Discussion on the k-NN
Algorithm
The k-NN algorithm for continuous-valued target
functions
Calculate the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors
according to their distance to the query point xq,
giving greater weight to closer neighbors:

    w = 1 / d(xq, xi)^2
Similarly, for real-valued target functions
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors
could be dominated by irrelevant attributes.
To overcome it, stretch the axes or eliminate the least
relevant attributes.
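A minimal sketch of distance-weighted k-NN for a discrete-valued target, using the weight w = 1/d(xq, xi)^2 from above. The toy 2-D points and labels are my own, chosen only to exercise the vote.

```python
import math

# Distance-weighted k-NN: each of the k nearest neighbors votes for its
# label with weight 1/d^2, so closer neighbors count for more.
def knn_predict(train, xq, k=3):
    """train: list of ((features...), label); returns the weighted-vote winner."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], xq))[:k]
    votes = {}
    for feats, label in nearest:
        d = math.dist(feats, xq)
        w = float("inf") if d == 0 else 1 / d**2   # exact match dominates
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

train = [((0, 0), "-"), ((1, 0), "-"), ((4, 4), "+"), ((5, 4), "+"), ((5, 5), "+")]
print(knn_predict(train, (4.5, 4.2)))  # +
print(knn_predict(train, (0.5, 0.1)))  # -
```

For a real-valued target, the vote would be replaced by a (weighted) mean of the neighbors' values, as the slide notes.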
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
Example of Confusion Matrix:

Actual class\Predicted class  buys_computer=yes  buys_computer=no  Total
buys_computer=yes             6954               46                7000
buys_computer=no              412                2588              3000
Total                         7366               2634              10000
Given m classes, an entry CM_i,j in a confusion matrix
indicates the number of tuples of class i that were labeled
by the classifier as class j
May have extra rows/columns to provide totals
Accuracy, Error Rate, Sensitivity
and Specificity
A\P  C   ¬C
C    TP  FN  P
¬C   FP  TN  N
     P’  N’  All

Classifier Accuracy, or recognition rate: percentage of
test set tuples that are correctly classified
    Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
    Error rate = (FP + FN)/All

Class Imbalance Problem:
One class may be rare, e.g. fraud, or HIV-positive
Significant majority of the negative class and minority
of the positive class
Sensitivity: True Positive recognition rate
    Sensitivity = TP/P
Specificity: True Negative recognition rate
    Specificity = TN/N
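Plugging the confusion matrix from the buys_computer example into these formulas (a small sketch; the variable names are mine):

```python
# Metrics from the confusion matrix above:
# TP = 6954, FN = 46, FP = 412, TN = 2588.
TP, FN, FP, TN = 6954, 46, 412, 2588
P, N = TP + FN, FP + TN          # actual positives and negatives
ALL = P + N

accuracy = (TP + TN) / ALL       # recognition rate
error_rate = (FP + FN) / ALL     # 1 - accuracy
sensitivity = TP / P             # true positive recognition rate
specificity = TN / N             # true negative recognition rate

print(round(accuracy, 4))        # 0.9542
print(round(sensitivity, 4))     # 0.9934
print(round(specificity, 4))     # 0.8627
```

Note how the high overall accuracy hides the weaker specificity, which is exactly the class-imbalance concern raised above.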