— Module 3 —
Classification
Course Outcomes
ITC602.4: Implement appropriate data mining methods such as
classification, clustering, or frequent pattern mining on large data sets.
ITC602.5: Define and apply metrics to measure the performance of
various data mining algorithms.
Module 3. Classification

What is classification? What is prediction?
Issues regarding classification and prediction
1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
2. Bayesian classification
   Naive Bayes classifier
Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy sets
Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
Accuracy and error measures
Evaluating the accuracy
Classification vs. Prediction
Classification
predicts categorical class labels (discrete, unordered)
constructs a model based on the training set and the values (class
labels) of a classifying attribute, and uses the model to classify new
data
Ex. Classification model to categorize bank loan applications as
either safe or risky
Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Ex. Prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their
income and occupation
Typical applications
Credit approval/ Performance approval
Target marketing
Medical diagnosis
Manufacturing
Fraud detection
Classification—A Two-Step Process
1. Learning Step / Training Phase :
Classifier is built describing a predetermined set of
classes/concepts
A classification algorithm builds the classifier by analyzing
or “learning from” a training set made up of database
tuples (samples) and their associated class-label attribute
The class label attribute is discrete-valued and unordered
The model is represented as classification rules, decision
trees, or mathematical formulae
Process (1): Model Construction
Classification algorithms are applied to the training data to build the classifier (model).

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification—A Two-Step Process
2. Model usage: for classifying future or unknown data
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples
that are correctly classified by the model
Test set is independent of training set
If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
Process (2): Using the Model in Prediction

The classifier is first applied to test data (with known labels) to estimate its accuracy, and then to unseen data.

Test data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
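As a rough illustration (not part of the original slides), the learned rule from the model-construction step can be encoded directly and checked against the test data above; the function and variable names below are purely illustrative.

```python
# A minimal sketch: apply the learned rule to the test set, then to the
# unseen tuple (Jeff, Professor, 4). Names here are illustrative only.

def predict_tenured(rank: str, years: int) -> str:
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"),
            ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
print(f"estimated accuracy: {correct}/{len(test_set)}")

print("(Jeff, Professor, 4) ->", predict_tenured("Professor", 4))  # 'yes'
```

On this tiny test set the rule misclassifies Merlisa, so the estimated accuracy is 3/4; if that is acceptable, the rule is then applied to the unseen tuple.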
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations,
etc. with the aim of establishing the existence
of classes or clusters in the data
Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (Attribute/feature selection)
Remove the irrelevant or redundant attributes
Data transformation and reduction
Generalize and/or normalize data
Comparing Classification and Prediction
Methods
Accuracy
classifier accuracy: predicting class label
predictor accuracy: how well the predictor estimates the value of the
predicted attribute
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules
Decision Tree Induction: Training
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”

age?
  <=30:    student?         no -> no,  yes -> yes
  31..40:  yes
  >40:     credit_rating?   excellent -> no,  fair -> yes
Algorithm for Decision Tree
Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
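A minimal sketch of these formulas in plain Python (function and variable names are illustrative, not from any library), run on the age attribute of the training data above:

```python
# Entropy Info(D), partition entropy Info_A(D), and Gain(A) for a
# categorical attribute.
from collections import Counter
from math import log2

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj) for attribute attr."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# age values and buys_computer labels from the 14-tuple training table
rows = [{"age": a} for a in
        ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
         "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(round(info_gain(rows, "age", labels), 3))   # about 0.246
```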
Attribute Selection: Information Gain

Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples), using the training table shown earlier.
  Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Splitting on age:
  age      p_i  n_i  I(p_i, n_i)
  <=30     2    3    0.971
  31...40  4    0    0
  >40      3    2    0.971

  Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
  (\frac{5}{14} I(2,3) means “age <= 30” covers 5 of the 14 samples, with 2 yes’s and 3 no’s.)

Hence Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048,
so age is selected as the splitting attribute.
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information
requirement for A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
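A rough sketch of this procedure, assuming the entropy helper defined earlier; the data and function names are illustrative only:

```python
# Choose the split point for a continuous attribute: sort the values,
# try each midpoint, and keep the one with minimum expected information.
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                       # identical values give no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= split]
        right = [l for v, l in pairs if v > split]
        expected = len(left) / n * info(left) + len(right) / n * info(right)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best   # (split point, expected information requirement)

# made-up ages and class labels, purely for illustration
print(best_split_point([25, 32, 41, 45, 28, 38],
                       ["no", "yes", "yes", "no", "no", "yes"]))
```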
Gain Ratio for Attribute Selection (C4.5)

The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. Splitting on income (4 low, 6 medium, 4 high):
  SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
  GainRatio(income) = 0.029 / 1.557 ≈ 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index gini(D) is defined as
  gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If D is split on A into two subsets D_1 and D_2, the Gini index of the split is defined as
  gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)
Reduction in impurity:
  \Delta gini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute).
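A small sketch of these formulas in plain Python (names are illustrative), applied to the buys_computer counts used in the example that follows:

```python
# Gini index of a label list and Gini index of a binary split.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

d = ["yes"] * 9 + ["no"] * 5                   # 9 "yes", 5 "no" tuples
print(round(gini(d), 3))                       # 0.459

low_medium = ["yes"] * 7 + ["no"] * 3          # income in {low, medium}
high = ["yes"] * 2 + ["no"] * 2                # income = high
print(round(gini_split(low_medium, high), 3))  # 0.443
print(round(gini(d) - gini_split(low_medium, high), 3))  # reduction in impurity
```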
Gini Index (CART, IBM IntelligentMiner)

Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:
  gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
Suppose the attribute income partitions D into D_1 = {low, medium} (10 tuples) and D_2 = {high} (4 tuples):
  gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443
The splits {low, high}/{medium} and {medium, high}/{low} give 0.458 and 0.450, so {low, medium} (and {high}) is the best binary split on income, since it has the lowest Gini index.
Notes:
  All attributes are assumed continuous-valued
  May need other tools, e.g., clustering, to get the possible split values
  Can be modified for categorical attributes
Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
Information gain:
  biased towards multivalued attributes
Gain ratio:
  tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index:
  biased towards multivalued attributes
  has difficulty when the number of classes is large
  tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures

CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
C-SEP: performs better than information gain and Gini index in certain cases
G-statistic: has a close approximation to the χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations):
  CART: finds multivariate splits based on a linear combination of attributes
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training
data
Too many branches, some may reflect anomalies due to noise
or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get
a sequence of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”
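As a rough illustration (not from the slides), scikit-learn's DecisionTreeClassifier, if available, exposes both styles: pre-pruning via stopping parameters such as max_depth and min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha); the dataset and parameter values below are arbitrary, chosen only to show the comparison on a held-out set.

```python
# A hedged sketch comparing an unpruned, a prepruned, and a postpruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             random_state=0).fit(X_tr, y_tr)   # prepruning
post = DecisionTreeClassifier(ccp_alpha=0.01,
                              random_state=0).fit(X_tr, y_tr)  # postpruning

for name, m in [("fully grown", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, round(m.score(X_te, y_te), 3))   # accuracy on the held-out set
```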
Enhancements to Basic Decision Tree
Induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued
attributes that partition the continuous attribute
value into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones
that are sparsely represented
This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively
studied by statisticians and machine learning
researchers
Scalability: Classifying data sets with millions of
examples and hundreds of attributes with
reasonable speed
Why decision tree induction in data mining?
  relatively faster learning speed (than other classification methods)
  convertible to simple and easy-to-understand classification rules
  can use SQL queries for accessing databases
  comparable classification accuracy with other methods
Scalable Decision Tree Induction
Methods
SLIQ (EDBT’96 — Mehta et al.)
Builds an index for each attribute and only class
list and the current attribute list reside in
memory
SPRINT (VLDB’96 — J. Shafer et al.)
Constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
Integrates tree splitting and tree pruning: stop
growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan &
Ganti)
Builds an AVC-list (attribute, value, class label)
BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics

Let X be a data sample (“evidence”): class label is unknown
Let H be the hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability of H
  E.g., the probability that a given customer will buy a computer, regardless of age, income, …
P(X): the probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  E.g., given that a customer will buy a computer, the probability that the customer is aged 31..40 with medium income
Bayes’ Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
  P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}
Informally, this can be written as
  posterior = likelihood × prior / evidence
Predict that X belongs to class C_i iff the probability P(C_i|X) is the highest among all P(C_k|X) for the k classes.
Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost.
Towards the Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x_1, x_2, …, x_n)
Suppose there are m classes C_1, C_2, …, C_m
Classification derives the maximum posterior probability, i.e., the maximal P(C_i|X)
This can be derived from Bayes’ theorem:
  P(C_i|X) = \frac{P(X|C_i)\, P(C_i)}{P(X)}
Since P(X) is constant for all classes, only P(X|C_i)\, P(C_i) needs to be maximized.
Derivation of the Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
  P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \dots \times P(x_n|C_i)
This greatly reduces the computation cost: only the class distributions need to be counted.
If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D).
If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
  g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
  and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
Naïve Bayesian Classifier: An
Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
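A small sketch reproducing this computation in plain Python; the conditional probabilities are the counts read off the buys_computer table, and the function name is illustrative:

```python
# Naive Bayes scoring: pick the class maximizing P(X|Ci) * P(Ci),
# with P(X|Ci) factored under the class-conditional independence assumption.
def classify(x, priors, conds):
    scores = {}
    for c in priors:
        p = priors[c]
        for attr, value in x.items():
            p *= conds[c][attr][value]
        scores[c] = p
    return max(scores, key=scores.get), scores

priors = {"yes": 9 / 14, "no": 5 / 14}
conds = {
    "yes": {"age": {"<=30": 2 / 9}, "income": {"medium": 4 / 9},
            "student": {"yes": 6 / 9}, "credit": {"fair": 6 / 9}},
    "no":  {"age": {"<=30": 3 / 5}, "income": {"medium": 2 / 5},
            "student": {"yes": 1 / 5}, "credit": {"fair": 2 / 5}},
}
x = {"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}
label, scores = classify(x, priors, conds)
print(label, {c: round(s, 3) for c, s in scores.items()})
# -> yes, with scores roughly {yes: 0.028, no: 0.007}, matching the slide
```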
Avoiding the 0-Probability
Problem
Naïve Bayesian prediction requires each conditional probability to be
non-zero; otherwise, the predicted probability will be zero:
  P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10),
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their
“uncorrected” counterparts
Naïve Bayesian Classifier:
Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy
Practically, dependencies exist among variables
E.g., in hospital patient records: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by
Naïve Bayesian Classifier
How to deal with these dependencies? → Bayesian belief networks
What Is Prediction?
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a given input
Prediction is different from classification
  Classification predicts categorical class labels
  Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or predictor
variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson regression,
log-linear models, regression trees
Linear Regression

Linear regression involves a response variable y and a single predictor variable x:
  y = w_0 + w_1 x
where w_0 (y-intercept) and w_1 (slope) are the regression coefficients.
Method of least squares: estimates the best-fitting straight line
  w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
Multiple linear regression: involves more than one predictor variable
  Training data is of the form (X_1, y_1), (X_2, y_2), …, (X_{|D|}, y_{|D|})
  Ex. For 2-D data we may have: y = w_0 + w_1 x_1 + w_2 x_2
  Solvable by an extension of the least-squares method or with statistical software/languages such as SPSS, STATA, SAS, S-Plus, or R
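A minimal sketch of the least-squares estimates above in plain Python; the years/salary numbers are made up for illustration:

```python
# Simple linear regression by the method of least squares.
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # hypothetical predictor x
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]   # hypothetical response y
w0, w1 = fit_line(years, salary)
print(f"y = {w0:.1f} + {w1:.1f} x")   # roughly y = 23.2 + 3.5 x for these data
```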
Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function.
A polynomial regression model can be transformed into a linear regression model. For example,
  y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
is convertible to linear form with the new variables x_2 = x^2 and x_3 = x^3:
  y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
Other functions, such as the power function, can also be transformed into a linear model.
Some models are intractably nonlinear (e.g., a sum of exponential terms); it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae.
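A sketch of this transformation, assuming NumPy is available; the synthetic data and coefficients are made up for illustration:

```python
# Turn the cubic model y = w0 + w1*x + w2*x^2 + w3*x^3 into a linear
# least-squares problem by introducing the columns x^2 and x^3.
import numpy as np

x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + np.random.normal(0, 0.1, x.size)

X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # design matrix
w, *_ = np.linalg.lstsq(X, y, rcond=None)               # ordinary least squares
print(np.round(w, 2))   # approximately [1.0, 2.0, -0.5, 0.3]
```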
Other Regression-Based Models
Generalized linear model:
Foundation on which linear regression can be applied to
modeling categorical response variables
Variance of y is a function of the mean value of y, not a
constant
Logistic regression: models the prob. of some event
occurring as a linear function of a set of predictor variables
Poisson regression: models the data that exhibit a Poisson
distribution
Log-linear models: (for categorical data)
Approximate discrete multidimensional prob. distributions
Also useful for data compression and smoothing
Regression trees and model trees
Trees to predict continuous values rather than class labels
Regression Trees and Model
Trees
Regression tree: proposed in CART system (Breiman et al.
1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the
training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model—a multivariate linear
equation for the predicted attribute
A more general case than regression tree
Regression and model trees tend to be more accurate than
linear regression when the data are not represented well by a
simple linear model
Predictive Modeling in Multidimensional
Databases
Predictive modeling: Predict data values or construct generalized
linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement, entropy
analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Prediction: Numerical Data (figure omitted)
Prediction: Categorical Data (figure omitted)
Classifier Accuracy Measures

Confusion matrix (rows: actual class, columns: predicted class):
        C1               C2
  C1    true positive    false negative
  C2    false positive   true negative

Accuracy of a classifier M, acc(M): the percentage of test-set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 - acc(M)
Given m classes, CM_{i,j}, an entry in the confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis):
  sensitivity = t-pos / pos        /* true positive recognition rate */
  specificity = t-neg / neg        /* true negative recognition rate */
  precision = t-pos / (t-pos + f-pos)
  accuracy = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)
This model can also be used for cost-benefit analysis.
Classifier Accuracy Measures: Example

  classes                buys_computer = yes   buys_computer = no   total    recognition (%)
  buys_computer = yes    6954                  46                   7000     99.34
  buys_computer = no     412                   2588                 3000     86.27
  total                  7366                  2634                 10000    95.52
Predictor Error Measures

Measure predictor accuracy: how far off the predicted value is from the actual known value
Loss function: measures the error between y_i and the predicted value y_i'
  Absolute error: |y_i - y_i'|
  Squared error: (y_i - y_i')^2
Test error (generalization error): the average loss over the test set
  Mean absolute error: \frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|
  Mean squared error: \frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2
  Relative absolute error: \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}
  Relative squared error: \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
The mean squared error exaggerates the presence of outliers.
The (square) root mean squared error and, similarly, the root relative squared error are popularly used.
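A sketch of these error measures for a handful of made-up predictions (the y values are illustrative only):

```python
# Predictor error measures: MAE, MSE, RMSE, relative absolute/squared error.
y = [10.0, 12.0, 15.0, 20.0]        # actual values
y_pred = [11.0, 11.5, 16.0, 18.0]   # predicted values
n = len(y)
y_bar = sum(y) / n

mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / n          # mean absolute error
mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / n        # mean squared error
rmse = mse ** 0.5                                             # root mean squared error
rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)
rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)
print(mae, mse, rmse, rae, rse)
```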
Evaluating the Accuracy of a
Classifier or Predictor (I)
Holdout method
Given data is randomly partitioned into two independent
sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the
accuracies obtained
Evaluating the Accuracy of a
Classifier or Predictor (II)
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive
subsets D_1, …, D_k, each of approximately equal size
At i-th iteration, use Di as test set and others as training
set
Leave-one-out: k folds where k = # of tuples, for small
sized data
Stratified cross-validation: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial
data
Evaluating the Accuracy of a
Classifier or Predictor (III)
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with
replacement
i.e., each time a tuple is selected, it is equally likely to
be selected again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap.
Evaluating the Accuracy of a Classifier or Predictor (IV)

Several bootstrap methods exist; a common one is the .632 bootstrap:
  Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368).
  Repeat the sampling procedure k times; the overall accuracy of the model is
    acc(M) = \sum_{i=1}^{k} \left(0.632 \times acc(M_i)_{test\ set} + 0.368 \times acc(M_i)_{train\ set}\right)
Summary (I)
Classification and prediction are two forms of data analysis that
can be used to extract models describing important data classes
or to predict future data trends.
Effective and scalable methods have been developed for decision
tree induction, naïve Bayesian classification, Bayesian belief
network, rule-based classifier, Backpropagation, Support Vector
Machine (SVM), associative classification, nearest neighbor
classifiers, and case-based reasoning, and other classification
methods such as genetic algorithms, rough set and fuzzy set
approaches.
Linear, nonlinear, and generalized linear models of regression can
be used for prediction. Many nonlinear problems can be
converted to linear problems by performing transformations on
the predictor variables. Regression trees and model trees are also
used for prediction.
Summary (II)
Stratified k-fold cross-validation is a recommended method for
accuracy estimation. Bagging and boosting can be used to
increase overall accuracy by learning and combining a series of
individual models.
Significance tests and ROC curves are useful for model selection
There have been numerous comparisons of the different
classification and prediction methods, and the matter remains a
research topic
No single method has been found to be superior over all others for
all data sets
Issues such as accuracy, training time, robustness, interpretability,
and scalability must be considered and can involve trade-offs,
further complicating the quest for an overall superior method.
References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules.
Future Generation Computer Systems, 13, 1997.
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University
Press, 1995.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth International Group, 1984.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from
partitioned data for scaling machine learning. KDD'95.
W. Cohen. Fast effective rule induction. ICML'95.
G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule
groups for gene expression data. SIGMOD'05.
A. J. Dobson. An Introduction to Generalized Linear Models. Chapman
and Hall, 1990.
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
References (2)
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley and
Sons, 2001
U. M. Fayyad. Branching on attribute values in decision tree generation.
AAAI’94.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. J. Computer and System Sciences, 1997.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast
decision tree construction of large datasets. VLDB’98.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision
Tree Construction. SIGMOD'99.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and
decision tree induction: Efficient classification in data mining. RIDE'97.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule. KDD'98.
W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on
Multiple Class-Association Rules, ICDM'01.
References (3)
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification
algorithms. Machine Learning, 2000.
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods
of Marketing Research, Blackwell Business, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for
data mining. EDBT'96.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
S. K. Murthy, Automatic Construction of Decision Trees from Data: A
Multi-Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-
389, 1998
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
References (4)
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates
building and pruning. VLDB’98.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for
data mining. VLDB’96.
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan
Kaufmann, 1990.
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison
Wesley, 2005.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets,
Machine Learning, and Expert Systems. Morgan Kaufman, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
X. Yin and J. Han. CPAR: Classification based on predictive association rules.
SDM'03
H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with
hierarchical clusters. KDD'03.